Bug #1505

TBX: X.X, TXT+CSV import created structures

Added by Matthieu Decorde about 4 years ago. Updated over 3 years ago.

Status: New
Start date: 09/17/2015
Priority: Normal
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.7.8

Description

Currently, the TXT+CSV import module (and the clipboard import module) creates an empty "lb" structure (a milestone) for each line, and creates a "p" structure every 2 empty lines found.

This structural interpretation of raw text matches the various raw text types produced by frequently used tools: clipboard text produced by the Select and Copy commands of web browsers and mail readers, or text produced by the 'Save as text' commands of word processors.

But this scheme does not follow any standard or norm and does not always work, for example with the output of some word processors.
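The current scheme can be sketched roughly as follows. This is an illustrative sketch only, not the TXM import code: the function names `split_paragraphs` and `to_xml` are hypothetical, and it assumes that "every 2 empty lines" means a run of at least two consecutive empty lines closes the current "p", and that "lb" milestones are emitted for non-empty lines only.

```python
from xml.sax.saxutils import escape


def split_paragraphs(text):
    """Group non-empty lines into paragraphs.

    Assumption (not confirmed by the ticket): a run of two or more
    consecutive empty lines starts a new paragraph; a single empty
    line does not.
    """
    paragraphs = [[]]
    empty_run = 0
    for line in text.splitlines():
        if line.strip() == "":
            empty_run += 1
            continue
        # Open a new paragraph only if the current one already has content.
        if empty_run >= 2 and paragraphs[-1]:
            paragraphs.append([])
        empty_run = 0
        paragraphs[-1].append(line)
    return [p for p in paragraphs if p]


def to_xml(paragraphs):
    """Serialize paragraphs as "p" elements, one "lb" milestone per line."""
    parts = []
    for paragraph in paragraphs:
        parts.append("<p>")
        for line in paragraph:
            parts.append(escape(line) + "<lb/>")
        parts.append("</p>")
    return "\n".join(parts)
```

Under these assumptions, text whose lines are separated by single empty lines ends up in one "p", which illustrates why the result depends heavily on how the producing tool spaces its output.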

Solution 1

Don't create the "p" structures.

Add to each word an "lbn" property holding its line number in the TXT source file.

See ticket #1585.
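A minimal sketch of Solution 1, to make the "lbn" property concrete. It is not the TXM tokenizer: the regular expression and the function name `words_with_lbn` are hypothetical stand-ins, and line numbers are assumed to be 1-based and to count empty lines.

```python
import re

# Simplistic tokenizer (assumption): runs of word characters,
# or single punctuation marks.
TOKEN = re.compile(r"\w+|[^\w\s]")


def words_with_lbn(text):
    """Return one dict per word, carrying its form and source line number."""
    words = []
    for lbn, line in enumerate(text.splitlines(), start=1):
        for form in TOKEN.findall(line):
            words.append({"word": form, "lbn": lbn})
    return words


# Sample from the validation test, assuming one empty line
# between each line of text.
sample = "this is a small test.\n\nWith some line breaks\n\nsometimes"
words = words_with_lbn(sample)
print(len(words))                         # 11
print(sorted({w["lbn"] for w in words}))  # [1, 3, 5]
```

No "p" structure is created here; the line information lives only on the words, so it can be queried without committing to any paragraph interpretation.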

Validation test

The clipboard import of

this is a small test.

With some line breaks

sometimes

must give the following description:

Description of corpus PRESSEPAPIER1

- pressepapier1
- mdecorde
- 2016-06-29

General statistics

Number of words 11
Number of word properties 4
Number of structure units 3

Lexical unit properties (max 20 values)

- frlemma : this, is, avoir, small, test, ., With, some, line, break, sometimes, ...
- frpos : NOM, ADJ, VER:pres, SENT, NAM, ...
- lbn : 1, 3, 5, ...
- word : this, is, a, small, test, ., With, some, line, breaks, sometimes, ...

Structure properties (max 20 values)

- s
  n (2) = 1, 2.
- text
  id (1) = pressepapier1.

History

#1 Updated by Matthieu Decorde about 4 years ago

  • % Done changed from 80 to 20

#2 Updated by Matthieu Decorde about 4 years ago

  • Description updated (diff)
  • % Done changed from 20 to 80

#3 Updated by Matthieu Decorde over 3 years ago

  • Description updated (diff)

#4 Updated by Matthieu Decorde over 3 years ago

  • Description updated (diff)

#5 Updated by Serge Heiden over 3 years ago

  • Description updated (diff)
