Bug #1505

Updated by Serge Heiden over 4 years ago

Currently the TXT+CSV import module (and the clipboard import module) creates an empty "lb" structure (milestone) for each line and creates a "p" structure each time 2 empty lines are found.

This raw text structural interpretation scheme matches the various raw text types produced by frequently used tools: clipboard text produced by Select and Copy commands in web browsers and mail readers, or by 'Save as text' commands in word processors.

But this scheme follows no standard or norm and doesn't always work, for example with the output of some word processors.
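For discussion, the current scheme can be sketched roughly as below. This is only an illustration, not the import module's code: the function name, the event list and the @blank_run@ parameter are hypothetical, and it assumes that a new "p" starts when a run of 2 consecutive empty lines is reached.

```python
def interpret_raw_text(text, blank_run=2):
    """Return a flat list of markup events for a raw text.

    Assumptions (not taken from the import module itself):
    - every source line, empty or not, ends with an empty "lb" milestone;
    - a new "p" starts when a run of `blank_run` consecutive empty lines
      is reached.
    """
    events = ["<p>"]
    empty_seen = 0
    for line in text.split("\n"):
        if line.strip() == "":
            empty_seen += 1
            if empty_seen == blank_run:
                # Close the current paragraph and open the next one.
                events += ["</p>", "<p>"]
        else:
            empty_seen = 0
            events.append(line)
        events.append("<lb/>")
    events.append("</p>")
    return events
```

With this reading, a text whose paragraphs are separated by a single empty line ends up as one "p", which is one way the scheme can fail on some inputs.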

h3. Solution 1

Don't create the "p" structures.

h3. Solution 2

With a "document format" import option: don't create the "lb" milestone, and create a "p" structure for each line read.
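A minimal sketch of how such an option could switch the behaviour. The @document_format@ flag and the function name are hypothetical, and the non-document branch only emits the "lb" milestones (it leaves out the current "p" detection). The sketch wraps every line read, including empty ones, which this ticket does not settle.

```python
def interpret_lines(text, document_format=False):
    """Return a flat list of markup events for a raw text.

    `document_format` is a hypothetical name for the proposed import option.
    """
    events = []
    for line in text.split("\n"):
        if document_format:
            # One "p" structure per line read, no "lb" milestone.
            events += ["<p>", line, "</p>"]
        else:
            # Default: the line followed by an empty "lb" milestone.
            events += [line, "<lb/>"]
    return events
```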

h3. Solution 3

Each word must contain the "lbn" property, which is its line number in the TXT source file.

See ticket #1585
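To make the intent concrete, here is a small sketch of attaching a 1-based line number to each word. The regular expression is a simplistic stand-in for the real tokenizer, and the function name and dictionary layout are hypothetical.

```python
import re

# Simplistic tokenizer (assumption): runs of word characters, or a single
# punctuation character.
TOKEN = re.compile(r"\w+|[^\w\s]")


def words_with_lbn(text):
    """Return one dict per word, with its form and its source line number."""
    words = []
    for number, line in enumerate(text.split("\n"), start=1):
        for form in TOKEN.findall(line):
            words.append({"word": form, "lbn": number})
    return words


sample = "this is a small test.\n\nWith some line breaks\n\nsometimes"
line_numbers = sorted({w["lbn"] for w in words_with_lbn(sample)})
```

For this sample, @line_numbers@ is @[1, 3, 5]@: empty lines still count, so "lbn" always refers to the position in the source file.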

h3. Validation test

The clipboard import of <pre>this is a small test.

With some line breaks

sometimes</pre>
must give the following description:
Description du corpus PRESSEPAPIER1

- pressepapier1
- mdecorde
- 2016-06-29
Statistiques Générales

Nombre de mots 11
Nombre de propriétés de mot 4
Nombre d'unités de structure 3

Propriétés des unités lexicales (max 20 valeurs)

- frlemma : this, is, avoir, small, test, ., With, some, line, break, sometimes, ...
- frpos : NOM, ADJ, VER:pres, SENT, NAM, ...
- lbn : 1, 3, 5, ...
- word : this, is, a, small, test, ., With, some, line, breaks, sometimes, ...

Propriétés des structures (max 20 valeurs)

- s
n (2) = 1, 2.
- text
id (1) = pressepapier1.