Bug #1505: TBX: X.X, TXT+CSV import created structures - Plateforme TXM - Forge du Centre Blaise Pascal

Updated by Serge Heiden more than 9 years ago

Currently the TXT+CSV import module (and the clipboard import module) creates an empty "lb" structure (milestone) for each line and creates a "p" structure for every 2 empty lines found.

This raw text structural interpretation scheme matches the various raw text types produced by frequently used tools: clipboard text produced by Select and Copy commands in web browsers or mail readers, and text produced by 'Save as text' commands in word processors.

But this scheme is too specific and misleading: it follows no standard or norm and doesn't always work, for example with the output of some word processors.
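
To make the current behaviour easier to discuss, here is a minimal Python sketch of the scheme described above. It is not the TXM import code: the function name @interpret_raw_text@ is hypothetical, and it assumes that a run of two consecutive empty lines closes the current "p" and that empty lines serve only as separators.

```python
def interpret_raw_text(raw_text):
    """Group lines into "p" structures, with one "lb" entry per non-empty line.

    Assumption (not confirmed by the ticket): the second empty line in a row
    closes the current "p". Each returned paragraph is a list of
    (line_number, line_text) pairs, one pair per "lb" milestone.
    """
    paragraphs = [[]]
    empty_run = 0
    for number, line in enumerate(raw_text.splitlines(), start=1):
        if line.strip() == "":
            empty_run += 1
            # Two empty lines in a row: start a new "p" if the current one has content.
            if empty_run == 2 and paragraphs[-1]:
                paragraphs.append([])
            continue
        empty_run = 0
        paragraphs[-1].append((number, line))
    # Drop a trailing empty paragraph, if any.
    return [p for p in paragraphs if p]


if __name__ == "__main__":
    for paragraph in interpret_raw_text("a\nb\n\n\nc"):
        print(paragraph)
```

With this reading, text that uses single empty lines between blocks ends up in one large "p", which may be one way the scheme fails on some inputs.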

h3. Solution 1

Don't create the "p" structures.

h3. Solution 2

With a "document format" import option: don't create the "lb" milestones, and create a "p" structure for each line read.
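
A rough Python sketch of what this option could produce is shown here. The function name @lines_to_paragraphs@ is hypothetical, and skipping empty lines is an assumption, since the ticket does not say how they should be handled.

```python
from xml.sax.saxutils import escape


def lines_to_paragraphs(raw_text):
    """Wrap each non-empty line in a <p> element, without any <lb/> milestone.

    Assumption: empty lines produce no "p" structure.
    """
    return [
        f"<p>{escape(line.strip())}</p>"  # escape &, < and > for XML output
        for line in raw_text.splitlines()
        if line.strip()
    ]


if __name__ == "__main__":
    for element in lines_to_paragraphs("Tom & Jerry\n\nsecond line"):
        print(element)
```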

h3. Solution 3

Each word must contain the "lbn" property, which is its line number in the TXT source file.

See ticket #1585.
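
The "lbn" assignment can be illustrated with a short Python sketch. The tokenizer below is a deliberately naive assumption (runs of word characters, or single punctuation marks) and not the TXM tokenizer, and the name @tokenize_with_lbn@ is hypothetical. The sample is the text used in this ticket's validation test.

```python
import re

# Naive token pattern (assumption): a run of word characters, or one punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize_with_lbn(raw_text):
    """Return one dict per word, carrying its 1-based source line number as "lbn"."""
    tokens = []
    for lbn, line in enumerate(raw_text.splitlines(), start=1):
        for match in TOKEN.finditer(line):
            tokens.append({"word": match.group(), "lbn": lbn})
    return tokens


SAMPLE = "this is a small test.\n\nWith some line breaks\n\nsometimes"

if __name__ == "__main__":
    tokens = tokenize_with_lbn(SAMPLE)
    print(len(tokens))                          # 11
    print(sorted({t["lbn"] for t in tokens}))   # [1, 3, 5]
```

Empty lines still advance the counter, so words keep the line number they have in the source file even though no word sits on lines 2 and 4.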

h3. Validation test

The clipboard import of <pre>this is a small test.

With some line breaks

sometimes</pre>

must give the following description:
<pre>
Description du corpus PRESSEPAPIER1

- pressepapier1
- mdecorde
- 2016-06-29
Statistiques Générales

Nombre de mots 11
Nombre de propriétés de mot 4
Nombre d'unités de structure 3

Propriétés des unités lexicales (max 20 valeurs)

- frlemma : this, is, avoir, small, test, ., With, some, line, break, sometimes, ...
- frpos : NOM, ADJ, VER:pres, SENT, NAM, ...
- lbn : 1, 3, 5, ...
- word : this, is, a, small, test, ., With, some, line, breaks, sometimes, ...

Propriétés des structures (max 20 valeurs)

- s
n (2) = 1, 2.
- text
id (1) = pressepapier1.
</pre>
