Bug #1577: RCP: 0.7.8, XTZ import, some XML elements not recognized as structures, etc. - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #1577

Updated by Serge Heiden almost 10 years ago

h3. A) Some XML elements are not recognized as structures

When a certain number of different XML elements is reached in the sources, some XML elements are no longer recognized as structures. Instead, they are treated as regular words and included in the Lexicon.

For a small corpus, typically with only one file, the problem doesn't occur.

Here is the console log of the Compiler step of the XTZ import module:
<pre>
-- Running CWB-encodes...
Word properties: [id, enpos, enlemma, n, type]
Structures: [argument:0+n, back:0+n, bibl:0+n, body:1+n, byline:0+n, cit:0+n, closer:0+n, corr:0+sic+n, date:0+n, dateline:0+n, div:0+type+n, docauthor:0+n, docdate:0+n, docimprint:0+n, doctitle:0+n, emph:0+rend+n, expan:0+abbr+n, floatingtext:0+n, foreign:0+lang+n, front:0+n, head:0+n, hi:0+n, l:0+rend+n, name:0+n, note:0+id+anchored+place+n, opener:0+n, p:0+n, pb:0+id+n, publisher:0+n, pubplace:0+n, q:0+n, ref:0+target+n, signed:0+n, term:0+n, text:0+id+base+project+genre+author+title+pubdate+lang, title:0+n, titlepage:0+n, titlepart:0+type+n, trailer:0+n, txmcorpus:0+lang]
Encoding 5 files...
......
-- Running CWB-makeall...
....
</pre>

Note:
* recursive structures (e.g. div) are not recognized
* some structures (item, epigraph...) are not recognized at all, cf. screen capture below:

!{width: 100%}xtz-compiler-structures.png!
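
To narrow down which elements are dropped, the "Structures: [...]" line of the log can be compared with the elements actually present in the sources. The following is a minimal sketch, not part of TXM: the function names are hypothetical, the log format is assumed to be exactly the one shown above, and element names are lowercased because the log appears to list them in lowercase. Elements that are intentionally not encoded as structures will also show up, so the result is only a list of candidates to check by hand.

```python
import re
import xml.etree.ElementTree as ET


def declared_structures(log_text):
    """Return the structure names found on the 'Structures: [...]' log line."""
    match = re.search(r"^Structures: \[(.*)\]\s*$", log_text, re.MULTILINE)
    if match is None:
        return set()
    # Each entry is assumed to look like 'name:depth+attr1+attr2'; keep the name.
    return {
        entry.split(":", 1)[0].strip()
        for entry in match.group(1).split(",")
        if entry.strip()
    }


def source_elements(xml_text):
    """Return the lowercased local names of all elements of one XML document."""
    root = ET.fromstring(xml_text)
    # Strip a possible '{namespace}' prefix from each tag.
    return {el.tag.rsplit("}", 1)[-1].lower() for el in root.iter()}


def missing_structures(log_text, xml_texts):
    """List element names present in the sources but absent from the log line."""
    found = set()
    for xml_text in xml_texts:
        found |= source_elements(xml_text)
    return sorted(found - declared_structures(log_text))
```

Running missing_structures on the Compiler log and on the content of every source file gives the elements to investigate first.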

h3. Source sample to reproduce the bug

A sample source directory to reproduce the bug: attachment:"nanovwwp-xtz.zip"

h3. B) Threading: mixed output in console and potential incoherence (see hypothesis C)

Some import steps seem to be threaded, judging from the mixed output in the console (at least the Compiler and Pager steps). This should not be the case for this experimental and pedagogical import module. It may also be related to bug A) (see Solution).

h3. Solution

A) find the bug

B) remove the threads
* 1) write a spec for threads in import module steps that defines what the user should read in the console output to understand that everything is working fine, and that guarantees that the output is compatible with the next steps (see hypothesis C below)
* 2) write the code
* 3) have the code debugged by someone else
* 4) publish
* 4) publish
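
One possible shape for point 1) is sketched below in Python, purely as an illustration (the import module itself is not written in Python, and process_file and run_step are hypothetical names). Each per-file task returns its log lines instead of printing them, and the step emits them in input order, so the console output is identical whether or not threads are used.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(name):
    """Hypothetical per-file task: returns its log lines instead of printing."""
    return [f"-- Processing {name}...", f"-- Done {name}"]


def run_step(files, threaded=False):
    """Run one import step over all files and return its console lines in order."""
    if threaded:
        with ThreadPoolExecutor() as pool:
            # Executor.map yields results in input order, whatever the
            # order in which the worker threads actually finish.
            results = list(pool.map(process_file, files))
    else:
        results = [process_file(name) for name in files]
    # Flatten the per-file line lists into one ordered console log.
    return [line for lines in results for line in lines]


if __name__ == "__main__":
    for line in run_step(["a.xml", "b.xml"], threaded=True):
        print(line)
```

With this design, "remove the threads" amounts to calling run_step with threaded=False, and the console output stays the same.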

C) bug A) and bug B) may be related.

Hypothesis:
* Compiler initial steps are independent threads
* Each compiler initial step produces a list of structures of the source to be processed for cwb-encode
* Each compiler initial step list of structures is different (one per input file)
* If the cwb-encode step is called with a sublist of structures which is not the union of all the independent threads' lists of structures, we can get the bug A).
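
The hypothesis can be illustrated with a small Python sketch (hypothetical function names, not the actual Compiler code). It assumes that each per-file list uses the "name:depth+attr1+attr2" form seen in the console log, and that the number after the colon is a nesting depth. Merging takes the union of names, the largest depth and the union of attributes; the order of merged attributes is assumed not to matter.

```python
def parse_decl(decl):
    """Split 'name:depth+attr1+attr2' into (name, depth, [attrs])."""
    name, _, rest = decl.partition(":")
    parts = rest.split("+") if rest else ["0"]
    return name, int(parts[0]), parts[1:]


def merge_declarations(per_file_lists):
    """Return the union of all per-file structure declarations, sorted by name."""
    merged = {}  # name -> (largest depth seen, attributes in first-seen order)
    for decls in per_file_lists:
        for decl in decls:
            name, depth, attrs = parse_decl(decl)
            old_depth, old_attrs = merged.get(name, (0, []))
            new_attrs = old_attrs + [a for a in attrs if a not in old_attrs]
            merged[name] = (max(old_depth, depth), new_attrs)
    return [
        name + ":" + "+".join([str(depth)] + attrs)
        for name, (depth, attrs) in sorted(merged.items())
    ]
```

If only the list produced for one file is passed on, any structure that occurs solely in the other files (for example item) is missing from the declaration, which would match the symptom described in A).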
