Bug #1577
Updated by Matthieu Decorde almost 10 years ago
h3. A) Unknown XML elements
When a certain number of different XML elements is reached in the sources, some XML elements are not recognized as structures anymore. They are recognized as regular words that are included in the Lexicon, which should never be the case. For a small corpus, typically with only one file, the problem doesn't occur.
Here is the console log of the Compiler step of the XTZ import module:
<pre>
-- Running CWB-encodes...
Word properties: [id, enpos, enlemma, n, type]
Structures: [argument:0+n, back:0+n, bibl:0+n, body:1+n, byline:0+n, cit:0+n, closer:0+n, corr:0+sic+n, date:0+n, dateline:0+n, div:0+type+n, docauthor:0+n, docdate:0+n, docimprint:0+n, doctitle:0+n, emph:0+rend+n, expan:0+abbr+n, floatingtext:0+n, foreign:0+lang+n, front:0+n, head:0+n, hi:0+n, l:0+rend+n, name:0+n, note:0+id+anchored+place+n, opener:0+n, p:0+n, pb:0+id+n, publisher:0+n, pubplace:0+n, q:0+n, ref:0+target+n, signed:0+n, term:0+n, text:0+id+base+project+genre+author+title+pubdate+lang, title:0+n, titlepage:0+n, titlepart:0+type+n, trailer:0+n, txmcorpus:0+lang]
Encoding 5 files...
......
-- Running CWB-makeall...
....
</pre>
Note:
* recursive structures (e.g. div) are not recognized
* some structures (item, epigraph...) are not recognized at all, cf. screen capture below:
!{width: 100%}xtz-compiler-structures.png!
*Source sample to reproduce the bug*
A sample source directory to reproduce the bug: attachment:"nanovwwp-xtz.zip"
h3. B) Threading: mixed output in console and potential incoherence (see hypothesis)
Some import steps seem to be threaded from what we can read in the mixed output of the console (at least Compiler and Pager steps). It should not be the case for this experimental and pedagogical import module. This may also be related to bug A) (see hypothesis).
h3. Hypothesis A
Bug A) and bug B) may be related:
* The initial Compiler steps run as independent threads
* Each initial Compiler step produces the list of structures found in its source, to be declared to cwb-encode
* These lists of structures differ from one another (one per input file)
* If the cwb-encode step is called with a sublist of structures that is not the union of all the independent threads' lists of structures (bug B), we can get bug A).
MD: threading or not, I can't reproduce the bug with TXM 0.7.8 update (0.7.8.201510261549) with the corpus attached to the ticket.
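To make Hypothesis A concrete, here is a minimal Python sketch (not the TXM code, which is Groovy/Java) of the merge that the hypothesis says must happen before cwb-encode is called. The function names @merge_structures@ and @format_declaration@, the per-file dictionaries, and the fixed @0@ recursion depth are assumptions made for illustration; only the @name:0+attr+attr@ shape is taken from the console log above.

```python
def merge_structures(per_file_structures):
    """Return the union of structure names and of their attribute sets.

    per_file_structures: iterable of dicts, one per input file, mapping a
    structure name to the set of attributes seen on it (hypothetical shape).
    """
    merged = {}
    for structures in per_file_structures:
        for name, attrs in structures.items():
            merged.setdefault(name, set()).update(attrs)
    return merged


def format_declaration(merged):
    """Render the merged structures like the 'Structures: [...]' log line.

    The recursion depth is hard-coded to 0 here, which is an assumption.
    """
    return ["+".join([f"{name}:0", *sorted(merged[name])]) for name in sorted(merged)]


# Two files whose structure lists differ, as in the hypothesis.
file_a = {"div": {"type", "n"}, "p": {"n"}}
file_b = {"div": {"n"}, "item": {"n"}}

merged = merge_structures([file_a, file_b])
declaration = format_declaration(merged)

# Structures that would be lost if only file_a's list were passed on.
missing_if_only_a = sorted(set(merged) - set(file_a))
```

If only one thread's list reaches cwb-encode, any structure in @missing_if_only_a@ is left undeclared, which would match the symptom described in bug A).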
h3. Solution A
* 1) write a spec for threading the import module steps that defines:
** what the user should read in the console output to understand that everything is working fine
** how to guarantee that the output is compatible with the next steps (see hypothesis)
* 2) write the code
* 3) have the code debugged by someone else to verify bug A)
h3. Hypothesis B
The threading is indeed disabled in this update (0.7.8.201510261549), but the Groovy launch script was not updated, since a copy was already present in the "$TXM/scripts/import" directory from the previous alpha update. So only the old threaded version was used.
h3. Solution B
Change the script update strategy to avoid using a deprecated import launcher script already present in the "$TXM/scripts/import" directory.
see #1548
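One possible update strategy for Solution B, sketched in Python under stated assumptions: at update time, compare the script shipped with the update against the copy already installed, and overwrite the installed copy when it is missing or differs. The function names @file_digest@ and @refresh_script@ and the paths passed to them are hypothetical; this is not how TXM currently handles script updates.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def refresh_script(bundled, installed):
    """Copy the bundled script over the installed one when needed.

    Returns True when a copy was made, False when the installed script
    already has the same content as the bundled one.
    """
    bundled, installed = Path(bundled), Path(installed)
    if installed.exists() and file_digest(installed) == file_digest(bundled):
        return False
    # Create the target directory if this is a first install.
    installed.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(bundled, installed)
    return True
```

Note that this sketch overwrites any local modification of the installed script; a real strategy would have to decide whether user-edited scripts should be backed up first.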