Bug #1577

RCP: 0.7.8, XTZ import, some XML elements not recognized as structures, etc.

Added by Alexey Lavrentev over 7 years ago. Updated over 7 years ago.

Status: New
Start date: 10/26/2015
Priority: Normal
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.7.8

Description

A) Unknown XML elements

Once the sources contain more than a certain number of distinct XML elements, some of them are no longer recognized as structures. They are instead treated as regular words and end up in the Lexicon, which should never happen. With a small corpus, typically a single file, the problem does not occur.

Here is the console log of the Compiler step of the XTZ import module:

-- Running CWB-encodes...
 Word properties: [id, enpos, enlemma, n, type]
 Structures: [argument:0+n, back:0+n, bibl:0+n, body:1+n, byline:0+n, cit:0+n, closer:0+n, corr:0+sic+n, date:0+n, dateline:0+n, div:0+type+n, docauthor:0+n, docdate:0+n, docimprint:0+n, doctitle:0+n, emph:0+rend+n, expan:0+abbr+n, floatingtext:0+n, foreign:0+lang+n, front:0+n, head:0+n, hi:0+n, l:0+rend+n, name:0+n, note:0+id+anchored+place+n, opener:0+n, p:0+n, pb:0+id+n, publisher:0+n, pubplace:0+n, q:0+n, ref:0+target+n, signed:0+n, term:0+n, text:0+id+base+project+genre+author+title+pubdate+lang, title:0+n, titlepage:0+n, titlepart:0+type+n, trailer:0+n, txmcorpus:0+lang]
Encoding 5 files...
......
-- Running CWB-makeall...
....

Note:
  • recursive structures (e.g. div) are not recognized
  • some structures (item, epigraph...) are not recognized at all (see the attached screen capture xtz-compiler-structures.png)

Source sample to reproduce the bug

A sample source directory to reproduce the bug: nanovwwp-xtz.zip

B) Threading: mixed output in console and possible inconsistency (see hypothesis)

Judging from the interleaved console output, some import steps appear to run in separate threads (at least the Compiler and Pager steps). This should not be the case for this experimental and pedagogical import module. It may also be related to bug A) (see hypothesis).

Hypothesis A

Bug A) and bug B) may be related:
  • The Compiler initial steps run as independent threads
  • Each Compiler initial step produces a list of the source structures to be declared to cwb-encode
  • Each of these lists is different (one per input file)
  • If cwb-encode is called with a sublist of structures rather than the union of the lists produced by all the independent threads (bug B), we can get bug A).
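
The hypothesis above can be illustrated with a minimal Python sketch. It is not TXM's Groovy code: the function names (`collect_structures`, `merge`, `undeclared`) and the sample data are hypothetical. It assumes that each per-file step returns a mapping from structure name to attribute names, and it shows that declaring a single file's list leaves elements from the other files undeclared, whereas the union covers them all.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-file results: structure name -> attribute names found in that file.
FILES = {
    "a.xml": {"div": {"type"}, "p": set(), "head": set()},
    "b.xml": {"div": {"type"}, "p": set(), "item": set()},
    "c.xml": {"p": set(), "epigraph": set(), "note": {"id", "place"}},
}

def collect_structures(filename):
    """Stand-in for one compiler initial step: return the structures of one file."""
    return FILES[filename]

def merge(per_file_results):
    """Union of structure names, and of attribute names per structure."""
    merged = {}
    for structures in per_file_results:
        for name, attrs in structures.items():
            merged.setdefault(name, set()).update(attrs)
    return merged

def undeclared(declared, per_file_results):
    """Structure names present in the sources but missing from the declared list."""
    return sorted(set(merge(per_file_results)) - set(declared))

with ThreadPoolExecutor() as pool:
    # map() returns results in input order, whichever thread finishes first.
    results = list(pool.map(collect_structures, sorted(FILES)))

union = merge(results)   # what the encoder should be given
partial = results[0]     # simulates passing a single thread's list to the encoder

print(undeclared(union, results))
print(undeclared(partial, results))
```

In this toy data, the structures missing from `partial` are exactly the ones that would be left undeclared, which matches the symptom described in A).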

MD: threading or not, I can't reproduce the bug with TXM 0.7.8 update (0.7.8.201510261549) with the corpus attached to the ticket.

Solution A

  • 1) write a spec for threading the import module steps that defines:
    • what the user should read in the console output to understand that everything is working fine
    • how to guarantee that the output of each step is compatible with the next steps (see hypothesis)
  • 2) write the code
  • 3) have the code debugged by someone else to verify bug A)
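
One possible shape for such a spec, as a minimal Python sketch rather than the actual TXM implementation: each threaded step returns its log lines instead of printing them, and the caller prints them grouped per file in a fixed order. The names `run_step` and `run_all` and the message format are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(filename):
    """Hypothetical import step: return its log lines instead of printing them."""
    return [f"-- Processing {filename}...", f"   done: {filename}"]

def run_all(filenames, workers=4):
    """Run the steps in parallel, then return log lines grouped per file, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the output is deterministic.
        logs = list(pool.map(run_step, filenames))
    return [line for file_log in logs for line in file_log]

console = run_all(["a.xml", "b.xml", "c.xml"])
print("\n".join(console))
```

The trade-off is that nothing is displayed until a step has finished, in exchange for console output that never interleaves.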

Hypothesis B

Threading is indeed disabled in this update (0.7.8.201510261549), but the Groovy launch script was not replaced, since a copy was already present in the "$TXM/scripts/import" directory from the previous alpha update. As a result, the old threaded version was still the one being run.

Solution B

Change the script update strategy to avoid using a deprecated import launcher script already present in the "$TXM/scripts/import" directory.
see #1548
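
As an illustration only (the strategy actually retained is tracked in #1548 and is not described in this ticket), here is a minimal Python sketch of one option: replace the installed launcher script whenever its content differs from the copy shipped with the update. The function names and file paths are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path

def digest(path):
    """SHA-256 of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def refresh_script(shipped, installed):
    """Copy `shipped` over `installed` unless both already have the same content.

    Returns True when the installed copy was created or replaced.
    """
    shipped, installed = Path(shipped), Path(installed)
    if installed.exists() and digest(installed) == digest(shipped):
        return False
    installed.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(shipped, installed)
    return True
```

Note that this approach would also overwrite a script the user has edited locally, so a real strategy would need to decide how to handle that case.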

xtz-compiler-structures.png (12.8 kB) Serge Heiden, 10/26/2015 06:36 pm

nanovwwp-xtz.zip (111.4 kB) Serge Heiden, 10/26/2015 06:59 pm

History

#1 Updated by Serge Heiden over 7 years ago

#2 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)

#3 Updated by Serge Heiden over 7 years ago

  • File deleted (Capture du 2015-10-26 182054.png)

#4 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)

#5 Updated by Serge Heiden over 7 years ago

#6 Updated by Serge Heiden over 7 years ago

  • Subject changed from RCP: 0.7.8, XTZ import, some XML elements not recognized as structures to RCP: 0.7.8, XTZ import, some XML elements not recognized as structures, etc.
  • Description updated (diff)

#7 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)
  • Category set to Import
  • Target version set to TXM 0.7.8

#8 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)

#9 Updated by Matthieu Decorde over 7 years ago

Threading or not, I can't reproduce the bug with the TXM 0.7.8 update (0.7.8.201510261549).

Threading is disabled in this update; it was enabled in the previous alpha update.
However, the Groovy launch script was not replaced, since it was already present in the "$TXM/scripts/import" directory.

#10 Updated by Alexey Lavrentev over 7 years ago

The test corpus is not sufficient to reproduce the bug. You need to add 4-5 more texts from the VWWP corpus for the problem to appear.

#11 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#12 Updated by Matthieu Decorde over 7 years ago

  • % Done changed from 0 to 80
