Bug #1444

RCP: 0.7.7 Ubuntu1404_64, XML-TMX import module broken

Added by Serge Heiden over 3 years ago. Updated almost 3 years ago.

Status: New
Start date: 08/25/2015
Priority: Immediate
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.7.8

Description

Import of the UNOSAMPLE demo corpus (source in ///SpUV/uno-sample) no longer completes. It fails with the following console messages:

Sauvegarde des paramètres d'importation...
 Tokenizer parametrized with whitespaces=[\p{Z}\p{C}]+
 Tokenizer parametrized with regPunct=[\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}]
 Tokenizer parametrized with punct_strong=[.!?]|\.\.|\.\.\.|…|\|
 Tokenizer parametrized with regElision=['‘’]
Execution du script : /home/sheiden/TXM/scripts/import/tmxLoader.groovy
-- IMPORTER - Reading source files
skip file : /home/sheiden/Corpus/src/uno-sample/import.xml
initialize writers for : /home/sheiden/Corpus/src/uno-sample/uncorpora_20090831-sample-b.tmx
build Writer : 0 en
build Writer : 1 ar
build Writer : 2 zh
build Writer : 3 fr
build Writer : 4 ru
build Writer : 5 es
add header : [creationtool:ORESAligner, creationtoolversion:1.0, datatype:plaintext, segtype:paragraph, adminlang:en-us, srclang:EN, o-tmf:ORES]
initialize writers for : /home/sheiden/Corpus/src/uno-sample/uncorpora_20090831-sample-a.tmx
build Writer : 0 en
build Writer : 1 ar
build Writer : 2 zh
build Writer : 3 fr
build Writer : 4 ru
build Writer : 5 es
add header : [creationtool:ORESAligner, creationtoolversion:1.0, datatype:plaintext, segtype:paragraph, adminlang:en-us, srclang:EN, o-tmf:ORES]
Tokenizing 12 files
............
Building xml-tei-txm (12 files)
............
-- ANNOTATE - Running NLP tools
TT with fr /home/sheiden/TXM/corpora/UNOSAMPLE/txm/uncorpora_20090831-sample-a_3.xml+/home/sheiden/TXM/corpora/UNOSAMPLE/annotations/uncorpora_20090831-sample-a_3.xml-STDOFF.xml > /home/sheiden/TXM/corpora/UNOSAMPLE/ptreetagger/uncorpora_20090831-sample-a_3.xml-src.tt > /home/sheiden/TXM/corpora/UNOSAMPLE/treetagger/uncorpora_20090831-sample-a_3.xml-out.tt
TT with en /home/sheiden/TXM/corpora/UNOSAMPLE/txm/uncorpora_20090831-sample-a_0.xml+/home/sheiden/TXM/corpora/UNOSAMPLE/annotations/uncorpora_20090831-sample-a_0.xml-STDOFF.xml > /home/sheiden/TXM/corpora/UNOSAMPLE/ptreetagger/uncorpora_20090831-sample-a_0.xml-src.tt > /home/sheiden/TXM/corpora/UNOSAMPLE/treetagger/uncorpora_20090831-sample-a_0.xml-out.tt
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/ar.par. Continue import process 
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/es.par. Continue import process 
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/ru.par. Continue import process 
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/ar.par. Continue import process 
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/ru.par. Continue import process 
TT with fr /home/sheiden/TXM/corpora/UNOSAMPLE/txm/uncorpora_20090831-sample-b_3.xml+/home/sheiden/TXM/corpora/UNOSAMPLE/annotations/uncorpora_20090831-sample-b_3.xml-STDOFF.xml > /home/sheiden/TXM/corpora/UNOSAMPLE/ptreetagger/uncorpora_20090831-sample-b_3.xml-src.tt > /home/sheiden/TXM/corpora/UNOSAMPLE/treetagger/uncorpora_20090831-sample-b_3.xml-out.tt
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/zh.par. Continue import process 
TT with en /home/sheiden/TXM/corpora/UNOSAMPLE/txm/uncorpora_20090831-sample-b_0.xml+/home/sheiden/TXM/corpora/UNOSAMPLE/annotations/uncorpora_20090831-sample-b_0.xml-STDOFF.xml > /home/sheiden/TXM/corpora/UNOSAMPLE/ptreetagger/uncorpora_20090831-sample-b_0.xml-src.tt > /home/sheiden/TXM/corpora/UNOSAMPLE/treetagger/uncorpora_20090831-sample-b_0.xml-out.tt
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/zh.par. Continue import process 
No Modelfile available for lang /home/sheiden/Software/TreeTagger/lib/es.par. Continue import process 
langs : [uncorpora_20090831-sample-b_0.xml:en, uncorpora_20090831-sample-b_1.xml:ar, uncorpora_20090831-sample-b_2.xml:zh, uncorpora_20090831-sample-b_3.xml:fr, uncorpora_20090831-sample-b_4.xml:ru, uncorpora_20090831-sample-b_5.xml:es, uncorpora_20090831-sample-a_0.xml:en, uncorpora_20090831-sample-a_1.xml:ar, uncorpora_20090831-sample-a_2.xml:zh, uncorpora_20090831-sample-a_3.xml:fr, uncorpora_20090831-sample-a_4.xml:ru, uncorpora_20090831-sample-a_5.xml:es]
texts : [0:[uncorpora_20090831-sample-b_0.xml, uncorpora_20090831-sample-a_0.xml], 1:[uncorpora_20090831-sample-b_1.xml, uncorpora_20090831-sample-a_1.xml], 2:[uncorpora_20090831-sample-b_2.xml, uncorpora_20090831-sample-a_2.xml], 3:[uncorpora_20090831-sample-b_3.xml, uncorpora_20090831-sample-a_3.xml], 4:[uncorpora_20090831-sample-b_4.xml, uncorpora_20090831-sample-a_4.xml], 5:[uncorpora_20090831-sample-b_5.xml, uncorpora_20090831-sample-a_5.xml]]
-- COMPILING - Building Search Engine indexes
Using corpus ID: [0:en0, 1:ar1, 2:zh2, 3:fr3, 4:ru4, 5:es5]
............
P-attributes: [id, ref]
S-attributes: [hi:0+type, seg:0+id, sub:0+type, text:0+id+base+project, tu:0+tuid+committee+session+vote+lead, txmcorpus:0+id+lang]
Usage error: invalid filename 'UNOSAMPLE_zh2' for registry entry.
Filename must not contain uppercase letters, '.' or '~'.
Error: The registry file was not created: /home/sheiden/TXM/corpora/UNOSAMPLE/registry/UNOSAMPLE_zh2. See https://groupes.renater.fr/wiki/txm-users/public/faq
Compiler failed

Importation terminée : 10 sec (10023 ms)
L'import n'a pas abouti.
Moteur de recherche lancé.
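
The failure in the log above comes from the registry filename check: 'UNOSAMPLE_zh2' is rejected because it contains uppercase letters. The following is a minimal sketch of that rule as stated in the error message, not the actual TXM or CWB code. The function names `is_valid_registry_name` and `to_registry_name` are hypothetical, and lowercasing the corpus ID is only an assumed way to obtain an acceptable filename.

```python
def is_valid_registry_name(name: str) -> bool:
    """Check a name against the rule quoted in the error message:
    no uppercase letters, no '.' and no '~'."""
    if not name:
        return False
    return not any(ch.isupper() or ch in ".~" for ch in name)


def to_registry_name(corpus_id: str) -> str:
    """Hypothetical normalisation (an assumption, not the TXM fix):
    lowercase the ID and replace '.' and '~' with '_'."""
    lowered = corpus_id.lower()
    return "".join("_" if ch in ".~" else ch for ch in lowered)


if __name__ == "__main__":
    for candidate in ("UNOSAMPLE_zh2", to_registry_name("UNOSAMPLE_zh2")):
        print(candidate, is_valid_registry_name(candidate))
```

With this sketch, the name from the log is rejected while its lowercased form passes the check.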

Validation test

Run the import on the UNO sample corpus: smb://ensldfs.ens-lyon.fr/services/Laboratoires/labo_ana_corpus/Projets/Textométrie/SpUV/uno-sample
  • The import should complete without error
  • The concordance query "la" :UNOSAMPLE_EN0 "the" on corpus UNOSAMPLE_FR3 should return XX results
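
The first check can be roughly automated by scanning the console output for the failure lines seen in this report. This is only a sketch under assumptions: the helper `failure_lines` and the marker list are hypothetical, taken from the messages in the log of this ticket, and other failures may print different text.

```python
from typing import List

# Markers copied from the console output in this report (assumed, not exhaustive).
FAILURE_MARKERS = ("Usage error:", "Error:", "Compiler failed")


def failure_lines(console_log: str) -> List[str]:
    """Return the log lines that contain one of the known failure markers."""
    return [
        line
        for line in console_log.splitlines()
        if any(marker in line for marker in FAILURE_MARKERS)
    ]


if __name__ == "__main__":
    sample = (
        "P-attributes: [id, ref]\n"
        "Usage error: invalid filename 'UNOSAMPLE_zh2' for registry entry.\n"
        "Compiler failed\n"
    )
    for line in failure_lines(sample):
        print(line)
```

An empty result suggests that none of the known failure messages were printed. It does not prove that the import produced a usable corpus, so the concordance check is still needed.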

History

#1 Updated by Serge Heiden over 3 years ago

  • Priority changed from Normal to Immediate

#2 Updated by Matthieu Decorde over 3 years ago

  • % Done changed from 0 to 80

#3 Updated by Matthieu Decorde about 3 years ago

  • Description updated

#4 Updated by Matthieu Decorde about 3 years ago

  • % Done changed from 80 to 60

Ubuntu 64-bit with the latest CWB binaries: cwb-align failed with a segmentation fault error

#5 Updated by Matthieu Decorde almost 3 years ago

  • % Done changed from 60 to 80
