Bug #299
RCP 0.7.2, Import, corpus compilation fails on 5M token TEI-BFM corpus
Status: | New | Start date: | 08/09/2013
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | Matthieu Decorde | % Done: | 0%
Category: | Import | Spent time: | -
Target version: | TXM X.X | |
Description
System: Windows XP
RAM: 3.25 GB
TXM.ini:
.txm
-vmargs
-Xms512m
-Xmx1024m
Corpus size: 144 texts / 5 500 000 tokens (source files available at [sherpa]/SpUV/bfm2013)
The import fails at the compilation stage (the files are created in the "data" folder, but they are all empty). The import log follows:
Loading import parameters from file: C:\Documents and Settings\alavrent\xml\bfm2013\import.xml
Saving import parameters...
Running script: C:\Documents and Settings\alavrent\TXM\scripts\import\bfmLoader.groovy
-- VALIDATION - checking XML source files well-formedness
-- IMPORTER - Reading source files
preparing 144 files for the tokenizer ................................................................................................................................................
Tokenizing 144 files ..............................................................................................................
Tokenizer unknown word chars in PsOrne: []
..................................
Validating XML of 144 files ................................................................................................................................................
Building xml-tei-txm 144 files ................................................................................................................................................
-- ANNOTATE - Running NLP tools - fro model
Building TT source files (144) from directory C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\txm\BFM2013 ................................................................................................................................................
Applying fro.par TreeTagger model on dir: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\treetagger (144 files) ................................................................................................................................................
Building stdoff files (144) from dir:C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\treetagger to C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\annotations ................................................................................................................................................
Injecting stdoff files (144) data from C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\annotations to xml-txm files of C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\txm\BFM2013 ................................................................................................................................................
-- COMPILING - Building Search Engine indexes
process 144 files ................................................................................................................................................
P-attributes: [id, q, sp, pb, lb, orig, sic, abbr, ref, pos, supplied, lang, nametype, fropos, frolemma]
S-attributes: [ab:1+id+rend+n+part+type+lang+subtype, abbr:0, add:0+status+place, anchor:0+id+n, back:0+n, body:0+n, byline:0, c:0+rend, caesura:0, cb:0+n, choice:0, corr:1+cert+rend+type+resp, damage:0+agent, date:0, div:2+id+rend+n+part+type+lang+corresp+subtype, docauthor:0, ex:0, expan:0, foreign:0+rend+lang+n, front:0+n, gap:0+unit+extent+reason+rend+quantity+resp, gloss:0+rend, head:0+id+rend+type+lang, hi:1+rend, item:0, lb:0+id+rend+n+type+ana+ed, list:0, milestone:0+id+unit+rend+n+ed, name:0+rend, num:0, orig:0, p:1+id+rend+n+lang, pb:0+id+rend+n+ed, q:2+id+rend+n+type+lang, quote:0, reg:0+resp, s:0+id+n, seg:0+rend+type+lang, sic:0+rend+resp, sp:0+rend+who+n+subtype, speaker:0+rend+lang, stage:0+type, subst:0+status, supplied:1+cert+source+reason+rend+resp, surplus:0+rend+resp, text:0+id+base+project+titre+genre+msdate+edsci+idbfm+relation+siecle+msnotbefore+notbefore+edville+forme+ednum+auteur+datecompo+datecompolibre+deaf+msnotafter+restrictions+notafter+domaine+dialecte+ssiecle+eddate+morphosynt+edcomm+msdatelibre, title:0+rend, titlepage:0, titlepart:0+type, trailer:0, txmcorpus:0+lang, unclear:0+reason+resp]
-- EDITION
Building editions: 144 files ................................................................................................................................................
Import finished: 1 h, 4 min and 30 sec (3870085 ms)
Error: the corpus will not be loaded: BFM2013
Parameters: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\import.xml
HTMLc: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\HTML\BFM2013 : true
REGISTRYc: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\registry\bfm2013 : false
DATAc: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\data\BFM2013 : true
Search engine started in memory mode.
Statistics engine started..connected.
Reloading views...
TXM ready.
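The end of the log reports REGISTRYc as false and DATAc as true, which matches the symptom described above: the data folder exists but its files are empty, and no registry file was written. As a minimal sketch for confirming this on disk, the following Python helper checks both conditions. The function name `check_compiled_corpus` is hypothetical, and the directory layout (`registry\<name in lower case>`, `data\<NAME IN UPPER CASE>`) is an assumption taken only from the paths shown in the log.

```python
from pathlib import Path


def check_compiled_corpus(corpus_dir, name):
    """Return (registry_exists, empty_data_files) for a compiled corpus.

    Assumed layout, inferred from the import log only:
      <corpus_dir>/registry/<name in lower case>   registry file
      <corpus_dir>/data/<NAME IN UPPER CASE>/      index files
    """
    corpus_dir = Path(corpus_dir)
    registry_file = corpus_dir / "registry" / name.lower()
    data_dir = corpus_dir / "data" / name.upper()

    empty_files = []
    if data_dir.is_dir():
        # Collect the names of zero-byte files: the failure mode reported here.
        empty_files = sorted(
            p.name for p in data_dir.iterdir()
            if p.is_file() and p.stat().st_size == 0
        )
    return registry_file.is_file(), empty_files
```

For the corpus in this report, it could be called as `check_compiled_corpus(r"C:\Documents and Settings\alavrent\TXM\corpora\bfm2013", "bfm2013")`; a result of `False` with a non-empty list would agree with the log.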
History
#1 Updated by Matthieu Decorde about 10 years ago
I suspect that the CWB tools fail on Windows for big corpora.
I'll need to use the WTC file that was produced to test this.
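One way to run such a test outside TXM would be to call `cwb-encode` directly on the WTC file with the attribute lists printed in the log. The sketch below only builds the argument list and does not run anything. The helper name `cwb_encode_command` and the file name in the usage note are hypothetical, and the exact options TXM itself passes to CWB are not known from this report, so this is an assumption rather than the actual import code. It uses only the standard `cwb-encode` options `-d` (data directory), `-f` (input file), `-R` (registry file), `-P` (positional attribute) and `-S` (structural attribute).

```python
def cwb_encode_command(wtc_file, data_dir, registry_file, p_attrs, s_attrs):
    """Build (but do not run) a cwb-encode argument list.

    p_attrs: positional attribute names, e.g. ["id", "pos"]
    s_attrs: structural attribute declarations in the form printed
             in the log, e.g. ["s:0+id+n"]
    """
    cmd = [
        "cwb-encode",
        "-d", str(data_dir),       # directory for the index files
        "-f", str(wtc_file),       # verticalized input file
        "-R", str(registry_file),  # registry file to create
    ]
    for attr in p_attrs:
        cmd += ["-P", attr]
    for attr in s_attrs:
        cmd += ["-S", attr]
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` with a hypothetical input such as `bfm2013.wtc`; if that call also leaves empty files in the data directory, the problem would lie in the CWB tools rather than in the TXM import script.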
#2 Updated by Matthieu Decorde about 10 years ago
- Target version set to TXM X.X