Bug #299

RCP 0.7.2, Import, corpus compilation fails on 5M token TEI-BFM corpus

Added by Alexey Lavrentev about 10 years ago. Updated about 10 years ago.

Status: New
Start date: 08/09/2013
Priority: Normal
Due date:
Assignee: Matthieu Decorde
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

System: WinXP
RAM: 3.25 GB
TXM.ini:
.txm
-vmargs
-Xms512m
-Xmx1024m

Corpus size: 144 texts / 5 500 000 tokens (source files available at [sherpa]/SpUV/bfm2013)

The import fails at the compilation stage: the files are created in the "data" folder, but they are all empty. The import log follows:

Chargement des paramètres d'import depuis le fichier : C:\Documents and Settings\alavrent\xml\bfm2013\import.xml
Sauvegarde des paramètres d'importation...
Execution du script : C:\Documents and Settings\alavrent\TXM\scripts\import\bfmLoader.groovy
-- VALIDATION - checking XML source files well-formedness
-- IMPORTER - Reading source files
preparing 144 files for the tokenizer
................................................................................................................................................
Tokenizing 144 files
..............................................................................................................Tokenizer unknown word chars in PsOrne: [’]
..................................
Validating XML of 144 files
................................................................................................................................................
Building xml-tei-txm 144 files
................................................................................................................................................
-- ANNOTATE - Running NLP tools - fro model
Building TT source files (144) from directory C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\txm\BFM2013
................................................................................................................................................
Applying fro.par TreeTagger model on dir: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\treetagger (144 files)
................................................................................................................................................
Building stdoff files (144) from dir:C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\treetagger to C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\annotations
................................................................................................................................................
Injecting stdoff files (144) data from C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\annotations to xml-txm files of C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\txm\BFM2013
................................................................................................................................................
-- COMPILING - Building Search Engine indexes
process 144 files 
................................................................................................................................................
P-attributes: [id, q, sp, pb, lb, orig, sic, abbr, ref, pos, supplied, lang, nametype, fropos, frolemma]
S-attributes: [ab:1+id+rend+n+part+type+lang+subtype, abbr:0, add:0+status+place, anchor:0+id+n, back:0+n, body:0+n, byline:0, c:0+rend, caesura:0, cb:0+n, choice:0, corr:1+cert+rend+type+resp, damage:0+agent, date:0, div:2+id+rend+n+part+type+lang+corresp+subtype, docauthor:0, ex:0, expan:0, foreign:0+rend+lang+n, front:0+n, gap:0+unit+extent+reason+rend+quantity+resp, gloss:0+rend, head:0+id+rend+type+lang, hi:1+rend, item:0, lb:0+id+rend+n+type+ana+ed, list:0, milestone:0+id+unit+rend+n+ed, name:0+rend, num:0, orig:0, p:1+id+rend+n+lang, pb:0+id+rend+n+ed, q:2+id+rend+n+type+lang, quote:0, reg:0+resp, s:0+id+n, seg:0+rend+type+lang, sic:0+rend+resp, sp:0+rend+who+n+subtype, speaker:0+rend+lang, stage:0+type, subst:0+status, supplied:1+cert+source+reason+rend+resp, surplus:0+rend+resp, text:0+id+base+project+titre+genre+msdate+edsci+idbfm+relation+siecle+msnotbefore+notbefore+edville+forme+ednum+auteur+datecompo+datecompolibre+deaf+msnotafter+restrictions+notafter+domaine+dialecte+ssiecle+eddate+morphosynt+edcomm+msdatelibre, title:0+rend, titlepage:0, titlepart:0+type, trailer:0, txmcorpus:0+lang, unclear:0+reason+resp]
-- EDITION
Building editions: 144 files
................................................................................................................................................
Importation terminée : 1h, 4 min et 30 sec (3870085 ms)
Erreur : le corpus ne sera pas chargéBFM2013
Paramètres : C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\import.xml
HTMLc: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\HTML\BFM2013 : true
REGISTRYc: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\registry\bfm2013 : false
DATAc: C:\Documents and Settings\alavrent\TXM\corpora\bfm2013\data\BFM2013 : true
Moteur de recherche lancé en mode mémoire.
Moteur statistique lancé..connecté.
Rechargement des vues...
TXM prêt.
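
The three status lines near the end of the log (HTMLc, REGISTRYc, DATAc) report true for the HTML and data folders and false for the registry file, while the data files themselves are empty. A minimal sketch of a check that reproduces this diagnosis outside TXM is given below. It assumes that each flag simply reflects whether the path exists, and that the folder layout is the one shown in the log; the function name `check_corpus_dirs` is hypothetical and is not part of TXM.

```python
from pathlib import Path


def check_corpus_dirs(corpus_root, corpus_name):
    """Report which compilation outputs exist and which data files are empty.

    Assumed layout (taken from the paths in the import log):
      <corpus_root>/HTML/<NAME>, <corpus_root>/registry/<name>,
      <corpus_root>/data/<NAME>
    """
    root = Path(corpus_root)
    html_dir = root / "HTML" / corpus_name.upper()
    registry_file = root / "registry" / corpus_name.lower()
    data_dir = root / "data" / corpus_name.upper()

    # Collect the names of zero-byte files in the data folder, if it exists.
    empty_files = []
    if data_dir.is_dir():
        empty_files = sorted(
            p.name for p in data_dir.iterdir()
            if p.is_file() and p.stat().st_size == 0
        )

    return {
        "HTML": html_dir.exists(),
        "REGISTRY": registry_file.exists(),
        "DATA": data_dir.exists(),
        "empty_data_files": empty_files,
    }
```

Called with the corpus folder from the log and the name `bfm2013`, the failure described here would be expected to show `REGISTRY` as false and a non-empty `empty_data_files` list.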

History

#1 Updated by Matthieu Decorde about 10 years ago

I suspect the CWB tools fail on Windows for large corpora.
I'll need to use the WTC file produced by the import to test this.
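
A minimal sketch of how such a test could be prepared: it builds a `cwb-encode` command line for the WTC file from the P-attribute and S-attribute declarations printed in the import log. The `-d`, `-f`, `-R`, `-c`, `-P` and `-S` options are standard `cwb-encode` options, but the exact option set TXM passes is not shown in this ticket, so this is an assumption. The helper name and all paths are hypothetical.

```python
def build_cwb_encode_args(wtc_file, data_dir, registry_file,
                          p_attrs, s_attrs, charset=None):
    """Build the argument list for a manual cwb-encode run.

    p_attrs: positional attribute names, e.g. ["id", "pos"]
    s_attrs: structural attribute declarations in the form printed in the
             log, e.g. "text:0+id+base" or "p:1+id+rend+n+lang"
    charset: optional value for -c; left out when None
    """
    args = [
        "cwb-encode",
        "-d", str(data_dir),        # output folder for the index files
        "-f", str(wtc_file),        # input file in verticalized format
        "-R", str(registry_file),   # registry entry to create
    ]
    if charset is not None:
        args += ["-c", charset]
    for name in p_attrs:
        args += ["-P", name]
    for spec in s_attrs:
        args += ["-S", spec]
    return args
```

Passing the resulting list to `subprocess.run` and inspecting the return code and standard error would show whether the encoder itself fails on the 5.5M token input, independently of the TXM import script.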

#2 Updated by Matthieu Decorde about 10 years ago

  • Target version set to TXM X.X
