Support #691

GL: CWB and XML/W import fails

Added by Matthieu Decorde over 4 years ago. Updated about 4 years ago.

Status: New
Start date: 20/03/2014
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: Support

Description

GL - 05/03/2014

The corpus contained 66 million words. In TXM.ini, I had:
-Xms1024m
-Xmx2048m

The first attempt, an XML+CSV import, crashed as follows:

module UI parameters: {editions=true, parallel=false, xslt=true, pAttributes=true, encoding=false, sAttributes=true, queries=true, preBuild=true, tokenizer=true, lang=true}
Saving import parameters...
Script execution : /home/glux/TXM/scripts/import/xmlLoader.groovy
Trying to read import properties file: /home/glux/Corpora/Europarl/import.properties
new tFactory: net.sf.saxon.TransformerFactoryImpl@1440bc05
ApplyXsl2 from file: /home/glux/Documents/filtertei.xsl
new transformer: net.sf.saxon.Controller@54fb0060
-- Apply xsl /home/glux/Documents/filtertei.xsl with parameters: {}
..........................................................................................................................................
P-attributes: [id, n, type]
S-attributes: [body:0+n, chapter:0+id+n, front:0+n, p:0+n, speaker:1+id+AFFILIATION+NAME+LANGUAGE+name+ID+language+n, text:0+id+base+project+date+lang+legislature, txmcorpus:0+lang]
Starting process with command: /usr/lib/TXM/TXM/../cwb/bin/cwb-encode -d /home/glux/TXM/corpora/europarl/data/EUROPARL -f /home/glux/TXM/corpora/europarl/wtc/EUROPARL.wtc -R /home/glux/TXM/corpora/europarl/registry/europarl -c utf8 -xsB -P id -P n -P type -S body:0+n -S chapter:0+id+n -S front:0+n -S p:0+n -S speaker:1+id+AFFILIATION+NAME+LANGUAGE+name+ID+language+n -S text:0+id+base+project+date+lang+legislature -S txmcorpus:0+lang
Error: The registry file was not created: /home/glux/TXM/corpora/europarl/registry/europarl. See https://groupes.renater.fr/wiki/txm-users/public/faq
import process stopped

Import done: 19h, 58min and 29sec (71909856 ms)
The import process failed.
Starting NullSearchEngineServer: [/usr/lib/TXM/cwb/bin/cqpserver, -I, /usr/lib/TXM/cwb/cqpserver.init, -r, /home/glux/TXM/registry, -b, 1000000, -d, OFF, -P, 4877] ...
Running SearchEngine in memory mode.
Stopping process: CWB_ENCODE
Process stoped: CWB_ENCODE
Stopping process: CWB_MAKE_ALL

A second attempt, using the CWB import (with only a wtc file, since the registry was not created), gives this:
Load import parameters from file: /home/glux/Corpora/Europarl-CWB/import.xml
Params: BaseParameters [name=europarlcwb, date=Wed Jan 14 00:03:00 CET 5, author=glux, version=0.7, description=,
links={}, corpora={EUROPARLCWB=[corpus: null]},
root=[import: null], corporaElement=[corpora: null]]
module UI parameters: {editions=true, parallel=false, xslt=false, pAttributes=true, encoding=true, sAttributes=true, queries=true, preBuild=true, tokenizer=true, lang=true}
Saving import parameters...
Script execution : /home/glux/TXM/scripts/import/wtcLoader.groovy
-- COMPILING - Building Search Engine indexes
WARNING: No registry file in source directory
We'll use automatic positional attributes and structural attributes...
Error while running script: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[38197,2]
Message: The entity name must immediately follow the '&' in the entity reference.
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[38197,2]
Message: The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:592)
at org.txm.importer.cwb.BuildCwbEncodeArgs.process(BuildCwbEncodeArgs.java:99)
at org.txm.importer.cwb.BuildCwbEncodeArgs$process.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
at org.txm.importer.wtc.compiler.run(compiler.groovy:144)
at org.txm.importer.wtc.compiler$run.call(Unknown Source)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
at org.txm.importer.wtc.wtcLoader.run(wtcLoader.groovy:102)
at groovy.util.GroovyScriptEngine.run(GroovyScriptEngine.java:551)
at org.txm.rcpapplication.commands.ExecuteImportScript$1.run(ExecuteImportScript.java:198)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:53)
Starting NullSearchEngineServer: [/usr/lib/TXM/cwb/bin/cqpserver, -I, /usr/lib/TXM/cwb/cqpserver.init, -r, /home/glux/TXM/registry, -b, 1000000, -d, OFF, -P, 4877] ...
Running SearchEngine in memory mode.
Stopping process: CWB_ENCODE
Stopping process: CWB_MAKE_ALL

History

#1 Updated by Matthieu Decorde over 4 years ago

  1. cwb-encode needs to be re-run by hand to find out why the registry file was not created:
    /usr/lib/TXM/TXM/../cwb/bin/cwb-encode -d $HOME/TXM/corpora/europarl/data/EUROPARL -f $HOME/TXM/corpora/europarl/wtc/EUROPARL.wtc -R $HOME/TXM/corpora/europarl/registry/europarl -c utf8 -xsB -P id -P n -P type -S body:0+n -S chapter:0+id+n -S front:0+n -S p:0+n -S speaker:1+id+AFFILIATION+NAME+LANGUAGE+name+ID+language+n -S text:0+id+base+project+date+lang+legislature -S txmcorpus:0+lang
    
  2. During the CWB import, the WTC file is parsed as XML. However, its entities are not encoded, so a bare "&" raises a parse error and stops the import.
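
One possible workaround for the second point, given only as a sketch: escape every bare "&" in the WTC file before the CWB import parses it as XML. The names `find_bare_ampersands`, `escape_bare_ampersands` and `escape_file` and the regular expression are illustrative and are not part of TXM or CWB. The sketch assumes that the offending characters are literal "&" signs that do not start an entity reference, and that cwb-encode decodes entities back when called with the -x option (as in the command above); this should be checked against the CWB documentation before relying on it.

```python
import re

# Matches "&" that does not start a named, decimal, or hexadecimal XML reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def find_bare_ampersands(lines):
    """Return 1-based (row, column) positions of every bare "&" in lines."""
    hits = []
    for row, line in enumerate(lines, start=1):
        for match in BARE_AMP.finditer(line):
            hits.append((row, match.start() + 1))
    return hits


def escape_bare_ampersands(text):
    """Replace every bare "&" with "&amp;", leaving existing references alone."""
    return BARE_AMP.sub("&amp;", text)


def escape_file(src_path, dst_path, encoding="utf-8"):
    """Write an escaped copy of src_path to dst_path, one line at a time."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            dst.write(escape_bare_ampersands(line))
```

Running `find_bare_ampersands` over the lines of the WTC file gives positions that can be compared with the row reported in the ParseError, and `escape_file` writes a corrected copy without loading the whole file into memory, which matters for a corpus of this size.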

#2 Updated by Matthieu Decorde over 4 years ago

  • Subject changed from TBX: 0.7.5: CWB and XML/W import fails to GIANCARLO CWB and XML/W import fails
  • Target version set to Support

#3 Updated by Serge Heiden about 4 years ago

  • Subject changed from GIANCARLO CWB and XML/W import fails to GL: CWB and XML/W import fails
  • Description updated
