Bug #3030

Import, TMX, text lang is not used

Added by Alexey Lavrentev over 4 years ago. Updated over 1 year ago.

Status: Closed    Start date: 02/03/2021
Priority: High    Due date:
Assignee: -    % Done: 100%
Category: Import    Spent time: -
Target version: TXM 0.8.2

Description

Bug reported for TXM 0.8.1.202102250837

Languages are declared explicitly in TMX files; these codes should be used to select the appropriate language model.
TXM instead uses the language selected in the import form. If the "guess" option is chosen, many errors occur because the language models are not found.
Language selection should be disabled for this module.
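
To illustrate the expected behaviour, here is a minimal sketch (not TXM's actual implementation) that reads the language codes declared on the `<tuv>` elements of a TMX file and derives one TreeTagger model path per language. The function names `tuv_languages` and `model_path`, the sample document, and the `/opt/treetagger/models` directory are hypothetical. Lowercasing the code (`RU` to `ru`) is an assumption; the `<lang>.par` naming follows the `hu.par` path seen in the console output below.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tuv_languages(tmx_text):
    """Return the distinct <tuv> language codes, in order of first appearance."""
    root = ET.fromstring(tmx_text)
    seen = []
    for tuv in root.iter("tuv"):
        # Older TMX files may use a plain "lang" attribute instead of xml:lang.
        code = tuv.get(XML_LANG) or tuv.get("lang")
        if code:
            code = code.lower()  # assumption: model files use lowercase codes
            if code not in seen:
                seen.append(code)
    return seen


def model_path(models_dir, lang):
    """Build the model path for a language, assuming a <lang>.par naming scheme."""
    return str(PurePosixPath(models_dir) / f"{lang}.par")


# Hypothetical two-language sample, shaped like the corpus in this report.
SAMPLE = """<tmx version="1.4">
  <header srclang="RU" segtype="sentence" datatype="plaintext"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="RU"><seg>Пример.</seg></tuv>
      <tuv xml:lang="EN"><seg>An example.</seg></tuv>
    </tu>
  </body>
</tmx>"""

langs = tuv_languages(SAMPLE)
paths = [model_path("/opt/treetagger/models", lang) for lang in langs]
print(langs, paths)
```

With per-language codes available this way, neither the import-form language nor the "guess" option would be needed for TMX sources.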

Broken since TXM 0.8.0

Associated revisions

Revision 3051
Added by Matthieu Decorde over 4 years ago

fix TMX fr-pos annotation per lang refs #3030

History

#1 Updated by Alexey Lavrentev over 4 years ago

Console output from importing a Russian-English TMX corpus. Strangely, the guessed language is Hungarian in both cases.

Sauvegarde des paramètres d'importation…
Corpus "MEDPARCORP_RU0" supprimé(e).
Corpus "MEDPARCORP_EN1" supprimé(e).
Compiling tmx import module...
Démarrage du module d'import "tmx"...
-- IMPORTER - Reading source files
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-2.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
skip file : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/import.xml
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-5.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
Tokenizing 4 files
....
Building xml-tei-txm (4 files)
....
-- ANNOTATE - Running NLP tools
004 ..
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: 
/usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-5_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt 
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt (Aucun fichier ou dossier de ce type)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
    at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
    at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
    at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
    at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
.
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: 
/usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-2_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt 
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt (Aucun fichier ou dossier de ce type)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
    at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
    at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
    at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
    at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
.
langs : [korpus-2_0.xml:RU, korpus-2_1.xml:EN, korpus-5_0.xml:RU, korpus-5_1.xml:EN]
texts : [0:[korpus-2_0.xml, korpus-5_0.xml], 1:[korpus-2_1.xml, korpus-5_1.xml]]
-- COMPILING - Building Search Engine indexes
Using corpus ID: [0:RU0, 1:EN1]
....
P-attributes: [id, ref, xxpos, xxlemma]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
P-attributes: [id, ref]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_en1, medparcorp_ru0] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_EN1/medparcorp_ru0.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_ru0, medparcorp_en1] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_RU0/medparcorp_en1.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
-- EDITION - Building edition
..
..
Import terminé en 1.2 sec (1243 ms).
Import terminé.
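
The log above shows a cascade: the missing `hu.par` model makes `tree-tagger` abort, its output file is never written, and the importer then fails with a `FileNotFoundException` and an XML parse error. One possible guard, sketched here under assumptions and not taken from TXM's code, is to check that a model exists for each file's declared language before launching the tagger. The function name `plan_annotation` is hypothetical, as is the lowercase `<lang>.par` naming; the file-to-language mapping mirrors the `langs :` line in the log.

```python
import tempfile
from pathlib import Path


def plan_annotation(file_langs, models_dir):
    """Split files into those with an available model and those to skip."""
    models_dir = Path(models_dir)
    to_tag, skipped = {}, {}
    for name, lang in file_langs.items():
        # Assumption: models are stored as <lowercase lang>.par
        model = models_dir / f"{lang.lower()}.par"
        if model.is_file():
            to_tag[name] = model
        else:
            skipped[name] = lang
    return to_tag, skipped


# Demonstration with a temporary models directory holding only an English model.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "en.par").touch()
    file_langs = {"korpus-2_0.xml": "RU", "korpus-2_1.xml": "EN"}
    to_tag, skipped = plan_annotation(file_langs, tmp)
    tagged_names = sorted(to_tag)
    skipped_names = sorted(skipped)
    print("tag:", tagged_names, "skip:", skipped_names)
```

Files in `skipped` could then be reported once with a clear message instead of producing a stack trace per file.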

#2 Updated by Matthieu Decorde over 4 years ago

  • Subject changed from Import, TMX, Language recognition for annotation not working to Import, TMX, text lang is not used
  • Description updated (diff)
  • Category set to Import
  • Priority changed from Normal to High

#3 Updated by Matthieu Decorde over 4 years ago

  • % Done changed from 0 to 80

#4 Updated by Sebastien Jacquot over 1 year ago

  • % Done changed from 80 to 100

#5 Updated by Sebastien Jacquot over 1 year ago

  • Status changed from New to Closed
