Bug #3030

Import, TMX, text lang is not used

Added by Alexey Lavrentev about 1 month ago. Updated about 1 month ago.

Status: New
Start date: 03/02/2021
Priority: High
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.8.2

Description

Bug reported for TXM 0.8.1.202102250837

Languages are declared explicitly in TMX files; these codes should be used to select the appropriate language model.
Instead, TXM uses the language selected in the import form. If the "guess" option is chosen, many errors occur because the language models for the guessed languages are not found.
The language selection should be disabled for this module.

Broken since TXM 0.8.0
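
To illustrate the behaviour requested above, here is a minimal sketch in Python of reading the language declared on each translation unit variant and deriving a model name from it. It is not TXM code: the function names, the sample document, and the "<lang>.par" naming rule (inferred from the hu.par path in the console output) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def normalize_lang(code):
    """Reduce a TMX language code such as 'RU' or 'en-US' to 'ru' / 'en'."""
    return code.replace("_", "-").split("-")[0].lower()


def segments_by_lang(tmx_text):
    """Group segment texts by the language declared on each <tuv>."""
    root = ET.fromstring(tmx_text)
    groups = defaultdict(list)
    for tuv in root.iter("tuv"):
        # Assumption: fall back to a plain "lang" attribute if xml:lang is absent.
        code = tuv.get(XML_LANG) or tuv.get("lang")
        seg = tuv.find("seg")
        if code is None or seg is None:
            continue
        groups[normalize_lang(code)].append("".join(seg.itertext()))
    return dict(groups)


def model_name(lang):
    """Hypothetical naming rule: one '<lang>.par' model file per language."""
    return lang + ".par"


# Invented sample, not taken from the reported corpus.
SAMPLE = """<tmx version="1.4">
  <header srclang="RU" segtype="sentence"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="RU"><seg>Пример.</seg></tuv>
      <tuv xml:lang="EN"><seg>An example.</seg></tuv>
    </tu>
  </body>
</tmx>"""

groups = segments_by_lang(SAMPLE)
for lang, segs in groups.items():
    print(lang, model_name(lang), segs)
```

With this approach the language chosen in the import form is never consulted, so the "guess" option could not redirect the annotation to an unrelated model.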

Associated revisions

Revision 3051
Added by Matthieu Decorde about 1 month ago

fix TMX fr-pos annotation per lang refs #3030

History

#1 Updated by Alexey Lavrentev about 1 month ago

Console output from importing a Russian-English TMX corpus. Strangely, the guessed language is Hungarian in both cases (TreeTagger looks for hu.par).

Saving import parameters…
Corpus "MEDPARCORP_RU0" deleted.
Corpus "MEDPARCORP_EN1" deleted.
Compiling tmx import module...
Starting import module "tmx"...
-- IMPORTER - Reading source files
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-2.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
skip file : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/import.xml
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-5.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
Tokenizing 4 files
....
Building xml-tei-txm (4 files)
....
-- ANNOTATE - Running NLP tools
004 ..
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: 
/usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-5_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt 
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
    at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
    at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
    at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
    at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: XML document structures must start and end within the same entity.
.
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: 
/usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-2_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt 
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
    at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
    at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
    at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
    at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: XML document structures must start and end within the same entity.
.
langs : [korpus-2_0.xml:RU, korpus-2_1.xml:EN, korpus-5_0.xml:RU, korpus-5_1.xml:EN]
texts : [0:[korpus-2_0.xml, korpus-5_0.xml], 1:[korpus-2_1.xml, korpus-5_1.xml]]
-- COMPILING - Building Search Engine indexes
Using corpus ID: [0:RU0, 1:EN1]
....
P-attributes: [id, ref, xxpos, xxlemma]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
P-attributes: [id, ref]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_en1, medparcorp_ru0] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_EN1/medparcorp_ru0.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_ru0, medparcorp_en1] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_RU0/medparcorp_en1.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
-- EDITION - Building edition
..
..
Import done in 1.2 sec (1243 ms).
Import done.
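
In the log above, the missing hu.par model only surfaces as a chain of follow-up errors (FileNotFoundException, then XMLStreamException), and the import still reports success. As a rough sketch of a pre-flight check that would report the problem once, before the tagger is launched, the Python below takes a mapping shaped like the "langs" line of the console and lists the files whose model is absent. The function name and the "<lang>.par" layout of the models directory are assumptions, not TXM's actual behaviour.

```python
import os
import tempfile


def missing_models(langs, models_dir):
    """Return {lang: [files]} for languages whose '<lang>.par' file is absent."""
    missing = {}
    for filename, code in sorted(langs.items()):
        lang = code.lower()
        path = os.path.join(models_dir, lang + ".par")
        if not os.path.isfile(path):
            missing.setdefault(lang, []).append(filename)
    return missing


# Same shape as the "langs" line printed in the console.
langs = {
    "korpus-2_0.xml": "RU",
    "korpus-2_1.xml": "EN",
    "korpus-5_0.xml": "RU",
    "korpus-5_1.xml": "EN",
}

with tempfile.TemporaryDirectory() as models_dir:
    # Simulate a models directory that only contains an English model.
    open(os.path.join(models_dir, "en.par"), "w").close()
    report = missing_models(langs, models_dir)

print(report)
```

An empty result would mean every declared language has a model; a non-empty one could be shown to the user as a single warning naming the affected files.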

#2 Updated by Matthieu Decorde about 1 month ago

  • Subject changed from Import, TMX, Language recognition for annotation not working to Import, TMX, text lang is not used
  • Description updated
  • Category set to Import
  • Priority changed from Normal to High

#3 Updated by Matthieu Decorde about 1 month ago

  • % Done changed from 0 to 80
