Bug #3030
Import, TMX, text lang is not used
| Status: | New | Start date: | 03/02/2021 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 80% |
| Category: | Import | Spent time: | - |
| Target version: | TXM 0.8.2 | | |
Description
Bug reported for TXM 0.8.1.202102250837
Languages are declared explicitly in TMX files; these codes should be used to select the appropriate language model for each text.
Instead, TXM uses the language selected in the import form. If the "guess" option is chosen, many errors occur because the language models for the guessed languages are not found.
The language selection should be disabled for this module.
Broken since TXM 0.8.0
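To illustrate the expected behaviour, here is a minimal Python sketch (not TXM's Groovy import code) that reads the language codes declared on the `<tuv>` elements of a TMX file and derives a model file name from each. The sample segments, the helper names `declared_languages` and `model_name`, and the `<lang>.par` naming rule are assumptions made for illustration; the header attributes are the ones shown in the console output of this ticket.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative TMX content; the segments are made up for this sketch.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="manual" creationtoolversion="0.0" datatype="plaintext"
          segtype="sentence" adminlang="en-us" srclang="RU" o-tmf="ORES"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="RU"><seg>Пример.</seg></tuv>
      <tuv xml:lang="EN"><seg>An example.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def declared_languages(tmx_text):
    """Return the language codes declared on <tuv> elements, in first-seen order."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    langs = []
    for tuv in root.iter("tuv"):
        # Prefer xml:lang; fall back to a plain "lang" attribute if present.
        code = tuv.get(XML_LANG) or tuv.get("lang")
        if code and code not in langs:
            langs.append(code)
    return langs


def model_name(lang_code):
    """Hypothetical naming rule: 'RU' -> 'ru.par', 'en-US' -> 'en.par'."""
    return lang_code.split("-")[0].lower() + ".par"


langs = declared_languages(SAMPLE_TMX)
models = {lang: model_name(lang) for lang in langs}
print(models)
```

With this approach the language chosen in the import form plays no role: each aligned text gets the model that matches its declared code.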
Associated revisions
fix TMX fr-pos annotation per lang refs #3030
History
#1 Updated by Alexey Lavrentev over 2 years ago
Console output from importing a Russian-English TMX corpus. Strangely, the guessed language is Hungarian in both cases.
Sauvegarde des paramètres d'importation…
Corpus "MEDPARCORP_RU0" supprimé(e).
Corpus "MEDPARCORP_EN1" supprimé(e).
Compiling tmx import module...
Démarrage du module d'import "tmx"...
-- IMPORTER - Reading source files
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-2.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
skip file : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/import.xml
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-5.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
Tokenizing 4 files ....
Building xml-tei-txm (4 files) ....
-- ANNOTATE - Running NLP tools 004 ..
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: /usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-5_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt (Aucun fichier ou dossier de ce type)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
at org.txm.objects.Project._compute(Project.java:413)
at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
.
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: /usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-2_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt (Aucun fichier ou dossier de ce type)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
at org.txm.objects.Project._compute(Project.java:413)
at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
.
langs : [korpus-2_0.xml:RU, korpus-2_1.xml:EN, korpus-5_0.xml:RU, korpus-5_1.xml:EN]
texts : [0:[korpus-2_0.xml, korpus-5_0.xml], 1:[korpus-2_1.xml, korpus-5_1.xml]]
-- COMPILING - Building Search Engine indexes
Using corpus ID: [0:RU0, 1:EN1]
....
P-attributes: [id, ref, xxpos, xxlemma]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
P-attributes: [id, ref]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_en1, medparcorp_ru0] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_EN1/medparcorp_ru0.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_ru0, medparcorp_en1] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_RU0/medparcorp_en1.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
-- EDITION - Building edition .. ..
Import terminé en 1.2 sec (1243 ms).
Import terminé.
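The console output shows that the import already knows the language of each split file (the `langs :` line) but still invokes TreeTagger with `hu.par`, which does not exist, so the later steps fail on a missing output file. A possible guard, sketched below in Python, would be to resolve the model from the declared language and to check that the file exists before tagging. The function `resolve_models`, the `<lang>.par` naming rule, and the temporary models directory are assumptions for illustration, not TXM's actual code.

```python
from pathlib import Path
import tempfile


def resolve_models(file_langs, models_dir):
    """Split files into (taggable, skipped) depending on whether '<lang>.par' exists."""
    models_dir = Path(models_dir)
    taggable, skipped = {}, {}
    for filename, lang in file_langs.items():
        model = models_dir / (lang.lower() + ".par")
        if model.is_file():
            taggable[filename] = model
        else:
            # Report the missing model instead of launching the tagger.
            skipped[filename] = model
    return taggable, skipped


# Per-file languages as printed on the "langs :" line of the console output.
file_langs = {
    "korpus-2_0.xml": "RU",
    "korpus-2_1.xml": "EN",
    "korpus-5_0.xml": "RU",
    "korpus-5_1.xml": "EN",
}

with tempfile.TemporaryDirectory() as tmp:
    # Pretend that only an English model is installed.
    (Path(tmp) / "en.par").touch()
    taggable, skipped = resolve_models(file_langs, tmp)

print("tag:", sorted(taggable))
print("skip (model missing):", sorted(skipped))
```

Skipping files with a clear message would avoid the follow-up `FileNotFoundException` and XML parse errors, which are only symptoms of the tagger never having produced its output.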
#2 Updated by Matthieu Decorde over 2 years ago
- Subject changed from Import, TMX, Language recognition for annotation not working to Import, TMX, text lang is not used
- Description updated (diff)
- Category set to Import
- Priority changed from Normal to High
#3 Updated by Matthieu Decorde about 2 years ago
- % Done changed from 0 to 80