Bug #3030
Import, TMX, text lang is not used
Status: | Closed | Start date: | 02/03/2021 |
---|---|---|---|
Priority: | High | Due date: | |
Assignee: | - | % Done: | 100% |
Category: | Import | Spent time: | - |
Target version: | TXM 0.8.2 | | |
Description
Bug reported for TXM 0.8.1.202102250837
Languages are declared explicitly in TMX files; these codes should be used to select the appropriate language model.
TXM instead uses the language selected in the import form. If the "guess" option is chosen, many errors occur because the language models are not found.
The language selection should be disabled for this module.
Broken since TXM 0.8.0
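The expected behaviour can be illustrated with a minimal sketch. This is not TXM's implementation: the function names are hypothetical, the `<lang>.par` file naming is only inferred from the `hu.par` path in the console log below, and dropping the region subtag (`en-us` to `en`) is an assumption.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def normalize_lang(code):
    """Lowercase a language code and drop any region subtag (assumption)."""
    return code.strip().replace("_", "-").split("-")[0].lower()


def tmx_languages(tmx_text):
    """Return the distinct languages declared on <tuv> elements, in order."""
    root = ET.fromstring(tmx_text)
    langs = []
    for tuv in root.iter("tuv"):
        # Fall back to a plain "lang" attribute if xml:lang is absent.
        code = tuv.get(XML_LANG) or tuv.get("lang")
        if code:
            lang = normalize_lang(code)
            if lang not in langs:
                langs.append(lang)
    return langs


def model_path(models_dir, lang):
    """Build the model file path, assuming a <lang>.par naming scheme."""
    return str(PurePosixPath(models_dir) / f"{lang}.par")


# Tiny illustrative TMX document (not taken from the reported corpus).
SAMPLE = """<tmx version="1.4">
  <header creationtool="manual" creationtoolversion="0.0"
          datatype="plaintext" segtype="sentence"
          adminlang="en-us" srclang="RU" o-tmf="ORES"/>
  <body>
    <tu tuid="1">
      <tuv xml:lang="RU"><seg>Пример.</seg></tuv>
      <tuv xml:lang="EN"><seg>Example.</seg></tuv>
    </tu>
  </body>
</tmx>"""

for lang in tmx_languages(SAMPLE):
    print(lang, model_path("/opt/treetagger/models", lang))
```

With this approach each aligned text is annotated with the model matching its own declared language, so neither the import form setting nor language guessing is involved.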
Associated revisions
fix TMX fr-pos annotation per lang refs #3030
History
#1 Updated by Alexey Lavrentev over 4 years ago
Copy of the console when importing a Russian-English TMX corpus. It is very strange that in both cases the guessed language is Hungarian.
Sauvegarde des paramètres d'importation…
Corpus "MEDPARCORP_RU0" supprimé(e).
Corpus "MEDPARCORP_EN1" supprimé(e).
Compiling tmx import module...
Démarrage du module d'import "tmx"...
-- IMPORTER - Reading source files
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-2.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
skip file : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/import.xml
initialize writers for : /home/alavrent/Documents/Mes documents/NSU/kurs corpora 2020/Проекты корпусов/14. Medical Parallel Corpus/MedParCorp/korpus-5.tmx
add header : [creationtool:manual, creationtoolversion:0.0, datatype:plaintext, segtype:sentence, adminlang:en-us, srclang:RU, o-tmf:ORES]
Tokenizing 4 files ....
Building xml-tei-txm (4 files) ....
-- ANNOTATE - Running NLP tools 004 ..
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: /usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-5_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-5_0.xml-out.tt (Aucun fichier ou dossier de ce type)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
    at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
    at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
    at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
    at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
.
ERROR: Can't open for reading: /home/alavrent/Software/TreeTagger/models/hu.par
aborted.
Process exited abnormally with code 1 at Tuesday, 2 March 2021
Args: /usr/lib/TXM-0.8.1/../../../home/alavrent/.TXM-0.8.1/plugins/org.txm.treetagger.core.linux_1.0.0.202006301717/res/linux/bin/tree-tagger -token -lemma -sgml -no-unknown -cap-heuristics -quiet -eos-tag <s> /home/alavrent/Software/TreeTagger/models/hu.par /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/ptreetagger/korpus-2_0.xml-src.tt /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt
java.io.FileNotFoundException: /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/treetagger/korpus-2_0.xml-out.tt (Aucun fichier ou dossier de ce type)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.newReader(ResourceGroovyMethods.java:1793)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:318)
    at org.codehaus.groovy.runtime.ResourceGroovyMethods.eachLine(ResourceGroovyMethods.java:283)
    at org.codehaus.groovy.runtime.dgm$988.doMethodInvoke(Unknown Source)
    at org.txm.importer.xmltxm.CSV2W_ANA.writeBody(CSV2W_ANA.groovy:297)
    at org.txm.importer.xmltxm.CSV2W_ANA.process(CSV2W_ANA.groovy:131)
    at org.txm.importer.xmltxm.Annotate.writeStandoffFile(Annotate.groovy:282)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:592)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.tmx.tmxLoader.run(tmxLoader.groovy:91)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[19,27]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
.
langs : [korpus-2_0.xml:RU, korpus-2_1.xml:EN, korpus-5_0.xml:RU, korpus-5_1.xml:EN]
texts : [0:[korpus-2_0.xml, korpus-5_0.xml], 1:[korpus-2_1.xml, korpus-5_1.xml]]
-- COMPILING - Building Search Engine indexes
Using corpus ID: [0:RU0, 1:EN1] ....
P-attributes: [id, ref, xxpos, xxlemma]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
P-attributes: [id, ref]
S-attributes: [seg:0+id, text:0+id+base+project, tu:0+tuid, txmcorpus:0+id+lang]
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_en1, medparcorp_ru0] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_EN1/medparcorp_ru0.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
Writing align.out with 428 positions.
Encoding alignment for [medparcorp_ru0, medparcorp_en1] from file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/align.out
Writing file /home/alavrent/TXM-0.8.1/corpora/MEDPARCORP/data/MEDPARCORP_RU0/medparcorp_en1.alx ...
I skipped 0 0:1 alignments and 0 1:0 alignments.
-- EDITION - Building edition .. ..
Import terminé en 1.2 sec (1243 ms).
Import terminé.
#2 Updated by Matthieu Decorde over 4 years ago
- Subject changed from Import, TMX, Language recognition for annotation not working to Import, TMX, text lang is not used
- Description updated
- Category set to Import
- Priority changed from Normal to High
#3 Updated by Matthieu Decorde over 4 years ago
- % Done changed from 0 to 80
#4 Updated by Sebastien Jacquot over 1 year ago
- % Done changed from 80 to 100
#5 Updated by Sebastien Jacquot over 1 year ago
- Status changed from New to Closed