Bug #3042
Import, malformed XML produced when injecting TreeTagger annotation of some tokens
Status: | New | Start date: | 03/22/2021 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % Done: | 0% |
Category: | Import | Spent time: | - |
Target version: | - | | |
Description
This bug occurs in a corpus of Russian tweets where smileys, hashtags, user mentions and URLs were pre-tagged with <w>.
The corpus is imported correctly when TreeTagger is not used.
When TreeTagger is used, the annotation files are created normally but an error occurs when injecting annotations into TEI TXM files.
To reproduce the bug, download the source files from Sharedocs [[https://sharedocs.huma-num.fr/wl/?id=AJ63N3TK9AQv6NwhEOxVOok6s7a3WRVB]] and run the TEI XML Zero import module with annotation enabled. The bug occurs in the same way with either the "ru" or the "en" language selected.
If annotation is turned off, the import works fine.
Here is the console:
Saving import parameters…
Corpus "TWITTER-SENTIMENT2-2" deleted.
Compiling xtz import module...
Starting import module "xtz"...
Importing corpus...
-- IMPORTER - Reading source files
-- Split-Merge XSL Step with /home/alavrent/Documents/Mes documents/NSU/D. Tupikina/Twitter_sentiment2 (2)/xsl/1-split-merge
-- Front XSL Step with the /home/alavrent/Documents/Mes documents/NSU/D. Tupikina/Twitter_sentiment2 (2)/xsl/2-front directory.
-- Checking XML-TEI files for well-formedness.
003 ...
-- Tokenizing 3 files
003 ...
-- Building XML-TXM (3 files)
003 ...
-- ANNOTATE - Running NLP tools
Unexpected error while parsing file file:/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_63.xml : java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
Location line: 11281 character: 123
java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:547)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.getNextAnaValue(AnnotationInjection.groovy:259)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.writeAnaTags(AnnotationInjection.groovy:296)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.processEndElement(AnnotationInjection.groovy:363)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:257)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:151)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:144)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.super$2$process(AnnotationInjection.groovy)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.process(AnnotationInjection.groovy:226)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:597)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.xtz.TTAnnotater.process(TTAnnotater.groovy:41)
    at org.txm.importer.xtz.ImportModule.start(ImportModule.java:189)
    at org.txm.scripts.importer.xtz.XTZImport.super$2$start(XTZImport.groovy)
    at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
    at org.txm.scripts.importer.xtz.XTZImport.start(XTZImport.groovy:105)
    at org.txm.importer.xtz.ImportModule.process(ImportModule.java:329)
    at org.txm.importer.xtz.ImportModule$process$3.call(Unknown Source)
    at org.txm.scripts.importer.xtz.xtzLoader.run(xtzLoader.groovy:58)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
003 .
Unexpected error while parsing file file:/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_66.xml : java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
Location line: 11075 character: 120
java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:547)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.getNextAnaValue(AnnotationInjection.groovy:259)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.writeAnaTags(AnnotationInjection.groovy:296)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.processEndElement(AnnotationInjection.groovy:363)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:257)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:151)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:144)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.super$2$process(AnnotationInjection.groovy)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.process(AnnotationInjection.groovy:226)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:597)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.xtz.TTAnnotater.process(TTAnnotater.groovy:41)
    at org.txm.importer.xtz.ImportModule.start(ImportModule.java:189)
    at org.txm.scripts.importer.xtz.XTZImport.super$2$start(XTZImport.groovy)
    at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
    at org.txm.scripts.importer.xtz.XTZImport.start(XTZImport.groovy:105)
    at org.txm.importer.xtz.ImportModule.process(ImportModule.java:329)
    at org.txm.importer.xtz.ImportModule$process$3.call(Unknown Source)
    at org.txm.scripts.importer.xtz.xtzLoader.run(xtzLoader.groovy:58)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
.
Unexpected error while parsing file file:/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_90.xml : java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
Location line: 11282 character: 124
java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:547)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.getNextAnaValue(AnnotationInjection.groovy:259)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.writeAnaTags(AnnotationInjection.groovy:296)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.processEndElement(AnnotationInjection.groovy:363)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:257)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:151)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:144)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.super$2$process(AnnotationInjection.groovy)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.process(AnnotationInjection.groovy:226)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:597)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.xtz.TTAnnotater.process(TTAnnotater.groovy:41)
    at org.txm.importer.xtz.ImportModule.start(ImportModule.java:189)
    at org.txm.scripts.importer.xtz.XTZImport.super$2$start(XTZImport.groovy)
    at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
    at org.txm.scripts.importer.xtz.XTZImport.start(XTZImport.groovy:105)
    at org.txm.importer.xtz.ImportModule.process(ImportModule.java:329)
    at org.txm.importer.xtz.ImportModule$process$3.call(Unknown Source)
    at org.txm.scripts.importer.xtz.xtzLoader.run(xtzLoader.groovy:58)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
.
-- COMPILING - Building Search Engine indexes
-- Scanning structures&properties to create for 3 texts...
003 .
Error while processing positive_long_half2_smiley1_63 (file: /home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_63.xml) text XML-TXM file : null.
Error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11291,123]
Message: XML document structures must start and end within the same entity.
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11291,123]
Message: XML document structures must start and end within the same entity.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
    at org.txm.importer.SAttributesListener.scanFile(SAttributesListener.java:232)
    at org.txm.importer.SAttributesListener$scanFile.call(Unknown Source)
    at org.txm.scripts.importer.xtz.XTZCompiler.doScanStep(XTZCompiler.groovy:121)
    at org.txm.scripts.importer.xtz.XTZCompiler._process(XTZCompiler.groovy:91)
    at org.txm.importer.xtz.Compiler.process(Compiler.java:66)
    at org.txm.importer.xtz.ImportModule$1.run(ImportModule.java:211)
Error while importing corpus during 'compiler' step, reason=not set.
Corpus "TWITTER-SENTIMENT2-2" deleted.
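To see which XML-TXM files were left malformed after the annotation step, a well-formedness check run outside TXM may help. The following is a minimal Python sketch and not part of TXM: the function names `check_well_formed` and `report` are illustrative, and the directory to scan is assumed to be the `txm/<CORPUS>` folder shown in the log above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_well_formed(path):
    """Return None if the file parses, else a (line, column, message) tuple."""
    try:
        with open(path, "rb") as handle:
            # iterparse streams the file, so large files are not held in memory
            for _event, elem in ET.iterparse(handle):
                elem.clear()
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


def report(directory):
    """Check every .xml file in a directory and collect the failures by file name."""
    failures = {}
    for xml_file in sorted(Path(directory).glob("*.xml")):
        problem = check_well_formed(xml_file)
        if problem is not None:
            failures[xml_file.name] = problem
    return failures
```

Calling `report(...)` on the corpus `txm` directory returns a dictionary keyed by file name, which can be compared with the row and column positions reported in the console.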
When the pre-tagged words whose type attribute is "old-smiley_br", "old_smiley" or "user", as well as the <w> elements inside <ref> (hyperlinks), are replaced with simple codes, the annotation is injected correctly (activate the front XSL transformation to test this).
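The workaround in the report is done by the front XSL transformation shipped with the source files. For readers who want to see the idea in isolation, here is a rough Python sketch of the same kind of replacement. It is an illustration under assumptions, not the stylesheet itself: the code strings in `CODES` and `URL_CODE` are hypothetical, and the exact shape of the pre-tagged elements in the corpus (for example `<w type="user">…</w>`) is assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical plain-text codes for the type values named in the report.
CODES = {
    "old-smiley_br": "SMILEYBR",
    "old_smiley": "SMILEY",
    "user": "USER",
}
URL_CODE = "URL"  # hypothetical code for a <w> found inside <ref>


def local(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def code_for(parent, child):
    """Return the replacement code for a pre-tagged <w>, or None to keep it."""
    if local(child.tag) != "w":
        return None
    if local(parent.tag) == "ref":
        return URL_CODE
    return CODES.get(child.get("type"))


def replace_pretagged(root):
    """Replace the matching <w> elements with plain-text codes, in place."""
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            code = code_for(parent, child)
            if code is None:
                previous = child
                continue
            # The code and the text that followed the element must be
            # attached to whatever now precedes that position.
            text = code + (child.tail or "")
            if previous is None:
                parent.text = (parent.text or "") + text
            else:
                previous.tail = (previous.tail or "") + text
            parent.remove(child)
    return root
```

Ordinary `<w>` elements without one of the listed type values are left untouched, so only the tokens identified in the report are affected.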