Bug #3042

Import: malformed XML produced when injecting TreeTagger annotations for some tokens

Added by Alexey Lavrentev about 1 month ago.

Status: New
Start date: 03/22/2021
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: -

Description

This bug occurs in a corpus of Russian tweets where smileys, hashtags, user mentions and URLs were pre-tagged with <w>.
The corpus is imported correctly when TreeTagger is not used.
When TreeTagger is used, the annotation files are created normally, but injecting the annotations into the XML-TXM files fails with a java.util.NoSuchElementException (END_DOCUMENT reached), and the resulting files are malformed.

To reproduce the bug, download the source files from ShareDocs (https://sharedocs.huma-num.fr/wl/?id=AJ63N3TK9AQv6NwhEOxVOok6s7a3WRVB) and run the TEI XML Zero import module with annotation enabled. The bug occurs in the same way whether "ru" or "en" is selected as the language.
If annotation is turned off, the import works fine.

Here is the console:

Sauvegarde des paramètres d'importation…
Corpus "TWITTER-SENTIMENT2-2" supprimé(e).
Compiling xtz import module...
Démarrage du module d'import "xtz"...
Import du corpus...
-- IMPORTER - Reading source files
-- Split-Merge XSL Step with /home/alavrent/Documents/Mes documents/NSU/D. Tupikina/Twitter_sentiment2 (2)/xsl/1-split-merge
-- Front XSL Step with the /home/alavrent/Documents/Mes documents/NSU/D. Tupikina/Twitter_sentiment2 (2)/xsl/2-front directory.
-- Checking XML-TEI files for well-formedness.
003 ...
-- Tokenizing 3 files
003 ...
-- Building XML-TXM (3 files)
003 ...
-- ANNOTATE - Running NLP tools
Unexpected error while parsing file file:/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_63.xml : java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
Location line: 11281 character: 123
java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:547)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.getNextAnaValue(AnnotationInjection.groovy:259)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.writeAnaTags(AnnotationInjection.groovy:296)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.processEndElement(AnnotationInjection.groovy:363)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:257)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:151)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:144)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.super$2$process(AnnotationInjection.groovy)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.process(AnnotationInjection.groovy:226)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:597)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.xtz.TTAnnotater.process(TTAnnotater.groovy:41)
    at org.txm.importer.xtz.ImportModule.start(ImportModule.java:189)
    at org.txm.scripts.importer.xtz.XTZImport.super$2$start(XTZImport.groovy)
    at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
    at org.txm.scripts.importer.xtz.XTZImport.start(XTZImport.groovy:105)
    at org.txm.importer.xtz.ImportModule.process(ImportModule.java:329)
    at org.txm.importer.xtz.ImportModule$process$3.call(Unknown Source)
    at org.txm.scripts.importer.xtz.xtzLoader.run(xtzLoader.groovy:58)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
003 .Unexpected error while parsing file file:/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_66.xml : java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
Location line: 11075 character: 120
java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:547)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.getNextAnaValue(AnnotationInjection.groovy:259)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.writeAnaTags(AnnotationInjection.groovy:296)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.processEndElement(AnnotationInjection.groovy:363)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:257)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:151)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:144)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.super$2$process(AnnotationInjection.groovy)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.process(AnnotationInjection.groovy:226)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:597)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.xtz.TTAnnotater.process(TTAnnotater.groovy:41)
    at org.txm.importer.xtz.ImportModule.start(ImportModule.java:189)
    at org.txm.scripts.importer.xtz.XTZImport.super$2$start(XTZImport.groovy)
    at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
    at org.txm.scripts.importer.xtz.XTZImport.start(XTZImport.groovy:105)
    at org.txm.importer.xtz.ImportModule.process(ImportModule.java:329)
    at org.txm.importer.xtz.ImportModule$process$3.call(Unknown Source)
    at org.txm.scripts.importer.xtz.xtzLoader.run(xtzLoader.groovy:58)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
.Unexpected error while parsing file file:/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_90.xml : java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
Location line: 11282 character: 124
java.util.NoSuchElementException: END_DOCUMENT reached: no more elements on the stream.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:547)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.getNextAnaValue(AnnotationInjection.groovy:259)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.writeAnaTags(AnnotationInjection.groovy:296)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.processEndElement(AnnotationInjection.groovy:363)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:257)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:151)
    at org.txm.importer.StaxIdentityParser.process(StaxIdentityParser.java:144)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.super$2$process(AnnotationInjection.groovy)
    at sun.reflect.GeneratedMethodAccessor69.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.txm.importer.scripts.xmltxm.AnnotationInjection.process(AnnotationInjection.groovy:226)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:597)
    at org.txm.importer.xmltxm.Annotate.run(Annotate.groovy:558)
    at org.txm.treetagger.core.TreeTaggerEngine.processFile(TreeTaggerEngine.java:104)
    at org.txm.annotation.core.AnnotationEngine.processDirectory(AnnotationEngine.java:74)
    at org.txm.treetagger.core.TreeTaggerEngine.processDirectory(TreeTaggerEngine.java:132)
    at org.txm.treetagger.core.TreeTaggerEngine$processDirectory.call(Unknown Source)
    at org.txm.scripts.importer.xtz.TTAnnotater.process(TTAnnotater.groovy:41)
    at org.txm.importer.xtz.ImportModule.start(ImportModule.java:189)
    at org.txm.scripts.importer.xtz.XTZImport.super$2$start(XTZImport.groovy)
    at sun.reflect.GeneratedMethodAccessor89.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:104)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:326)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1235)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:146)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:166)
    at org.txm.scripts.importer.xtz.XTZImport.start(XTZImport.groovy:105)
    at org.txm.importer.xtz.ImportModule.process(ImportModule.java:329)
    at org.txm.importer.xtz.ImportModule$process$3.call(Unknown Source)
    at org.txm.scripts.importer.xtz.xtzLoader.run(xtzLoader.groovy:58)
    at org.txm.groovy.core.GroovyScriptedImportEngine._build(GroovyScriptedImportEngine.java:129)
    at org.txm.core.engines.ScriptedImportEngine.build(ScriptedImportEngine.java:56)
    at org.txm.objects.Project._compute(Project.java:413)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2395)
    at org.txm.core.results.TXMResult.compute(TXMResult.java:2282)
    at org.txm.rcp.handlers.scripts.ExecuteImportScript$2.run(ExecuteImportScript.java:161)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:56)
.
-- COMPILING - Building Search Engine indexes
-- Scanning structures&properties to create for 3 texts...
003 .Error while processing positive_long_half2_smiley1_63 (file: /home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2/positive_long_half2_smiley1_63.xml) text XML-TXM file : null. Error: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11291,123]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11291,123]
Message: Les structures de document XML doivent commencer et se terminer dans la même entité.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
    at org.txm.importer.SAttributesListener.scanFile(SAttributesListener.java:232)
    at org.txm.importer.SAttributesListener$scanFile.call(Unknown Source)
    at org.txm.scripts.importer.xtz.XTZCompiler.doScanStep(XTZCompiler.groovy:121)
    at org.txm.scripts.importer.xtz.XTZCompiler._process(XTZCompiler.groovy:91)
    at org.txm.importer.xtz.Compiler.process(Compiler.java:66)
    at org.txm.importer.xtz.ImportModule$1.run(ImportModule.java:211)
Error while importing corpus during 'compiler' step, reason=not set.
Corpus "TWITTER-SENTIMENT2-2" supprimé(e).
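
The final parse error in the console (in English: "XML document structures must start and end within the same entity") suggests that the XML-TXM files were left truncated after the failed injection. As a rough diagnostic aid, here is a minimal Python sketch that lists the files in a directory that are not well-formed, with the position reported by the parser. The function name `find_malformed` is made up for this example and is not part of TXM.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def find_malformed(directory):
    """Return (file name, line, column, message) for each *.xml file that fails to parse."""
    problems = []
    for path in sorted(Path(directory).glob("*.xml")):
        try:
            ET.parse(str(path))
        except ET.ParseError as err:
            # ParseError.position is a (line, column) tuple
            line, column = err.position
            problems.append((path.name, line, column, str(err)))
    return problems
```

It could be pointed at the directory shown in the log, e.g. `find_malformed("/home/alavrent/TXM-0.8.1/corpora/TWITTER-SENTIMENT2-2/txm/TWITTER-SENTIMENT2-2")`, before the corpus is deleted at the end of the failed import.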

When the pre-tagged words whose type attribute is "old-smiley_br", "old_smiley" or "user", as well as the w elements inside ref (hyperlinks), are replaced with simple codes, the annotations are injected correctly (activate the front XSL transformation to test this).
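
To see how many tokens of each kind a source file contains, something like the following Python sketch may help. It counts w elements by their type attribute and counts the w elements nested inside ref. The exact markup of the source files is an assumption here (w elements carrying a type attribute, possibly in a namespace), and `count_pretagged_words` and the sample string are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_pretagged_words(xml_text):
    """Count <w> elements by 'type' attribute, and count <w> elements nested in <ref>."""
    root = ET.fromstring(xml_text)
    by_type = Counter()
    inside_ref = 0

    def walk(element, in_ref):
        nonlocal inside_ref
        name = local_name(element.tag)
        if name == "w":
            by_type[element.get("type", "(no type)")] += 1
            if in_ref:
                inside_ref += 1
        for child in element:
            walk(child, in_ref or name == "ref")

    walk(root, False)
    return by_type, inside_ref


# Hypothetical sample; the real source files may be structured differently.
sample = '<text><p><w type="user">@name</w> hello <ref><w>http://example.org</w></ref></p></text>'
by_type, inside_ref = count_pretagged_words(sample)
print(dict(by_type), inside_ref)
```

Comparing these counts between a file that fails and one that imports cleanly might help narrow down which kind of pre-tagged token triggers the error.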
