Bug #1313
RCP: 0.7.7, Impossible to import a corpus with some Unicode characters
Status: | Closed | Start date: | 30/04/2015
---|---|---|---
Priority: | Urgent | Due date: |
Assignee: | - | % Done: | 100%
Category: | Import | Spent time: | -
Target version: | TXM 0.8.3 | |
Description
It is impossible to import text containing Unicode characters in some ranges.
The following ranges have been reported to cause an error and interrupt the import process:
- Cuneiform characters range (Akkadian corpus)
- Smiley range (SMS corpus)
- Classical Chinese characters range (Taiwan corpus)
Diagnosis
For some Unicode characters outside the first Unicode plane (the BMP), some TXM code has been observed to write the character incorrectly: the two values of its UTF-16 surrogate pair are written instead of the single multi-byte character.
In some contexts, the characters that follow it are then wrongly consumed by the character resolution scanner, which mangles the input character stream.
Hypothesis
Some TXM code uses the 'char' type to handle input characters, but a 'char' is a single 16-bit UTF-16 code unit and cannot hold a character whose code point is above U+FFFF. In Java's UTF-16 encoded Strings such a character is represented by a surrogate pair, that is, two 'char' values. See https://stackoverflow.com/questions/1029897/comparing-a-char-to-a-code-point.
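The limitation can be observed directly in Java. The following is a minimal sketch, not TXM code: the class name `SurrogatePairDemo` and its helper `utf16Units` are hypothetical, and U+267DB is used as the sample character because it is the one reported for the Taiwan corpus.

```java
// Minimal sketch (not TXM code): shows why a single 'char' cannot hold a
// character above U+FFFF. The class and method names are hypothetical.
final class SurrogatePairDemo {

    /** Returns the UTF-16 code units of s as space-separated hex values. */
    static String utf16Units(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the one-character string for code point U+267DB.
        String s = new String(Character.toChars(0x267DB));

        System.out.println(s.length());                      // number of 'char' units
        System.out.println(s.codePointCount(0, s.length())); // number of characters
        System.out.println(utf16Units(s));                   // the surrogate values
        System.out.println(Character.isHighSurrogate(s.charAt(0)));
        System.out.println(Integer.toHexString(s.codePointAt(0)));
    }
}
```

`charAt(0)` only returns the first half of the pair, so code that compares or classifies that `char` never sees the real character, whereas `codePointAt(0)` returns the full code point as an `int`.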
Solution
Replace every use of the 'char' type in TXM code with the 'int' type (a full Unicode code point) when handling characters.
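As an illustration of the replacement described above, here is a minimal sketch of the same loop written with 'char' and with 'int' code points. It is not taken from the TXM sources: the class `CodePointLoop` and both method names are hypothetical, and counting letters merely stands in for whatever per-character test the real code performs.

```java
// Minimal sketch (hypothetical names, not TXM code): the same letter count
// written with 'char' and with 'int' code points.
final class CodePointLoop {

    /** char-based loop: each half of a surrogate pair is tested on its own. */
    static int countLettersByChar(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                n++;
            }
        }
        return n;
    }

    /** int-based loop: a surrogate pair is read as one code point. */
    static int countLettersByCodePoint(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetter(cp)) {
                n++;
            }
            i += Character.charCount(cp); // advance by 1 or 2 'char' units
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "a" + new String(Character.toChars(0x267DB)) + "b";
        System.out.println(countLettersByChar(s));
        System.out.println(countLettersByCodePoint(s));
    }
}
```

The char-based version misses the ideograph because neither surrogate value is a letter on its own, while the int-based version counts it.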
How to reproduce the error
Cuneiform characters range (Akkadian corpus)
The following error occurs when validating a tokenized text (it looks like the characters are incorrectly handled by the tokenizer, which then produces invalid XML characters):
Execution du script : /home/alavrent/TXM/scripts/import/xmlLoader.groovy
Trying to read import properties file: /home/alavrent/xml/corpusakkadiencuneiform/import.properties
Trying to read metadatas from: /home/alavrent/xml/corpusakkadiencuneiform/metadata.csv
no metadata file: /home/alavrent/xml/corpusakkadiencuneiform/metadata.csv
-- IMPORTER - Reading source files
Sources clean & validation .
Files processed: [/home/alavrent/TXM/corpora/corpusakkadiencuneiform/txm/CORPUSAKKADIENCUNEIFORM/AS_22_6_cuneif-unicode.xml]
Tokenizing 1 files .
Building XML-TXM (1 files)
.Unexpected error while parsing file file:/home/alavrent/TXM/corpora/corpusakkadiencuneiform/tokenized/AS_22_6_cuneif-unicode.xml : javax.xml.stream.XMLStreamException: ParseError at [row,col]:[9,9]
Message: La référence de caractère "&#
Location line: 9 character: 9
Failed to process /home/alavrent/TXM/corpora/corpusakkadiencuneiform/tokenized/AS_22_6_cuneif-unicode.xml
To reproduce the bug, import the /SpUV/Corpus Akkadien - M. Beranger/Marine_avril_2015-2/AS_22_6_cuneif-unicode.xml file with the XML/w+CSV import module.
Simple Smiley (COP21 corpus)
A simple Smiley character can break the TXT+CSV import module with the following error message:
Building xml-tei-txm (1 files)
.Unexpected error while parsing file file:/home/sheiden/TXM/corpora/CORPUSROSSI/stokenized/un_cop21_june17.xml : javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11776,1028]
Message: Le type d'élément "w" doit se terminer par la balise de fin correspondante "</w>".
Location line: 11776 character: 1028
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11776,1028]
Message: Le type d'élément "w" doit se terminer par la balise de fin correspondante "</w>".
Smiley range (SMS corpus)
Importing the 88milSMS corpus (http://88milsms.huma-num.fr/corpus.html).
The problem can be reproduced by importing the 'sms-emoji-sample.xml' file attached to the ticket.
- TXT+CSV import module (June 2015) applied to UTF-8 TXT:
J'irai me renseigner aussi ( si j'trouve le batiment {emoji here, which breaks the Redmine wiki} )
produces the following error:
« .** Erreur lors de l'exécution du script groovy : javax.xml.stream.XMLStreamException: ParseError at [row,col]:[12,172627] Message: XML document structures must start and end within the same entity. Moteur de recherche lancé en mode mémoire. »
- XML/w+CSV on the following XML input:
<sms id="92637">
<date>15 déc. 2011 15:58:27</date>
<tel_id>374</tel_id>
<cont><SUR_13> en faite jveux bien que t ailles a corb? <emoji description="flushed face" unicode="U+1F633">{emoji here, which breaks the Redmine wiki}</emoji> ca <PRE_5>gene trop de te faire faire des aller retour :/ si tu peux pas c est pas grve</cont>
</sms>
produces the same error.
Classical Chinese characters range (Taiwan corpus)
The problem can be reproduced by importing the 'problem2.xml' file attached to the ticket with the XML/w+CSV import module.
I traced the error for a while and found that the source of the problem seems to be <w>[impossible to put the original character, it breaks Redmine]</w>. (So that you can reproduce the problem, I attach the file to this e-mail. The <w>[impossible to put the original character, it breaks Redmine]</w> is on the third line of the attached XML file.)
The character [impossible to put the original character, it breaks Redmine] is not a commonly used Chinese character, so it is not located in the Basic Multilingual Plane (BMP) of Unicode. It belongs to CJK Unified Ideographs Extension B, is located in the Supplementary Ideographic Plane (SIP), and is a "4-byte" UTF-8 character: http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%F0%A6%9F%9B
4-byte UTF-8 characters very easily cause processing problems. Could you help us solve this issue? It is a very important issue for anyone who needs to work with classical Chinese texts.
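The properties quoted in this report can be checked with a few lines of Java. This is a minimal sketch: the class `SupplementaryCharInfo` and its methods are hypothetical, and the code point U+267DB is taken from the percent-encoded bytes %F0%A6%9F%9B in the Unihan URL above.

```java
import java.nio.charset.StandardCharsets;

// Minimal sketch (hypothetical names): inspects the plane, the Unicode block
// and the UTF-8 encoding of one code point.
final class SupplementaryCharInfo {

    /** Returns the UTF-8 bytes of the code point as space-separated hex. */
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Plane number: 0 is the BMP, 2 is the Supplementary Ideographic Plane. */
    static int plane(int codePoint) {
        return codePoint >>> 16;
    }

    public static void main(String[] args) {
        int cp = 0x267DB;
        System.out.println(utf8Hex(cp));
        System.out.println(plane(cp));
        System.out.println(Character.UnicodeBlock.of(cp));
        System.out.println(Character.isSupplementaryCodePoint(cp));
    }
}
```

The four bytes returned by `utf8Hex` are the same ones that appear percent-encoded in the URL, which confirms that the input file itself can be valid UTF-8 even though the character needs two `char` values once loaded into a Java String.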
Solution
Updating Eclipse and Java to version 17 fixed the bug.
History
#1 Updated by Serge Heiden over 10 years ago
- File problem2.xml added
- Subject changed from RCP: 0.7.6, Impossible to import a corpus with cuneiform unicode characters. to RCP: 0.7.7, Impossible to import a corpus with some Unicode characters
- Description updated (diff)
- Priority changed from Normal to High
- Target version changed from TXM 0.X.X to TXM 0.7.8
#2 Updated by Serge Heiden over 10 years ago
- Description updated (diff)
#3 Updated by Serge Heiden about 10 years ago
- Priority changed from High to Urgent
#4 Updated by Serge Heiden about 10 years ago
- Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)
#5 Updated by Sebastien Jacquot almost 10 years ago
We also have problems with some corpora containing emojis.
I am not really sure about this, but after some quick research:
About the file problem2.xml, it seems the 0xD859 0xDFDB sequence may only be valid in UTF-16 and not in UTF-8 XML files (see surrogate pairs).
A temporary coding workaround could be to trim or replace all the chars outside the valid UTF-8 range, see: http://www.unicodemap.org/search.asp?search=%F0%A6%9F%9B, http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char, http://blog.mark-mclaren.info/2007/02/invalid-xml-characters-when-valid-utf8_5873.html.
(Java and Xerces handle UTF-16, but I don't think R or CWB can?)
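One way such a filter could look is sketched below. It assumes that "valid range" means the Char production of the XML 1.0 recommendation linked above; the class `XmlCharFilter` and its methods are hypothetical and are not part of TXM.

```java
// Minimal sketch (hypothetical names): drops every code point that the
// XML 1.0 Char production does not allow.
final class XmlCharFilter {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns s without the code points rejected by isXmlChar. */
    static String stripInvalid(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // an unpaired surrogate is returned as is
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String pair = "a" + new String(Character.toChars(0x267DB)) + "b";
        String lone = "a" + (char) 0xD859 + "b";
        System.out.println(stripInvalid(pair).equals(pair)); // well-formed pair is kept
        System.out.println(stripInvalid(lone));              // unpaired surrogate is dropped
    }
}
```

Note that a well-formed surrogate pair such as the one for U+267DB passes this filter, because the production allows the range U+10000 to U+10FFFF; only unpaired surrogates and forbidden control characters are removed.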
#6 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#7 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#8 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#9 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#10 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#11 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#12 Updated by Sebastien Jacquot almost 10 years ago
More information:
- despite my tests, the Xerces implementation always writes the two values of the UTF-16 surrogate pair instead of the supplementary code point in /org.txm.toolbox/src/groovy/filters/Tokeniser/SimpleTokenizerXml.groovy
- my tests also show that the behavior differs depending on whether the XMLStreamWriter is created by passing the encoding to the factory method itself or by giving it a UTF-8 OutputStreamWriter (maybe a Xerces bug? e.g. see related: https://bugs.openjdk.java.net/browse/JDK-8072081)
Here is some code that demonstrates this:
package main;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;

import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class MainTestXMLUnicodeRanges {

    public static void main(String[] args) {
        try {
            // Writer 1: the factory receives the OutputStream and the encoding name.
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(
                    new FileOutputStream(new File("C:/Tools/Textometrie/___corpus/xml/test_bug_import_chinois/test2.xml")), "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("foo");
            writer.writeCharacters("\uD835\uDD0A"); // U+1D50A
            writer.writeCharacters(" ");
            writer.writeCharacters("\uD859\uDFDB"); // U+267DB
            writer.writeEndElement();
            writer.flush();
            writer.close();

            // Writer 2: the factory receives an OutputStreamWriter already set to UTF-8.
            XMLStreamWriter writer2 = XMLOutputFactory.newInstance().createXMLStreamWriter(
                    new OutputStreamWriter(new FileOutputStream(new File("C:/Tools/Textometrie/___corpus/xml/test_bug_import_chinois/test2b.xml")), "UTF-8"));
            writer2.writeStartDocument("UTF-8", "1.0");
            writer2.writeStartElement("foo");
            writer2.writeCharacters("\uD835\uDD0A"); // U+1D50A
            writer2.writeCharacters(" ");
            writer2.writeCharacters("\uD859\uDFDB"); // U+267DB
            writer2.writeEndElement();
            writer2.flush();
            writer2.close();
        } catch (XMLStreamException | FactoryConfigurationError
                | FileNotFoundException | UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}
The writer built on an OutputStreamWriter writes the invalid surrogate pair values to its XML file, whereas the writer created from the stream and the encoding name writes the expected characters.
Therefore the initializations in SimpleTokenizer and some other import classes could be changed from:
output = new OutputStreamWriter(new FileOutputStream(outfile) , "UTF-8");
writer = factory.createXMLStreamWriter(output)
to:
writer = factory.createXMLStreamWriter(new FileOutputStream(outfile), "UTF-8");
The problem may be the same with the XML reader instantiation.
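To check a given JVM without touching the file system, the two ways of creating the writer can be compared in memory. This is a minimal sketch under stated assumptions: the class `WriterVariantCheck` and its methods are hypothetical, only standard `javax.xml.stream` calls are used, and which variant misbehaves (if any) depends on the JDK and StAX implementation in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Minimal sketch (hypothetical names): writes <foo>text</foo> in memory with
// either way of creating the XMLStreamWriter and returns the decoded output.
final class WriterVariantCheck {

    static String write(String text, boolean wrapInWriter) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            XMLOutputFactory factory = XMLOutputFactory.newInstance();
            XMLStreamWriter writer = wrapInWriter
                    ? factory.createXMLStreamWriter(
                            new OutputStreamWriter(bytes, StandardCharsets.UTF_8))
                    : factory.createXMLStreamWriter(bytes, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("foo");
            writer.writeCharacters(text);
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.flush();
            writer.close();
            return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if the output contains a numeric character reference. */
    static boolean hasCharRef(String xml) {
        return xml.contains("&#");
    }

    public static void main(String[] args) {
        String text = new String(Character.toChars(0x267DB));
        for (boolean wrap : new boolean[] {false, true}) {
            String xml = write(text, wrap);
            System.out.println("OutputStreamWriter=" + wrap
                    + " keepsCharacter=" + xml.contains(text)
                    + " hasCharRef=" + hasCharRef(xml));
        }
    }
}
```

If the two lines printed differ, the variant whose output contains character references to surrogate values instead of the character itself is the one to avoid on that JVM.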
I would rather not make the modifications myself, as I do not know these "critical" import sections very well.
There may also be some optimizations to test in these sections, for example using StringBuilder instead of StringBuffer and/or resetting the buffers instead of recreating them at each cycle/SAX event.
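The buffer reuse suggested here could look like the following minimal sketch. It is not the TXM import code: the handler `ReusedBufferHandler`, its `wordsOf` helper and the `w` element it collects are assumptions chosen only to show one StringBuilder being reset with `setLength(0)` rather than reallocated on every SAX event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Minimal sketch (hypothetical names): one StringBuilder is allocated once
// and reset at each <w> start instead of being recreated per SAX event.
final class ReusedBufferHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    final List<String> words = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("w".equals(qName)) {
            text.setLength(0); // reset, keeping the allocated capacity
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("w".equals(qName)) {
            words.add(text.toString());
        }
    }

    /** Parses the XML string and returns the text of each <w> element. */
    static List<String> wordsOf(String xml) {
        try {
            ReusedBufferHandler handler = new ReusedBufferHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.words;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wordsOf("<r><w>ab</w><w>cd</w></r>"));
    }
}
```

StringBuilder is unsynchronized, which is sufficient here because a SAX handler instance is driven by a single thread.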
#13 Updated by Serge Heiden almost 10 years ago
- File sms-emoji-sample.xml added
#14 Updated by Serge Heiden almost 10 years ago
- Description updated (diff)
#15 Updated by Matthieu Decorde almost 10 years ago
- File smileys.xml added
#16 Updated by Serge Heiden over 8 years ago
- Description updated (diff)
#17 Updated by Alexey Lavrentev over 8 years ago
Temporary workaround: eliminate the problematic characters with a search and replace:
1. [^\u0001-\uFFFF] : anything that does not fit in a single UTF-16 code unit
or
2. [^\p{P}\p{L}\p{N}\p{Z}\p{C}\p{S}]+ : anything that does not belong to the following classes:
- punctuation
- letters
- numbers
- separators
- control characters
- symbols
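The first pattern can also be applied programmatically. This is a minimal sketch, assuming Java's `java.util.regex` engine (which matches by code point, so a surrogate pair counts as one character above U+FFFF); the class `SupplementaryStripper` and its `strip` method are hypothetical.

```java
import java.util.regex.Pattern;

// Minimal sketch (hypothetical names): removes every character that needs
// more than one UTF-16 code unit, using pattern 1 from the list above.
final class SupplementaryStripper {

    // Any code point outside the range U+0001..U+FFFF.
    private static final Pattern OUTSIDE_BMP = Pattern.compile("[^\\u0001-\\uFFFF]");

    static String strip(String s) {
        return OUTSIDE_BMP.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // U+1F633 is the "flushed face" emoji from the SMS sample.
        String sms = "ok " + new String(Character.toChars(0x1F633)) + " ok";
        System.out.println(strip(sms));
    }
}
```

This loses the affected characters entirely, so it only suits corpora where emojis or rare ideographs are not needed for the analysis.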
#18 Updated by Serge Heiden over 8 years ago
- Description updated (diff)
#19 Updated by Serge Heiden over 8 years ago
- Description updated (diff)
#20 Updated by Serge Heiden over 8 years ago
- Description updated (diff)
#21 Updated by Sebastien Jacquot over 7 years ago
- Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
#22 Updated by Matthieu Decorde over 6 years ago
- Target version changed from TXM 0.8.0 to TXM 0.8.2
#23 Updated by Matthieu Decorde over 4 years ago
- Target version changed from TXM 0.8.2 to TXM 0.8.4
#24 Updated by Matthieu Decorde over 2 years ago
- Description updated (diff)
- % Done changed from 0 to 80
#25 Updated by Matthieu Decorde over 2 years ago
- Target version changed from TXM 0.8.4 to TXM 0.8.3
#26 Updated by Sebastien Jacquot over 1 year ago
- % Done changed from 80 to 100
#27 Updated by Sebastien Jacquot over 1 year ago
- Status changed from New to Closed