Bug #1313: RCP: 0.7.7, Impossible to import a corpus with some Unicode characters - Plateforme TXM - Forge du Centre Blaise Pascal

Updated by Serge Heiden more than 8 years ago

It is impossible to import text containing Unicode characters in some ranges.

The following ranges have been reported to cause an error and interrupt the import process:
* Cuneiform characters range (Akkadian corpus)
* Smiley range (SMS corpus)
* Classical Chinese characters range (Taiwan corpus)

h3. Diagnostic

It has been observed that, for some Unicode characters outside the first Unicode plane (the BMP), some TXM code or the XML libraries can incorrectly write some multi-byte Unicode characters (surrogate pairs).

In some contexts, characters produced after such a character can be wrongly consumed by the character resolution scanner, which mangles the character input stream.

h3. Hypothesis

Some TXM code uses the 'char' type to manage input characters, but a 'char' is only 16 bits wide and cannot hold a character whose code point is higher than U+FFFF. See https://stackoverflow.com/questions/1029897/comparing-a-char-to-a-code-point.
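
A minimal Java sketch of this hypothesis, not taken from TXM code: the class name @CharVsCodePoint@ and its two helper methods are hypothetical, and U+1F633 (the 'flushed face' emoji of the SMS sample) is used only as an example. It shows that such a character occupies two 'char' values in a String and that casting its code point to 'char' silently truncates it.

```java
// Hypothetical demo class, not part of TXM.
class CharVsCodePoint {
    // Number of UTF-16 code units ('char' values) needed for one code point.
    static int utf16Length(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // What remains of a code point after a narrowing cast to 'char'.
    static int truncatedToChar(int codePoint) {
        return (char) codePoint;
    }

    public static void main(String[] args) {
        int flushedFace = 0x1F633; // outside the BMP
        String s = new String(Character.toChars(flushedFace));
        System.out.println(s.length());                              // 2 'char' values
        System.out.println(s.codePointCount(0, s.length()));         // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Integer.toHexString(truncatedToChar(flushedFace))); // f633
    }
}
```

Any code that reads @s.charAt(0)@ and treats it as a complete character therefore sees only half of the surrogate pair.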

h3. Solution

Replace all uses of the 'char' type in TXM code by using the 'int' type to manage characters.

Here is what Java says about 'out of BMP' Unicode character processing (in 2004): "Supplementary Characters in the Java Platform":http://www.oracle.com/us/technologies/java/supplementary-142654.html .

Here is a Java bug report:
* https://bugs.openjdk.java.net/browse/JDK-8073700

And maybe a resolution:
* https://bugs.openjdk.java.net/browse/JDK-8145974
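
The replacement proposed in this section could look like the following sketch. It is only an illustration under assumptions: the class @CodePointIteration@ and its method are hypothetical and do not correspond to any existing TXM code. The loop reads each character as an 'int' code point and advances by the number of 'char' values that code point occupies, so a surrogate pair is never split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of TXM.
class CodePointIteration {
    // Returns the code points of the text, one 'int' per character.
    static List<Integer> codePoints(String text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // full code point, even above U+FFFF
            result.add(cp);
            i += Character.charCount(cp);   // 1 for BMP characters, 2 for surrogate pairs
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x1F633)) + "b";
        for (int cp : codePoints(text)) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

The same pattern applies wherever code currently loops over @charAt(i)@ with @i++@.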

h3. How to reproduce the error

h4. Cuneiform characters range (Akkadian corpus)

The following error occurs when validating a tokenized text (it looks like the characters are incorrectly handled by the tokenizer and invalid XML characters are produced):

<pre>
Execution du script : /home/alavrent/TXM/scripts/import/xmlLoader.groovy
Trying to read import properties file: /home/alavrent/xml/corpusakkadiencuneiform/import.properties
Trying to read metadatas from: /home/alavrent/xml/corpusakkadiencuneiform/metadata.csv
no metadata file: /home/alavrent/xml/corpusakkadiencuneiform/metadata.csv
-- IMPORTER - Reading source files
Sources clean & validation
.
Files processed: [/home/alavrent/TXM/corpora/corpusakkadiencuneiform/txm/CORPUSAKKADIENCUNEIFORM/AS_22_6_cuneif-unicode.xml]
Tokenizing 1 files
.
Building XML-TXM (1 files)
.Unexpected error while parsing file file:/home/alavrent/TXM/corpora/corpusakkadiencuneiform/tokenized/AS_22_6_cuneif-unicode.xml : javax.xml.stream.XMLStreamException: ParseError at [row,col]:[9,9]
Message: La référence de caractère "&#
Location line: 9 character: 9
Failed to process /home/alavrent/TXM/corpora/corpusakkadiencuneiform/tokenized/AS_22_6_cuneif-unicode.xml
</pre>

To reproduce the bug, use the /SpUV/Corpus Akkadien - M. Beranger/Marine_avril_2015-2/AS_22_6_cuneif-unicode.xml file with the XML/w+CSV import module.
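
One way an invalid character reference could end up in the tokenized file is sketched below. This is an assumption about the mechanism, not a confirmed diagnosis of the TXM tokenizer: the class @XmlCharRef@, both methods and the sample sign U+12000 are hypothetical. Writing one numeric reference per 'char' emits the two surrogate values, which XML 1.0 does not accept as characters, while writing one reference per code point emits a valid reference.

```java
// Hypothetical demo class, not part of TXM. Only non-ASCII characters are
// turned into numeric references; markup characters are left untouched.
class XmlCharRef {
    // Problematic variant: one reference per UTF-16 code unit.
    static String escapePerChar(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) sb.append("&#").append((int) c).append(';');
            else sb.append(c);
        }
        return sb.toString();
    }

    // Code point variant: one reference per character.
    static String escapePerCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp > 0x7F) sb.append("&#").append(cp).append(';');
            else sb.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sign = new String(Character.toChars(0x12000)); // a cuneiform sign
        System.out.println(escapePerChar(sign));      // &#55304;&#56320; (surrogates, invalid in XML)
        System.out.println(escapePerCodePoint(sign)); // &#73728; (valid)
    }
}
```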

h4. Simple Smiley (COP21 corpus)

A simple Smiley character can break the TXT+CSV import module with the following error message:
<pre>
Building xml-tei-txm (1 files)
.Unexpected error while parsing file file:/home/sheiden/TXM/corpora/CORPUSROSSI/stokenized/un_cop21_june17.xml : javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11776,1028]
Message: Le type d'élément "w" doit se terminer par la balise de fin correspondante "</w>".
Location line: 11776 character: 1028
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[11776,1028]
Message: Le type d'élément "w" doit se terminer par la balise de fin correspondante "</w>".
</pre>

h4. Smiley range (SMS corpus)

This error occurs when importing the 88milSMS corpus (http://88milsms.huma-num.fr/corpus.html).

The problem can be reproduced by importing the 'sms-emoji-sample.xml' file attached to the ticket.

- *TXT+CSV import* module (June 2015) applied to UTF-8 TXT:
<pre>
J'irai me renseigner aussi ( si j'trouve le batiment {emoji here, which breaks the Redmine wiki} )
</pre>
produces the following error:
<pre>
« .** Erreur lors de l'exécution du script groovy : javax.xml.stream.XMLStreamException: ParseError at [row,col]:[12,172627]
Message: XML document structures must start and end within the same entity.
Moteur de recherche lancé en mode mémoire. »
</pre>

- *XML/w+CSV* on the following XML input:
<pre>
<sms id="92637">
<date>15 déc. 2011 15:58:27</date>
<tel_id>374</tel_id>
<cont><SUR_13> en faite jveux bien que t ailles a corb? <emoji description="flushed face" unicode="U+1F633">{emoji here, which breaks the Redmine wiki}</emoji> ca <PRE_5>gene trop de te faire faire des aller retour :/ si tu peux pas c est pas grve</cont>
</sms>
</pre>
produces the same error.

h4. Classical Chinese characters range (Taiwan corpus)

The problem can be reproduced by importing the 'problem2.xml' file attached to the ticket with the XML/w+CSV import module.

<pre>
I traced the error for a while, then I found the source of the problem seems to be from <w>[impossible to put the original character, it breaks Redmine]</w>.
( In order to let you reproduce the problem, I attach the file to you in this e-mail. The <w>[impossible to put the original character, it breaks Redmine]</w> is in the third line of the attached XML file.)

The character [impossible to put the original character, it breaks Redmine] is not a commonly used Chinese character, so it is not located in the Basic Multilingual Plane (BMP) of Unicode. [impossible to put the original character, it breaks Redmine] belongs to CJK Unified Ideographs Extension B, is located in the Supplementary Ideographic Plane (SIP), and is a "4-byte" UTF-8 character.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=%F0%A6%9F%9B

The 4-byte UTF-8 characters very easily cause processing problems. Could you help us solve this issue? It is a very important issue for those who need to work with classical Chinese texts.
</pre>
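
The description in the report above can be checked with a short Java sketch. The class @Utf8FourBytes@ is hypothetical; the four bytes are the ones percent-encoded in the Unihan URL quoted in the message (%F0%A6%9F%9B). Decoding them as UTF-8 gives a single code point that needs two 'char' values in Java and belongs to the CJK Unified Ideographs Extension B block.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of TXM.
class Utf8FourBytes {
    // The four UTF-8 bytes percent-encoded in the Unihan URL of the report.
    static final byte[] BYTES = {(byte) 0xF0, (byte) 0xA6, (byte) 0x9F, (byte) 0x9B};

    static String decoded() {
        return new String(BYTES, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decoded();
        int cp = s.codePointAt(0);
        System.out.println(Integer.toHexString(cp));          // 267db
        System.out.println(s.length());                       // 2 'char' values (a surrogate pair)
        System.out.println(Character.UnicodeBlock.of(cp));    // block of the code point
    }
}
```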
