Bug #3207: Ubuntu 20.04&Mac OS X, unstable Concordance result - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #3207

Updated by Matthieu Decorde more than 3 years ago

h3. To reproduce 1

Import the following TRS source file:
<pre>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans>
<Speakers>
<Speaker id="guest2669" name="guest2669" type="unknown" dialect="native" accent="" scope="local"/>
<Speaker id="guest437" name="guest437" type="unknown" dialect="native" accent="" scope="local"/>
</Speakers>
<Episode>
<Section topic="20/03/1957" type="Spirotechnique, télévision pour la recherche sous marine" resume="tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést " startTime="404" endTime="476" synchronized="true">
<Turn startTime="410.53" endTime="411.56" speaker="guest437">
<Sync time="410.53"/>
<w startTime="410.53" endTime="410.72" conf="0.97" pos="P" punct="NONE" case="uc" ne="">Au</w>
<w startTime="410.77" endTime="411.18" conf="0.97" pos="N" punct="NONE" case="uc" ne="">Havre</w>
<w startTime="411.21" endTime="411.34" conf="0.99" pos="P" punct="NONE" case="O" ne="">au</w>
<w startTime="411.34" endTime="411.56" conf="0.99" pos="N" punct="point" case="O" ne="">pied</w>
<w startTime="411.56" endTime="411.56" punct="point">.</w>
</Turn>
</Section>

</Episode>
</Trans>
</pre>

Compute a concordance with the following CQL query:
<pre>
[_.div_resume=".*é.*"%cd]
</pre>

CQP error messages (CQP was patched to display the offending string):
<pre>
CL: major error, invalid UTF8 string passed to cl_string_canonical: tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést
[same message repeated]
CL: major error, invalid UTF8 string passed to cl_string_canonical: tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést
</pre>

h3. To reproduce 2

On Ubuntu 20.04, with the *AF-VOIX-OFF-V...* corpus:
<pre>
Démarrage de TXM 0.8.2 (2021-11-17 16h09)…
TXM est prêt.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
16 occurrences.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
17 occurrences.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
19 occurrences.
etc. etc.
</pre>

BUG: the same query, run repeatedly on the same corpus, returns a different number of matches on each call (16, 17, 19, etc.).

h3. Diagnostic

The console produces the following critical error message:
<pre>
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
etc. etc.

</pre>

*CQP Source Analysis*

This message comes from the @special-chars.c@ CQP source file (extract):

<pre>
<code class="c">
if (charset == utf8) {

  /* pointers for UTF8 processing */
  gchar *string = NULL;
  gchar *precomposed = NULL;
  gchar *folded = NULL;
  gchar *current_char;
  gchar *next_char_begins;

  int icase = (flags & IGNORE_CASE) != 0;
  int idiac = (flags & IGNORE_DIAC) != 0;

  /* UTF8 accent folding */
  if (idiac) {
    if (NULL == (string = g_utf8_normalize((gchar *)s, -1, G_NORMALIZE_NFD)) ) {
      fprintf(stdout, "CL: major error, invalid UTF8 string passed to cl_string_canonical...\n");
      return;
    }

    for (current_char = string; *current_char != '\0'; /* increment is done in-loop */) {
      next_char_begins = g_utf8_next_char(current_char);
      if (g_unichar_ismark(g_utf8_get_char(current_char))) {
        /* downcopy to overwrite the mark character */
        strcpy(current_char, next_char_begins);
        /* and keep current_char the same */
      }
      else
        current_char = next_char_begins;
    }
  }
  /* end of accent folding */
  else
    string = (gchar *)s;

  /* UTF8 precomposing -- always happens */
  /* precomposed = g_utf8_normalize(string, -1, G_NORMALIZE_NFC); */ /* -- duplicate call to g_utf8_normalize() removed */
  if (NULL == (precomposed = g_utf8_normalize(string, -1, G_NORMALIZE_NFC)) ) {
    fprintf(stdout, "CL: major error, invalid UTF8 string passed to cl_string_canonical...\n");
    return;
  }
  ...

</code>
</pre>
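
The accent-folding loop in the extract removes combining marks in place by copying the tail of the string over the mark with @strcpy@. Source and destination overlap in that call, and the C standard leaves @strcpy@ undefined for overlapping objects. The following is a minimal, glib-free sketch of the same idea using @memmove@. It is only an illustration: the helper names are hypothetical (they are not CQP or glib functions), it assumes valid UTF-8 input, and it only recognizes U+0300..U+036F rather than everything @g_unichar_ismark@ accepts. Whether the overlapping copy plays any part in this bug has not been established.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of the UTF-8 sequence starting at p.
   Assumes valid UTF-8, but never steps past the terminating NUL. */
static size_t utf8_seq_len(const char *p)
{
    unsigned char c = (unsigned char)p[0];
    size_t want = (c < 0x80) ? 1
                : ((c & 0xE0) == 0xC0) ? 2
                : ((c & 0xF0) == 0xE0) ? 3 : 4;
    size_t len = 1;
    while (len < want && p[len] != '\0')
        len++;
    return len;
}

/* Hypothetical helper: true if p starts a code point in U+0300..U+036F
   (Combining Diacritical Marks), i.e. bytes CC 80..CC BF or CD 80..CD AF. */
static int is_combining_mark(const unsigned char *p)
{
    if (p[0] == 0xCC) return p[1] >= 0x80 && p[1] <= 0xBF;
    if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF;
    return 0;
}

/* Remove combining marks from an NFD-normalized string, in place. */
void strip_marks(char *s)
{
    char *cur = s;
    while (*cur != '\0') {
        size_t len = utf8_seq_len(cur);
        if (is_combining_mark((const unsigned char *)cur)) {
            /* The regions overlap, so memmove is required here;
               the +1 also copies the terminating NUL. */
            memmove(cur, cur + len, strlen(cur + len) + 1);
            /* cur stays where it is: it now points at the next character */
        } else {
            cur += len;
        }
    }
}
```

For example, calling @strip_marks@ on the decomposed bytes @t e CC 81 s t@ (e followed by U+0301) leaves @test@, while a string without combining marks is left unchanged.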

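The message is printed on two code paths: inside the @if (idiac)@ branch, taken when the '%d' CQL modifier is used (NFD normalization), and in the precomposing step (NFC normalization), which always runs.
In both cases the error is raised because @g_utf8_normalize@ fails (returns NULL), which according to its documentation happens when the input string is not valid UTF-8.
This function comes from the glib library: https://docs.gtk.org/glib/func.utf8_normalize.html.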
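
Since the error text blames an invalid UTF-8 string, one way to narrow the problem down is to check the bytes that actually reach @cl_string_canonical@, independently of whichever glib build gets loaded. The following is a minimal sketch of such a check in plain C; the function name @is_valid_utf8@ is hypothetical and is not part of CQP or glib.

```c
#include <stddef.h>

/* Returns 1 if the NUL-terminated byte string s is well-formed UTF-8,
   0 otherwise. Rejects truncated sequences, stray continuation bytes,
   overlong encodings, surrogates and code points above U+10FFFF. */
int is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        unsigned long cp;
        size_t extra;

        if (*p < 0x80) { p++; continue; }                 /* ASCII */
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return 0;        /* stray continuation or invalid lead byte */
        p++;

        for (size_t i = 0; i < extra; i++, p++) {
            /* A NUL here fails the test too: the sequence is truncated. */
            if ((*p & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (*p & 0x3F);
        }

        if (extra == 1 && cp < 0x80) return 0;            /* overlong */
        if (extra == 2 && cp < 0x800) return 0;           /* overlong */
        if (extra == 3 && cp < 0x10000) return 0;         /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;       /* surrogate */
        if (cp > 0x10FFFF) return 0;                      /* out of range */
    }
    return 1;
}
```

With this check, @tést@ encoded as UTF-8 (bytes @74 C3 A9 73 74@) is accepted, whereas the same word cut after @C3@, or encoded in Latin-1 (@74 E9 73 74@), is rejected. If the strings from the reproduction pass such a check but @g_utf8_normalize@ still returns NULL, the library implementation becomes the more likely culprit.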

*Diagnostic #1*

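The CL error is catastrophic: @cl_string_canonical@ returns without having canonicalized the string and without reporting the failure to its caller, so any result computed afterwards is unreliable. The Java code that calls into this function should ABORT THE COMMAND in this case.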

TO DO: add abort code to the Java calling code.

*Binary Analysis*

The debug/LibrariesVersions TXM macro lists the following TXM dynamic libraries:

<pre>
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libzip.so
/usr/lib/TXM-0.8.2beta/plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.800.v20180827-1352/eclipse_1705.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libnet.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libnio.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-pi3-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-cairo-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-atk-gtk-4919.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libawt_xawt.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libawt.so
/home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libsunec.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libfontmanager.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-webkit-gtk-4919.so

</pre>

Of these, @/home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so@ is the CQP library.

It does indeed require the @g_utf8_normalize@ function from glib (listed by @nm@ as an undefined symbol, type @U@, to be resolved at load time):

<pre>
$ nm -e /home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so | grep ' g_'
U g_access
U g_dir_close
U g_dir_open
U g_dir_read_name
U g_strreverse
U g_unichar_ismark
U g_utf8_casefold
U g_utf8_collate
U g_utf8_get_char
U g_utf8_normalize
U g_utf8_skip
U g_utf8_strreverse
U g_utf8_validate

</pre>

*Which glib library version is used in TXM 0.8.2beta on Ubuntu 20.04?*

<pre>
$ locate libglib|grep '^/usr/lib'
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so
/usr/lib/cli/glib-sharp-2.0/libglibsharpglue-2.so
/usr/lib/i386-linux-gnu/libglib-2.0.so.0
/usr/lib/i386-linux-gnu/libglib-2.0.so.0.6400.6
/usr/lib/x86_64-linux-gnu/libglib-2.0.a
/usr/lib/x86_64-linux-gnu/libglib-2.0.so
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
/usr/lib/x86_64-linux-gnu/libglibmm-2.4.so.1
/usr/lib/x86_64-linux-gnu/libglibmm-2.4.so.1.3.0
/usr/lib/x86_64-linux-gnu/libglibmm_generate_extra_defs-2.4.so.1
/usr/lib/x86_64-linux-gnu/libglibmm_generate_extra_defs-2.4.so.1.3.0

</pre>

@/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so@ is the prime suspect, since it ships inside the JRE bundled with TXM.

And indeed, this library defines @g_utf8_normalize@ in its text (code) section (symbol type @T@):

<pre>
$ nm -e /usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so | grep g_utf8_normalize
0000000000073ede T g_utf8_normalize
0000000000073b01 T _g_utf8_normalize_wc

</pre>

*Diagnostic #2*

Is the glib version bundled as @libglib-lite.so@ compatible with @libcqpjni.so@?
