Bug #3207

Updated by Matthieu Decorde more than 3 years ago

h3. To reproduce

On Ubuntu 20.04 with the *AF-VOIX-OFF-V4-2021-05-19* corpus:
<pre>
Démarrage de TXM 0.8.2 (2021-11-17 16h09)…
TXM est prêt.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
16 occurrences.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
17 occurrences.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
19 occurrences.
etc. etc.
</pre>

BUG: the same query returns a different number of matches on each call (16, then 17, then 19, etc.).

h3. Diagnostic

The console produces the following critical error message:
<pre>
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
etc. etc.

</pre>

*CQP Source Analysis*

This message comes from the @special-chars.c@ CQP source file (extract):

<pre>
<code class="c">
if (charset == utf8) {

  /* pointers for UTF8 processing */
  gchar *string = NULL;
  gchar *precomposed = NULL;
  gchar *folded = NULL;
  gchar *current_char;
  gchar *next_char_begins;

  int icase = (flags & IGNORE_CASE) != 0;
  int idiac = (flags & IGNORE_DIAC) != 0;

  /* UTF8 accent folding */
  if (idiac) {
    if (NULL == (string = g_utf8_normalize((gchar *)s, -1, G_NORMALIZE_NFD)) ) {
      fprintf(stdout, "CL: major error, invalid UTF8 string passed to cl_string_canonical...\n");
      return;
    }

    for (current_char = string; *current_char != '\0'; /* increment is done in-loop */) {
      next_char_begins = g_utf8_next_char(current_char);
      if (g_unichar_ismark(g_utf8_get_char(current_char))) {
        /* downcopy to overwrite the mark character */
        strcpy(current_char, next_char_begins);
        /* and keep current_char the same */
      }
      else
        current_char = next_char_begins;
    }
  }
  /* end of accent folding */
  else
    string = (gchar *)s;

  /* UTF8 precomposing -- always happens */
  /* precomposed = g_utf8_normalize(string, -1, G_NORMALIZE_NFC); */ /* -- duplicate call to g_utf8_normalize() removed */
  if (NULL == (precomposed = g_utf8_normalize(string, -1, G_NORMALIZE_NFC)) ) {
    fprintf(stdout, "CL: major error, invalid UTF8 string passed to cl_string_canonical...\n");
    return;
  }
  ...

</code>
</pre>

The message is printed in two places: in the accent-folding branch (@if (idiac)@), reached when the '%d' CQL modifier is used, and in the precomposing step, which always runs.
In both places, the message means that the @g_utf8_normalize@ call failed (returned NULL).
@g_utf8_normalize@ comes from the glib library: https://docs.gtk.org/glib/func.utf8_normalize.html.
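
According to the glib documentation linked above, @g_utf8_normalize@ returns NULL when its input is not valid UTF-8. One way to tell malformed input apart from a misbehaving library is to validate the strings independently of glib. The following is a minimal sketch in plain C; @is_valid_utf8@ is a hypothetical helper written for this note, not a CWB or glib function, and it is not claimed to match @g_utf8_validate@ in every corner case.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the NUL-terminated byte string s is well-formed UTF-8,
 * 0 otherwise. Hypothetical helper, not part of CWB or glib. */
int is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(s != NULL);

    while (*p != '\0') {
        unsigned int cp;   /* code point being decoded */
        unsigned int min;  /* smallest code point allowed for this length */
        int extra;         /* number of continuation bytes expected */

        if (*p < 0x80) {                 /* plain ASCII byte */
            p++;
            continue;
        } else if ((*p & 0xE0) == 0xC0) { /* 2-byte sequence */
            cp = *p & 0x1F; extra = 1; min = 0x80;
        } else if ((*p & 0xF0) == 0xE0) { /* 3-byte sequence */
            cp = *p & 0x0F; extra = 2; min = 0x800;
        } else if ((*p & 0xF8) == 0xF0) { /* 4-byte sequence */
            cp = *p & 0x07; extra = 3; min = 0x10000;
        } else {
            return 0;  /* stray continuation byte or invalid lead byte */
        }

        p++;
        while (extra-- > 0) {
            /* a NUL here means the sequence was truncated */
            if ((*p & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }

        if (cp < min)                      return 0;  /* overlong encoding */
        if (cp > 0x10FFFF)                 return 0;  /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)  return 0;  /* UTF-16 surrogate */
    }
    return 1;
}
```

For example, @is_valid_utf8("mutil\xC3\xA9")@ accepts the UTF-8 encoding of "mutilé", while @is_valid_utf8("mutil\xE9")@ rejects the same word encoded in Latin-1. If the strings passed to @cl_string_canonical@ pass such a check and @g_utf8_normalize@ still returns NULL, the library itself becomes the more likely culprit.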

*Diagnostic #1*

The CL error is catastrophic: the Java code receiving the result of this function should ABORT THE COMMAND in this case.

TO DO: add abort code to the Java calling code.
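
In the extract above, the error path only prints a message and returns, so the caller has no way to know that canonicalisation failed. As a rough sketch of how the native side could expose the failure so that the calling code can abort, the function below returns a status code instead. Everything here is hypothetical: the names @canonical_checked@, @CANON_OK@ and @CANON_INVALID_UTF8@ are invented for this note, and @ascii_only_normalize@ is only a stand-in that mimics the "NULL on failure" contract of @g_utf8_normalize@ so that the example compiles without glib. The real JNI interface of @libcqpjni.so@ is not shown in this issue.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status codes; not taken from the CWB sources. */
enum { CANON_OK = 0, CANON_INVALID_UTF8 = -1 };

/* A normaliser returns a newly allocated string, or NULL when it cannot
 * produce a result (the contract g_utf8_normalize has for invalid input). */
typedef char *(*normalize_fn)(const char *s);

/* Stand-in normaliser for this sketch only: copies pure-ASCII input and
 * returns NULL as soon as it meets a byte >= 0x80. */
char *ascii_only_normalize(const char *s)
{
    size_t len = strlen(s);
    size_t i;
    char *copy;

    for (i = 0; i < len; i++)
        if ((unsigned char)s[i] >= 0x80)
            return NULL;

    copy = malloc(len + 1);
    if (copy != NULL)
        memcpy(copy, s, len + 1);  /* includes the terminating NUL */
    return copy;
}

/* Normalises s into *out and reports failure to the caller instead of
 * only logging it. On success the caller owns *out and must free() it. */
int canonical_checked(const char *s, normalize_fn normalize, char **out)
{
    assert(s != NULL && normalize != NULL && out != NULL);

    *out = normalize(s);
    if (*out == NULL) {
        fprintf(stderr, "CL: major error, invalid UTF8 string, nothing was canonicalised\n");
        return CANON_INVALID_UTF8;
    }
    return CANON_OK;
}
```

With a status of this kind, the JNI layer could raise a Java exception whenever the result is not @CANON_OK@, which would let TXM abort the command rather than display an unreliable match count.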

*Binary Analysis*

The debug/LibrariesVersions TXM macro lists the following TXM dynamic libraries:

<pre>
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libzip.so
/usr/lib/TXM-0.8.2beta/plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.800.v20180827-1352/eclipse_1705.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libnet.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libnio.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-pi3-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-cairo-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-atk-gtk-4919.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libawt_xawt.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libawt.so
/home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libsunec.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libfontmanager.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-webkit-gtk-4919.so

</pre>

Of these, @/home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so@ is the CQP library.

This library does indeed need the @g_utf8_normalize@ function from glib (it appears as an undefined symbol, marked @U@, that must be resolved at load time):

<pre>
$ nm -e /home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so | grep ' g_'
U g_access
U g_dir_close
U g_dir_open
U g_dir_read_name
U g_strreverse
U g_unichar_ismark
U g_utf8_casefold
U g_utf8_collate
U g_utf8_get_char
U g_utf8_normalize
U g_utf8_skip
U g_utf8_strreverse
U g_utf8_validate

</pre>

*Which glib library version is used in TXM 0.8.2beta on Ubuntu 20.04?*

<pre>
$ locate libglib|grep '^/usr/lib'
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so
/usr/lib/cli/glib-sharp-2.0/libglibsharpglue-2.so
/usr/lib/i386-linux-gnu/libglib-2.0.so.0
/usr/lib/i386-linux-gnu/libglib-2.0.so.0.6400.6
/usr/lib/x86_64-linux-gnu/libglib-2.0.a
/usr/lib/x86_64-linux-gnu/libglib-2.0.so
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
/usr/lib/x86_64-linux-gnu/libglibmm-2.4.so.1
/usr/lib/x86_64-linux-gnu/libglibmm-2.4.so.1.3.0
/usr/lib/x86_64-linux-gnu/libglibmm_generate_extra_defs-2.4.so.1
/usr/lib/x86_64-linux-gnu/libglibmm_generate_extra_defs-2.4.so.1.3.0

</pre>

@/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so@ is the prime suspect.

And indeed, this library defines @g_utf8_normalize@ (symbol type @T@, i.e. code in its text section):

<pre>
$ nm -e /usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so|grep g_utf8_normalize
0000000000073ede T g_utf8_normalize
0000000000073b01 T _g_utf8_normalize_wc

</pre>

*Diagnostic #2*

Is the version of this @glib-lite@ compatible with libcqpjni.so?
