Bug #3207

Ubuntu 20.04&Mac OS X, unstable Concordance result

Added by Serge Heiden about 1 year ago. Updated about 1 month ago.

Status: New
Start date: 01/18/2022
Priority: Urgent
Due date:
Assignee: -
% Done: 70%
Category: SearchEngine
Spent time: -
Target version: TXM 0.8.3

Description

To reproduce 1

import the TRS source file:

 1<?xml version="1.0" encoding="UTF-8"?>
 2<!DOCTYPE Trans SYSTEM "trans-14.dtd">
 3<Trans>
 4  <Speakers>
 5    <Speaker id="guest2669" name="guest2669" type="unknown" dialect="native" accent="" scope="local"/>
 6    <Speaker id="guest437" name="guest437" type="unknown" dialect="native" accent="" scope="local"/>
 7  </Speakers>
 8  <Episode>
 9    <Section topic="20/03/1957" type="Spirotechnique, télévision pour la recherche sous marine" resume="tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést "  startTime="404" endTime="476" synchronized="true">
10      <Turn startTime="410.53" endTime="411.56" speaker="guest437">
11        <Sync time="410.53"/>
12        <w startTime="410.53" endTime="410.72" conf="0.97" pos="P" punct="NONE" case="uc" ne="">Au</w>
13        <w startTime="410.77" endTime="411.18" conf="0.97" pos="N" punct="NONE" case="uc" ne="">Havre</w>
14        <w startTime="411.21" endTime="411.34" conf="0.99" pos="P" punct="NONE" case="O" ne="">au</w>
15        <w startTime="411.34" endTime="411.56" conf="0.99" pos="N" punct="point" case="O" ne="">pied</w>
16        <w startTime="411.56" endTime="411.56" punct="point">.</w>
17      </Turn>
18    </Section>
19
20  </Episode>
21</Trans>

compute a concordance of

[_.div_resume=".*é.*"%cd]

CQP error messages (CQP patched to display the offending string):

CL: major error, invalid UTF8 string passed to cl_string_canonical: tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést
[same message repeated]
CL: major error, invalid UTF8 string passed to cl_string_canonical: tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést tést

To reproduce 2

On Ubuntu 20.04 with the AF-VOIX-OFF-V... corpus:

Démarrage de TXM 0.8.2 (2021-11-17 16h09)…
TXM est prêt.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
16 occurrences.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
17 occurrences.
Concordance de <<div>[((_.div_titre-propre=".*(mutilé|blessé).*"%cd | _.div_resume=".*(mutilé|blessé).*"%cd | _.div_resume=".*invalide.*" | _.div_sequences=".*(mutilé|blessé).*"%cd | _.div_sequences=".*invalide.*") & _.div_descripteurs-aff-lig=".*((guerre mondiale)|(guerre d'Algérie)|(guerre d'Indochine)).*"%cd) | (_.div_titre-propre=".*grands? (mutilé|invalide)s?.*"%cd | _.div_resume=".*grands? (mutilé|invalide)s?.*"%cd | _.div_sequences=".*grands? (mutilé|invalide)s?.*"%cd) | (_.div_id="AFE85006160|AFE85003073|AFE86000295")]> dans le corpus AF-VOIX-OFF-V4-2021-05-19...
19 occurrences.
etc. etc.

BUG: A different number of matches is given for each call.

Diagnostic

glib library loaded by a running TXM process on Linux:

sudo cat /proc/25936/maps |grep "glib" 
7f20d03d3000-7f20d03f5000 r--p 00000000 fd:01 24774453                   /usr/share/locale-langpack/fr/LC_MESSAGES/glib20.mo
7f20d1402000-7f20d141e000 r--p 00000000 fd:01 22290140                   /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
7f20d141e000-7f20d14a2000 r-xp 0001c000 fd:01 22290140                   /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
7f20d14a2000-7f20d1528000 r--p 000a0000 fd:01 22290140                   /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
7f20d1528000-7f20d1529000 r--p 00125000 fd:01 22290140                   /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
7f20d1529000-7f20d152a000 rw-p 00126000 fd:01 22290140                   /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6

The console produces the following critical error message:

CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
CL: major error, invalid UTF8 string passed to cl_string_canonical...
etc. etc.

CQP Source Analysis

This message comes from the special-chars.c CQP source file (extract):

  if (charset == utf8) {

    /* pointers for UTF8 processing */
    gchar *string = NULL;
    gchar *precomposed = NULL;
    gchar *folded = NULL;
    gchar *current_char;
    gchar *next_char_begins;

    int icase = (flags & IGNORE_CASE) != 0;
    int idiac = (flags & IGNORE_DIAC) != 0;

    /* UTF8 accent folding */
    if (idiac) {
      if (NULL == (string = g_utf8_normalize((gchar *)s, -1, G_NORMALIZE_NFD)) ) {
        fprintf(stdout, "CL: major error, invalid UTF8 string passed to cl_string_canonical...\n");
        return;
      }

      for (current_char = string; *current_char != '\0'; /* increment is done in-loop */) {
        next_char_begins = g_utf8_next_char(current_char);
        if (g_unichar_ismark(g_utf8_get_char(current_char))) {
          /* downcopy to overwrite the mark character */
          strcpy(current_char, next_char_begins);
          /* and keep current_char the same */
        }
        else
          current_char = next_char_begins;
      }
    }
    /* end of accent folding */
    else
      string = (gchar *)s;

    /* UTF8 precomposing -- always happens */
    /* precomposed = g_utf8_normalize(string, -1, G_NORMALIZE_NFC); */ /* -- duplicate call to g_utf8_normalize() removed */
    if (NULL == (precomposed = g_utf8_normalize(string, -1, G_NORMALIZE_NFC)) ) {
      fprintf(stdout, "CL: major error, invalid UTF8 string passed to cl_string_canonical...\n");
      return;
    }
  ...

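The message can be printed in two places: in the accent-folding branch, taken only when the '%d' CQL modifier is used (if (idiac)...), and in the precomposing step, which always runs.
In both cases it means that the g_utf8_normalize function failed (returned NULL), which it does when its input is not valid UTF-8.
This function comes from the glib library: https://docs.gtk.org/glib/func.utf8_normalize.html.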
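
To tell whether the bytes reaching cl_string_canonical are malformed on arrival, or whether a well-formed string gets corrupted before the second normalization, it can help to log the offset of the first invalid byte. The following is a minimal sketch in plain C, without glib. The name first_invalid_utf8 is hypothetical and is not part of CWB or glib; the checks follow the standard UTF-8 byte layout and reject overlong forms, surrogates and code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not part of CWB or glib): returns the byte offset of
 * the first malformed UTF-8 sequence in the NUL-terminated string s,
 * or -1 if the whole string is well-formed. */
long first_invalid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    long i = 0;

    while (p[i] != 0) {
        unsigned char c = p[i];
        int extra;          /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for this length */
        int k;

        if (c < 0x80)                { extra = 0; cp = c;        min = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return i;      /* stray continuation byte or invalid lead byte */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = p[i + k];
            if ((cc & 0xC0) != 0x80)
                return i;   /* truncated sequence (this also stops at the NUL) */
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return i;       /* overlong form, out of range, or surrogate */

        i += extra + 1;
    }
    return -1;
}
```

Calling such a check on the input string and again on the string produced by the accent-folding loop would show at which step the bytes stop being valid UTF-8.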

Diagnostic #1

The CL error is catastrophic: the Java code receiving the result of this function should ABORT THE COMMAND in this case.

TO DO: add abort code to the Java calling code.
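
The calling code can only abort if the C side reports the failure as a status instead of printing it and returning silently. A minimal sketch of that idea, under loud assumptions: canon_status, canonical_checked and evaluate_pattern are hypothetical names, not CWB API, and the validity test is a placeholder standing in for the NULL check on g_utf8_normalize.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes; CWB's real error reporting may differ. */
typedef enum { CANON_OK = 0, CANON_EINVALID = -1 } canon_status;

/* Stand-in for cl_string_canonical: instead of printing a message and
 * returning void, it reports the failure to its caller. The check below is
 * only a placeholder (the byte 0xFF never occurs in UTF-8); the real function
 * would return CANON_EINVALID where g_utf8_normalize returns NULL. */
canon_status canonical_checked(char *s)
{
    if (strchr(s, '\xFF') != NULL)
        return CANON_EINVALID;
    /* ... case and diacritic folding would happen here ... */
    return CANON_OK;
}

/* Stand-in for the calling layer (for example a JNI bridge): stop the query
 * as soon as canonicalisation fails, so that no partial result is returned. */
int evaluate_pattern(char *pattern)
{
    if (canonical_checked(pattern) != CANON_OK) {
        fprintf(stderr, "CL: invalid UTF8 string, query aborted\n");
        return -1;  /* the caller would map this to an aborted command */
    }
    return 0;       /* continue with normal evaluation */
}
```

With a non-zero return value available, the Java side could raise an exception and cancel the concordance rather than display an unstable match count.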

Binary Analysis

The debug/LibrariesVersions TXM macro lists the following TXM dynamic libraries:

/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libzip.so
/usr/lib/TXM-0.8.2beta/plugins/org.eclipse.equinox.launcher.gtk.linux.x86_64_1.1.800.v20180827-1352/eclipse_1705.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libnet.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libnio.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-pi3-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-cairo-gtk-4919.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-atk-gtk-4919.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libawt_xawt.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libawt.so
/home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libsunec.so
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libfontmanager.so
/home/sheiden/.TXM-0.8.2/configuration/org.eclipse.osgi/142/0/.cp/libswt-webkit-gtk-4919.so

Of these, /home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so is the CQP library.

It does indeed need the g_utf8_normalize function from glib (listed as an undefined symbol, 'U'):

$ nm -e /home/sheiden/.TXM-0.8.2/plugins/org.txm.libs.cqp.linux_1.1.0.202111241119/res/linux64/libcqpjni.so | grep ' g_'
                 U g_access
                 U g_dir_close
                 U g_dir_open
                 U g_dir_read_name
                 U g_strreverse
                 U g_unichar_ismark
                 U g_utf8_casefold
                 U g_utf8_collate
                 U g_utf8_get_char
                 U g_utf8_normalize
                 U g_utf8_skip
                 U g_utf8_strreverse
                 U g_utf8_validate

Which glib library version is used in TXM 0.8.2beta on Ubuntu 20.04?

$ locate libglib|grep '^/usr/lib'
/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so
/usr/lib/cli/glib-sharp-2.0/libglibsharpglue-2.so
/usr/lib/i386-linux-gnu/libglib-2.0.so.0
/usr/lib/i386-linux-gnu/libglib-2.0.so.0.6400.6
/usr/lib/x86_64-linux-gnu/libglib-2.0.a
/usr/lib/x86_64-linux-gnu/libglib-2.0.so
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.6400.6
/usr/lib/x86_64-linux-gnu/libglibmm-2.4.so.1
/usr/lib/x86_64-linux-gnu/libglibmm-2.4.so.1.3.0
/usr/lib/x86_64-linux-gnu/libglibmm_generate_extra_defs-2.4.so.1
/usr/lib/x86_64-linux-gnu/libglibmm_generate_extra_defs-2.4.so.1.3.0

/usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so is the prime suspect.

And indeed, this library defines g_utf8_normalize in its text (code) section:

$ nm -e /usr/lib/TXM-0.8.2beta/jre/lib/amd64/libglib-lite.so|grep g_utf8_normalize
0000000000073ede T g_utf8_normalize
0000000000073b01 T _g_utf8_normalize_wc

Diagnostic #2

Is the version of this glib-lite compatible with libcqpjni.so?


Related issues

related to Bug #3211: CWB, return error for "CL: major error" error messages New 01/18/2022

History

#1 Updated by Serge Heiden about 1 year ago

  • Subject changed from Ubuntu 20.04, instable Concordance result to Ubuntu 20.04, unstable Concordance result

#2 Updated by Serge Heiden about 1 year ago

  • Priority changed from Normal to Urgent

#3 Updated by Serge Heiden about 1 year ago

  • Description updated (diff)

#4 Updated by Matthieu Decorde about 1 year ago

  • Subject changed from Ubuntu 20.04, unstable Concordance result to Ubuntu 20.04&Mac OS X, unstable Concordance result
  • Description updated (diff)

#5 Updated by Matthieu Decorde about 1 year ago

  • Description updated (diff)

#6 Updated by Serge Heiden about 1 year ago

  • Description updated (diff)

#7 Updated by Matthieu Decorde about 1 year ago

  • Description updated (diff)

#8 Updated by Matthieu Decorde about 1 year ago

  • % Done changed from 0 to 20

Fix: using 'utf8cpy' instead of 'strcpy' seems to work on Linux.

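#include <string.h> /* strlen, memcpy */

/* Copies src into dst, writing at most sizeDest bytes including the
 * terminating null. If src is too long, it is cut at a UTF-8 character
 * boundary so that no multi-byte character is split.
 * Note: memcpy requires that dst and src do not overlap. */
char* utf8cpy(char* dst, const char* src, size_t sizeDest )
{
    if( sizeDest ){
        size_t sizeSrc = strlen(src); // number of bytes not including null
        // Shorten src by one character at a time until it fits in dst.
        while( sizeSrc >= sizeDest ){

            const char* lastByte = src + sizeSrc; // Initially, pointing to the null terminator.
            while( lastByte-- > src )
                if((*lastByte & 0xC0) != 0x80) // Found the initial byte of the (potentially) multi-byte character (or found null).
                    break;

            sizeSrc = lastByte - src;
        }
        memcpy(dst, src, sizeSrc);
        dst[sizeSrc] = '\0';
    }
    return dst;
}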
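
For context on why replacing strcpy can matter here: in the accent-folding loop of special-chars.c quoted in the description, the source and destination of the strcpy call lie in the same buffer, and the C standard leaves the behaviour of strcpy (and of memcpy) undefined when the two regions overlap. memmove is specified for overlapping regions. Below is a minimal sketch of the same downcopy written with memmove, in plain C. It is not the fix that was built: strip_combining_marks is a hypothetical name, and to stay self-contained it only treats U+0300..U+036F as marks, whereas the CQP code uses g_unichar_ismark.

```c
#include <string.h>

/* True if p points at a UTF-8 encoded code point in U+0300..U+036F
 * (Combining Diacritical Marks): bytes CC 80..CC BF or CD 80..CD AF.
 * Assumption: this range stands in for glib's g_unichar_ismark. */
static int is_combining_mark(const unsigned char *p)
{
    if (p[0] == 0xCC) return p[1] >= 0x80 && p[1] <= 0xBF;
    if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF;
    return 0;
}

/* Hypothetical helper: removes combining marks from a decomposed (NFD)
 * UTF-8 string in place. */
void strip_combining_marks(char *s)
{
    unsigned char *cur = (unsigned char *)s;

    while (*cur != '\0') {
        if (is_combining_mark(cur)) {
            unsigned char *next = cur + 2; /* these marks are 2 bytes long */
            /* overlap-safe downcopy, including the terminating null */
            memmove(cur, next, strlen((char *)next) + 1);
            /* cur stays where it is: the following bytes have moved under it */
        } else {
            cur++;
        }
    }
}
```

For example, applying strip_combining_marks to the decomposed bytes "te\xCC\x81st" leaves "test" in the buffer, while a precomposed "t\xC3\xA9st" is left unchanged.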

#9 Updated by Matthieu Decorde about 1 year ago

  • % Done changed from 20 to 70

Linux, Windows and Mac OS X builds done.

#10 Updated by Matthieu Decorde about 1 month ago

  • Target version changed from TXM 0.8.2 to TXM 0.8.3
