Bug #917: RCP: 0.7.6beta, Concordance alphabetical sort - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #917

RCP: 0.7.6beta, Concordance alphabetical sort

Added by Serge Heiden more than 11 years ago. Updated about 11 years ago.

Status: New

Start date: 07/07/2014

Priority: Normal

Due date:

Assignee:

% Done:

Category: Toolbox

Spent time:

Target version: TXM 0.X.X

Description

Issue

In an Ancient Greek corpus, the alphabetical sort of the right context does not follow the [ancient] Greek collation rules defined by the Unicode Consortium for that writing system.

The output of sorting the right context of a Concordance of [word="πυρετὸς"] with left context size set to 0 and right context size set to 1 is currently (selected lines):

    Epid_V        πυρετὸς    αὖθις
    Epid_V        πυρετὸς    βληχρός
    Epid_V        πυρετὸς    δὲ
    Epid_V        πυρετὸς    εἶχε
    Epid_V        πυρετὸς    εἶχεν
    Epid_V        πυρετὸς    εἶχεν
    Epid_V        πυρετὸς    εἶχεν
    Epid_V        πυρετὸς    ξυνεχὴς
    Epid_V        πυρετὸς    οὐ
    Epid_V        πυρετὸς    οὐκ
    Epid_V        πυρετὸς    παρείπετο
    Epid_V        πυρετὸς    ἐπέβαλε
    Epid_V        πυρετὸς    ἐπέλαβε
    Epid_V        πυρετὸς    ἐπέλαβε
    Epid_V        πυρετὸς    ἐπέλαβεν
    Epid_V        πυρετὸς    ἐπεγίνετο
    Epid_V        πυρετὸς    ἐπεῖχε
    Epid_V        πυρετὸς    ἔλαβε

However, these lines:

    Epid_V        πυρετὸς    εἶχε
    Epid_V        πυρετὸς    εἶχεν
    Epid_V        πυρετὸς    εἶχεν
    Epid_V        πυρετὸς    εἶχεν

should be immediately followed by these lines:

    Epid_V        πυρετὸς    ἐπέβαλε
    Epid_V        πυρετὸς    ἐπέλαβε
    Epid_V        πυρετὸς    ἐπέλαβε
    Epid_V        πυρετὸς    ἐπέλαβεν
    Epid_V        πυρετὸς    ἐπεγίνετο
    Epid_V        πυρετὸς    ἐπεῖχε
    Epid_V        πυρετὸς    ἔλαβε

Origin

Setting the 'lang' property of the corpus to 'grc' (Ancient Greek) or to 'el' (Modern Greek) in the 'import.xml' file of the binary corpus does not change the behavior of the Java collation rules in TXM.

Solution

Use the "el" language code.

Status

TXM currently sorts Concordance contexts by word property strings, so we need to check how the Java collation system behaves for the 'grc' and 'el' languages.

The following word list should be correctly sorted:

αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε
ξυνεχὴς
οὐ
οὐκ
παρείπετο

Testing macro
We created the CollatorTesterMacro.groovy macro to test the different options of the Collator Java class with a list of words: http://sourceforge.net/projects/txm/files/software/TXM%20macros/CollatorTesterMacro.groovy/download

Its parameters are:

str = a string containing the words to sort, separated by space characters
locale = a language code; see http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
strength = the collation strength (the exact assignment of strengths is locale-dependent. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical):
- 0 = PRIMARY (a common example is for different base letters ("a" vs "b") to be considered a PRIMARY difference)
- 1 = SECONDARY (a common example is for different accented forms of the same base letter ("a" vs "ä") to be considered a SECONDARY difference)
- 2 = TERTIARY (a common example is for case differences ("a" vs "A") to be considered a TERTIARY difference)
- ? = IDENTICAL (in java.text.Collator the IDENTICAL constant has the value 3. A common example is for control characters ("\u0001" vs "\u0002") to be considered equal at the PRIMARY, SECONDARY, and TERTIARY levels but different at the IDENTICAL level. Additionally, differences between pre-composed accents such as "\u00C0" (A-grave) and combining accents such as "A\u0300" (A, combining-grave) will be considered significant at the IDENTICAL level if decomposition is set to NO_DECOMPOSITION)
decomposition mode:
- 0 = NO_DECOMPOSITION (accented characters will not be decomposed for collation)
- 1 = CANONICAL_DECOMPOSITION (characters that are canonical variants according to the Unicode standard will be decomposed for collation. This should be used to get correct collation of accented characters)
- 2 = FULL_DECOMPOSITION (both Unicode canonical variants and Unicode compatibility variants will be decomposed for collation. This causes not only accented characters to be collated, but also characters that have special formats to be collated with their nominal form. For example, the half-width and full-width ASCII and Katakana characters are then collated together)
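
As an illustration of how these three parameters map onto the java.text.Collator API, here is a minimal sketch. It is not the source of CollatorTesterMacro.groovy: the class name CollatorSketch and the method sortWords are hypothetical, and the sketch assumes that the numeric strength and decomposition values are passed unchanged to setStrength and setDecomposition.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorSketch {

    /**
     * Hypothetical helper (not part of TXM): sorts the space-separated words of
     * 'str' with a Collator built from the three parameters described above.
     */
    static List<String> sortWords(String str, String locale, int strength, int decomposition) {
        // An unsupported language tag falls back to the JVM's default collation rules.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(locale));
        collator.setStrength(strength);           // 0 = PRIMARY ... 3 = IDENTICAL
        collator.setDecomposition(decomposition); // 0 = NO, 1 = CANONICAL, 2 = FULL
        List<String> words = new ArrayList<>(Arrays.asList(str.trim().split("\\s+")));
        words.sort(collator);                     // Collator implements Comparator<Object>
        return words;
    }

    public static void main(String[] args) {
        String words = "οὐκ ἔλαβε δὲ εἶχε ἐπέβαλε ξυνεχὴς αὖθις";
        // Same word list, two decomposition modes, to compare the resulting orders.
        System.out.println(sortWords(words, "el", Collator.TERTIARY, Collator.FULL_DECOMPOSITION));
        System.out.println(sortWords(words, "el", Collator.TERTIARY, Collator.NO_DECOMPOSITION));
    }
}
```

Running it with different locale, strength and decomposition values gives a quick way to compare orderings outside TXM; the exact order obtained depends on the collation rules shipped with the Java runtime in use.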

Results: (locale, strength, decomposition)

'el', '2', '2' (any value of strength or decomposition gives good sort)
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἔλαβε
ἐπέβαλε
ἐπεγίνετο
ἐπεῖχε
ἐπέλαβε
ἐπέλαβεν
ξυνεχὴς
οὐ
οὐκ
παρείπετο

'fr', '2', '2' (good sort)
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἔλαβε
ἐπέβαλε
ἐπεγίνετο
ἐπεῖχε
ἐπέλαβε
ἐπέλαβεν
ξυνεχὴς
οὐ
οὐκ
παρείπετο

'fr', '2', '0' (bad sort when decomposition is set to '0')
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ξυνεχὴς
οὐ
οὐκ
παρείπετο
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε

Conclusion

'locale' and 'decomposition' are the most important parameters.

Recommendation

Use <locale>, 2, 2 (strength TERTIARY, decomposition FULL_DECOMPOSITION).

Evolution

A future evolution of the TXM Concordance sort would be to sort numerically by the integer codes of word property values rather than by their strings. CQP word property indexes are already sorted alphabetically at corpus import, so a concordance sort would only need an integer sort. For this, Concordances should be stored as integers and not as strings.
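
The idea can be sketched as follows, under stated assumptions: the integer codes are computed here in memory by sorting the distinct words once with a Collator, which stands in for the codes CQP would provide. No CQP or TXM API is used, and the names IntegerSortSketch, buildRanks, encode and compareCodes are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class IntegerSortSketch {

    /** Sorts the distinct words once and gives each one its position as integer code. */
    static Map<String, Integer> buildRanks(List<String> words, Collator collator) {
        List<String> distinct = new ArrayList<>(new LinkedHashSet<>(words));
        distinct.sort(collator);
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < distinct.size(); i++) {
            ranks.put(distinct.get(i), i);
        }
        return ranks;
    }

    /** Encodes a context (space-separated words, all present in 'ranks') as integer codes. */
    static int[] encode(String context, Map<String, Integer> ranks) {
        String[] tokens = context.trim().split("\\s+");
        int[] codes = new int[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            codes[i] = ranks.get(tokens[i]);
        }
        return codes;
    }

    /** Lexicographic comparison of two code sequences; no string collation involved. */
    static int compareCodes(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("el"));
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.FULL_DECOMPOSITION);

        List<String> lexicon = Arrays.asList("οὐκ", "ἔλαβε", "δὲ", "εἶχε", "ἐπέβαλε", "αὖθις");
        Map<String, Integer> ranks = buildRanks(lexicon, collator);

        List<int[]> contexts = new ArrayList<>();
        contexts.add(encode("οὐκ ἔλαβε", ranks));
        contexts.add(encode("δὲ εἶχε", ranks));
        contexts.add(encode("ἐπέβαλε αὖθις", ranks));
        contexts.sort(IntegerSortSketch::compareCodes); // integer sort only

        for (int[] codes : contexts) {
            System.out.println(Arrays.toString(codes));
        }
    }
}
```

In this sketch the collation cost is paid once when the codes are built; each later concordance sort only compares integers. Whether the codes stored by CQP can be used directly in this way would need to be checked against the collation rules applied at import.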

History

#1 Updated by Serge Heiden more than 11 years ago

Description updated

#2 Updated by Serge Heiden more than 11 years ago

Description updated

#3 Updated by Serge Heiden more than 11 years ago

Description updated

#4 Updated by Serge Heiden more than 11 years ago

Description updated

#5 Updated by Serge Heiden more than 11 years ago

Description updated

#6 Updated by Matthieu Decorde more than 11 years ago

First, there is a bug that throws an IllegalArgumentException (or a NullPointerException, depending on the Java version) during the sort of the right context column.

Second, after fixing this bug, the sort seems correct when the "el" locale is set:

    mini        a    αὖθις a βληχρός a δὲ a εἶχε a εἶχεν a βληχρός a
    mini        a    βληχρός a δὲ a εἶχε a εἶχεν a βληχρός a ἐπέβαλε a
    mini        a    βληχρός a ἐπέβαλε a ἐπεῖχε a ἔλαβε a ξυνεχὴς a οὐ a
    mini        a    δὲ a εἶχε a εἶχεν a βληχρός a ἐπέβαλε a ἐπεῖχε a
    mini        a    εἶχε a εἶχεν a βληχρός a ἐπέβαλε a ἐπεῖχε a ἔλαβε a
    mini        a    εἶχεν a βληχρός a ἐπέβαλε a ἐπεῖχε a ἔλαβε a ξυνεχὴς a
    mini        a    ἔλαβε a ξυνεχὴς a οὐ a παρείπετο a ἐπέλαβε a οὐκ a
    mini        a    ἐπέβαλε a ἐπεῖχε a ἔλαβε a ξυνεχὴς a οὐ a παρείπετο a
    mini        a    ἐπεγίνετο           
    mini        a    ἐπεῖχε a ἔλαβε a ξυνεχὴς a οὐ a παρείπετο a ἐπέλαβε a
    mini        a    ἐπέλαβε a οὐκ a ἐπέλαβεν a ἐπεγίνετο     
    mini        a    ἐπέλαβεν a ἐπεγίνετο         
    mini        a    ξυνεχὴς a οὐ a παρείπετο a ἐπέλαβε a οὐκ a ἐπέλαβεν a
    mini        a    οὐ a παρείπετο a ἐπέλαβε a οὐκ a ἐπέλαβεν a ἐπεγίνετο

But the "grc" locale does not produce the right sort (the "ἐ" words are separated from the "ε" words).