Bug #917

Updated by Serge Heiden almost 5 years ago

*Issue* In an Ancient Greek corpus, the alphabetical sort of the right context does not follow the [Ancient] Greek collation rules defined by the Unicode Consortium for that writing system.

The output of sorting the right context of a Concordance of [word="πυρετὸς"] with left context size set to 0 and right context size set to 1 is currently (selected lines):
<pre>
Epid_V πυρετὸς αὖθις
Epid_V πυρετὸς βληχρός
Epid_V πυρετὸς δὲ
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς ξυνεχὴς
Epid_V πυρετὸς οὐ
Epid_V πυρετὸς οὐκ
Epid_V πυρετὸς παρείπετο
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>

However, these lines:
<pre>
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
</pre>
should be immediately followed by these lines:
<pre>
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>

*Origin* Setting the 'lang' property of the corpus to 'grc' (Ancient Greek) or to 'el' (Modern Greek) in the 'import.xml' of the binary corpus does not change the behavior of the Java collation rules in TXM.

*Solution* Currently no solution.

*Status* TXM currently sorts Concordance contexts by word property strings, so we need to check how the Java collation system handles the 'grc' and 'el' languages.
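
The "current" output listed above matches a plain comparison of Unicode code points: the precomposed polytonic letters (for example ἐ, U+1F10) sit in the Greek Extended block, after the whole basic Greek alphabet, so they land after π. The following is a minimal sketch of that effect, not TXM's actual sort code; the class and method names are hypothetical, and the same order could also come from a Collator used without decomposition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CodePointSortSketch {

    /** Sorts with the natural String order, i.e. by UTF-16 code unit values (hypothetical helper). */
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        // First letters: U+1F10 (epsilon with psili), U+03C0 (pi), U+03B5 (plain epsilon)
        List<String> words = Arrays.asList("\u1F10πέβαλε", "παρείπετο", "εἶχε");
        // The U+1F10 word comes last because 0x1F10 > 0x03C0 > 0x03B5
        System.out.println(sortByCodeUnits(words));
    }
}
```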

The following word list should be correctly sorted:
<pre>
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε
ξυνεχὴς
οὐ
οὐκ
παρείπετο
</pre>

*Testing macro*
We created the CollatorTester macro to test the different options of the Java Collator class on a list of words. Its parameters are:
* str = a string of the words to sort separated by a space character
* locale = a language subtag such as 'el' or 'fr'; see http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
* strength (the exact assignment of strengths is locale-dependent. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical) :
** 0 = PRIMARY (a common example is for different base letters ("a" vs "b") to be considered a PRIMARY difference)
** 1 = SECONDARY (a common example is for different accented forms of the same base letter ("a" vs "ä") to be considered a SECONDARY difference)
** 2 = TERTIARY (a common example is for case differences ("a" vs "A") to be considered a TERTIARY difference)
** ? = IDENTICAL (a common example is for control characters ("\u0001" vs "\u0002") to be considered equal at the PRIMARY, SECONDARY, and TERTIARY levels but different at the IDENTICAL level. Additionally, differences between pre-composed accents such as "\u00C0" (A-grave) and combining accents such as "A\u0300" (A, combining-grave) will be considered significant at the IDENTICAL level if decomposition is set to NO_DECOMPOSITION)
* decomposition mode :
** 0 = NO_DECOMPOSITION (accented characters will not be decomposed for collation)
** 1 = CANONICAL_DECOMPOSITION (characters that are canonical variants according to the Unicode standard will be decomposed for collation. This should be used to get correct collation of accented characters)
** 2 = FULL_DECOMPOSITION (both Unicode canonical variants and Unicode compatibility variants will be decomposed for collation. This causes not only accented characters to be collated, but also characters that have special formats to be collated with their normal form. For example, the half-width and full-width ASCII and Katakana characters are then collated together)
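
The macro source is not reproduced here, so the following is only a sketch of what such a test might look like in plain Java, assuming the four parameters are passed straight to @java.text.Collator@. The class and method names are hypothetical. The integer values match the constants of @java.text.Collator@ (PRIMARY = 0, SECONDARY = 1, TERTIARY = 2, IDENTICAL = 3; NO_DECOMPOSITION = 0, CANONICAL_DECOMPOSITION = 1, FULL_DECOMPOSITION = 2); whether the macro itself accepts 3 for IDENTICAL is not stated above.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollatorTesterSketch {

    /**
     * Sorts the space-separated words of str with a Collator configured
     * from the given locale tag, strength and decomposition mode.
     */
    static List<String> sortWords(String str, String localeTag, int strength, int decomposition) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(localeTag));
        collator.setStrength(strength);           // Collator.PRIMARY .. Collator.IDENTICAL
        collator.setDecomposition(decomposition); // Collator.NO_DECOMPOSITION .. Collator.FULL_DECOMPOSITION
        List<String> words = new ArrayList<>(Arrays.asList(str.trim().split("\\s+")));
        words.sort(collator);                     // Collator implements Comparator<Object>
        return words;
    }

    public static void main(String[] args) {
        String words = "αὖθις βληχρός δὲ εἶχε εἶχεν ἐπέβαλε ἐπέλαβε ἐπέλαβεν "
                + "ἐπεγίνετο ἐπεῖχε ἔλαβε ξυνεχὴς οὐ οὐκ παρείπετο";
        // Compare the two extreme decomposition modes for the same locale and strength
        for (int decomposition : new int[] {Collator.NO_DECOMPOSITION, Collator.FULL_DECOMPOSITION}) {
            System.out.println("'el', " + Collator.TERTIARY + ", " + decomposition);
            sortWords(words, "el", Collator.TERTIARY, decomposition).forEach(System.out::println);
        }
    }
}
```

The exact order printed depends on the collation rules shipped with the JVM in use, so no expected output is given here.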

Results (locale, strength, decomposition):
<pre> <code>
'el', '2', '2' (any value of strength or decomposition gives a good sort)
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἔλαβε
ἐπέβαλε
ἐπεγίνετο
ἐπεῖχε
ἐπέλαβε
ἐπέλαβεν
ξυνεχὴς
οὐ
οὐκ
παρείπετο
</code> </pre>

<pre> <code>
'fr', '2', '2' (good sort)
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἔλαβε
ἐπέβαλε
ἐπεγίνετο
ἐπεῖχε
ἐπέλαβε
ἐπέλαβεν
ξυνεχὴς
οὐ
οὐκ
παρείπετο
</code> </pre>

<pre> <code>
'fr', '2', '0' (bad sort when decomposition is '0')
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ξυνεχὴς
οὐ
οὐκ
παρείπετο
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε
</code> </pre>

*Conclusion*

'locale' and 'decomposition' are the most important parameters.
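
A plausible reading of why 'decomposition' matters so much here: polytonic letters are usually stored as single precomposed code points, and only after decomposition does the collator see the base letter ε followed by combining diacritics. The sketch below makes that decomposition visible with @java.text.Normalizer@, which is used purely for illustration and is not claimed to be what the macro or the Collator does internally; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class DecompositionSketch {

    /** Formats the code points of s as "U+XXXX U+XXXX ...". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    /** Canonical decomposition (NFD) of s. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String epsilonPsili = "\u1F10"; // first letter of ἐπέβαλε when precomposed
        System.out.println(codePoints(epsilonPsili));            // prints U+1F10
        System.out.println(codePoints(decompose(epsilonPsili))); // prints U+03B5 U+0313
    }
}
```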

*Recommendation*

Use (<locale>, 2, 2), that is, TERTIARY strength with FULL_DECOMPOSITION.

*Evolution*

A future evolution of the TXM Concordance sort would be to sort numerically on the integer codes of word property values rather than on their strings. CQP word property indexes are already sorted alphabetically at corpus import, so a concordance sort would only need an integer sort. For this, Concordances should be stored as integers and not as strings.
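
The idea can be sketched as follows, assuming a lexicon that is collated once (standing in for the CQP property index built at import) and contexts that are then sorted on integer codes only. This is an illustration, not TXM or CQP code: the class, the methods and the in-memory lexicon are all hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class IntegerSortSketch {

    /** Distinct words in collation order; the index of a word is its integer code. */
    static List<String> buildLexicon(Collection<String> words, Collator collator) {
        List<String> lexicon = new ArrayList<>(new HashSet<>(words));
        lexicon.sort((a, b) -> {
            int c = collator.compare(a, b);
            return c != 0 ? c : a.compareTo(b); // tie-break so the codes are deterministic
        });
        return lexicon;
    }

    /** Sorts context words by comparing integer codes only; no string comparison at sort time. */
    static List<String> sortByCode(List<String> contextWords, List<String> lexicon) {
        Map<String, Integer> codeOf = new HashMap<>();
        for (int i = 0; i < lexicon.size(); i++) {
            codeOf.put(lexicon.get(i), i);
        }
        // Every context word is assumed to be present in the lexicon
        int[] codes = contextWords.stream().mapToInt(w -> codeOf.get(w)).toArray();
        Arrays.sort(codes);
        List<String> result = new ArrayList<>();
        for (int code : codes) {
            result.add(lexicon.get(code));
        }
        return result;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("el"));
        collator.setStrength(Collator.TERTIARY);
        collator.setDecomposition(Collator.FULL_DECOMPOSITION);
        List<String> contexts = Arrays.asList("οὐκ", "εἶχεν", "ἐπέλαβε", "δὲ", "εἶχεν", "ἔλαβε");
        List<String> lexicon = buildLexicon(contexts, collator);
        System.out.println(sortByCode(contexts, lexicon));
    }
}
```

The collation cost is paid once when the lexicon is built; each later concordance sort is a plain integer sort.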

*See also*
* "TLG Technical Note 002: Greek Sort Order":https://www.tlg.uci.edu/help/Doc002.html
* "Unicode collation FAQ":http://www.unicode.org/faq/collation.html
* "Unicode collation algorithm Demo, in Java":http://www.unicode.org/reports/tr10/Sample
* "Oracle Java Comparing Strings - i18n/text/collationintro":http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html
* "ICU - International Components for Unicode":http://site.icu-project.org
* "Wikipedia Greek diacritics":http://en.wikipedia.org/wiki/Greek_diacritics
* "Wikipedia Unicode collation algorithm":http://en.wikipedia.org/wiki/Unicode_collation_algorithm
