Bug #917

Updated by Matthieu Decorde over 5 years ago

*Issue* In an old Greek corpus, alphabetical sort of right context doesn't follow [old] Greek collation rules defined by the Unicode consortium for that writing system.

The output of sorting the right context of a Concordance of [word="πυρετὸς"] with left context size set to 0 and right context size set to 1 is currently (selected lines):
<pre>
Epid_V πυρετὸς αὖθις
Epid_V πυρετὸς βληχρός
Epid_V πυρετὸς δὲ
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς ξυνεχὴς
Epid_V πυρετὸς οὐ
Epid_V πυρετὸς οὐκ
Epid_V πυρετὸς παρείπετο
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>

But:
<pre>
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
</pre>
lines, should be immediately followed by the following lines:
<pre>
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>

*Origin* The 'lang' property of the corpus set to 'grc' (old Greek) or to 'el' (modern Greek) in 'import.xml' binary doesn't change the Java collation rules behavior in TXM.

*Solution* Currently no solution.

*Status* Currently TXM sorts Concordance contexts by word property strings, so we need to check Java collation system for 'grc' or 'el' languages.

The following word list should be correctly sorted:
<pre>
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε
ξυνεχὴς
οὐ
οὐκ
παρείπετο
</pre>

*Testing * Testing macro*
We created the CollatorTester macro that allows to test the different options of the Collator Java class with a list of words. The options are the locale, the strengh and the decomposition mode.

*Evolution*

An evolution of TXM Concordance sort will be to sort numerically by words property values integer codes (and not strings). CQP word properties indexes are already alphabetically sorted at corpus import, so a concordance sort should only use integer sort. For this, Concordances should be stored as integers and not as strings.

*See also*
* "TLG Technical Note 002: Greek Sort Order":https://www.tlg.uci.edu/help/Doc002.html
* "Unicode collation FAQ":http://www.unicode.org/faq/collation.html
* "Unicode collation algorithm Demo, in Java":http://www.unicode.org/reports/tr10/Sample
* "Oracle Java Comparing Strings - i18n/text/collationintro":http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html
* "ICU - International Components for Unicode":http://site.icu-project.org
* "Wikipedia Greek diacritics":http://en.wikipedia.org/wiki/Greek_diacritics
* "Wikipedia Unicode collation algorithm":http://en.wikipedia.org/wiki/Unicode_collation_algorithm

Back