Bug #917

Updated by Serge Heiden about 5 years ago

*Issue* In an old Greek corpus, the alphabetical sort of the right context doesn't follow the [old] Greek collation rules defined by the Unicode Consortium for that writing system.

Sorting on the right context of a Concordance of [word="πυρετὸς"], with the left context set to 0 and the right context set to 1, currently outputs (selected lines):
<pre>
Epid_V πυρετὸς αὖθις
Epid_V πυρετὸς βληχρός
Epid_V πυρετὸς δὲ
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς ξυνεχὴς
Epid_V πυρετὸς οὐ
Epid_V πυρετὸς οὐκ
Epid_V πυρετὸς παρείπετο
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>

But:
<pre>
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
</pre>
lines should be immediately followed by the:
<pre>
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>
lines.


*Origin* Setting the 'lang' property of the corpus to 'grc' (old Greek) or to 'el' (modern Greek) in the 'import.xml' of the binary corpus doesn't change the behavior of the Java collation rules in TXM.
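The observed order appears consistent with a plain code-unit comparison of Java strings, in which precomposed polytonic letters such as ἐ (U+1F10) sort after every basic Greek letter (U+03B1 to U+03C9). The sketch below only illustrates that effect. It is not TXM's sort code: the class and method names are hypothetical, and the words are assumed to be NFC-normalized (they are written as Unicode escapes so the code points are explicit).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class, not part of TXM.
class CodePointOrderDemo {
    // Words from the ticket, spelled as Unicode escapes (assumed NFC).
    static final String EICHE = "\u03B5\u1F36\u03C7\u03B5";                                   // eiche
    static final String XYNECHES = "\u03BE\u03C5\u03BD\u03B5\u03C7\u1F74\u03C2";              // xyneches
    static final String PAREIPETO = "\u03C0\u03B1\u03C1\u03B5\u03AF\u03C0\u03B5\u03C4\u03BF"; // pareipeto
    static final String EPEBALE = "\u1F10\u03C0\u03AD\u03B2\u03B1\u03BB\u03B5";               // epebale
    static final String ELABE = "\u1F14\u03BB\u03B1\u03B2\u03B5";                             // elabe

    // Sorts a copy of the list with String's natural order (UTF-16 code units).
    static List<String> sortByCodeUnits(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    public static void main(String[] args) {
        List<String> input = List.of(ELABE, EPEBALE, PAREIPETO, XYNECHES, EICHE);
        // The words starting with U+1F10 or U+1F14 end up after the pi-initial word.
        System.out.println(sortByCodeUnits(input));
    }
}
```

With this comparison, ἐπέβαλε and ἔλαβε land after παρείπετο, as in the concordance output quoted above.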

*Solution* Currently no solution.

*Status* Currently TXM sorts Concordance contexts by word property strings, so we need to check the Java collation system for the 'grc' or 'el' languages.
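A possible starting point for that check, using only @java.text.Collator@ from the JDK, is sketched below. It reports whether a language appears in @Collator.getAvailableLocales()@ and sorts a few words with canonical decomposition switched on, so that precomposed polytonic letters are decomposed before comparison. The class and method names are hypothetical, the resulting order depends on the JDK's collation data and is deliberately not asserted here, and to my understanding @Collator.getInstance@ falls back to more general rules when a locale is not listed.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical probe class, not part of TXM.
class CollatorProbe {
    // True if the JDK lists at least one collation locale for this language code.
    static boolean isListed(String language) {
        return Arrays.stream(Collator.getAvailableLocales())
                     .anyMatch(l -> l.getLanguage().equals(language));
    }

    // Builds a collator for the tag, with canonical decomposition enabled.
    static Collator collatorFor(String languageTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        collator.setStrength(Collator.TERTIARY);
        return collator;
    }

    // Sorts a copy of the list with the given collator.
    static List<String> sortWith(Collator collator, List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(collator);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of(
            "\u1F14\u03BB\u03B1\u03B2\u03B5",                             // elabe
            "\u03C0\u03B1\u03C1\u03B5\u03AF\u03C0\u03B5\u03C4\u03BF",     // pareipeto
            "\u03B5\u1F36\u03C7\u03B5\u03BD",                             // eichen
            "\u1F10\u03C0\u03AD\u03B2\u03B1\u03BB\u03B5");                // epebale
        for (String tag : new String[] {"grc", "el"}) {
            System.out.println(tag + " listed: " + isListed(tag));
            System.out.println(tag + " order:  " + sortWith(collatorFor(tag), words));
        }
    }
}
```

Comparing the printed order for both tags against the expected Greek order shows whether the JDK collator alone is sufficient or whether another collation library is needed.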

The following word list should be correctly sorted:
<pre>
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε
ξυνεχὴς
οὐ
οὐκ
παρείπετο
</pre>
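As a rough, JDK-only way to exercise this list, the sketch below compares words by their base letters first: each word is decomposed (NFD), combining marks are dropped, and the original string is used only as a tie-breaker. This is an approximation introduced here for illustration, not the Unicode Collation Algorithm and not TXM code. The class and method names are hypothetical, and details such as final sigma are not handled.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: accent-insensitive ordering, NOT a full UCA implementation.
class BaseLetterSortSketch {
    // Decomposes the word (NFD) and removes all combining marks.
    static String baseLetters(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Compares by base letters, then by the raw string to keep the order total.
    static int compare(String a, String b) {
        int primary = baseLetters(a).compareTo(baseLetters(b));
        return primary != 0 ? primary : a.compareTo(b);
    }

    // Sorts a copy of the list with the comparison above.
    static List<String> sort(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(BaseLetterSortSketch::compare);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = List.of(
            "\u03B5\u1F36\u03C7\u03B5\u03BD",                           // eichen
            "\u03BE\u03C5\u03BD\u03B5\u03C7\u1F74\u03C2",               // xyneches
            "\u03BF\u1F50",                                             // ou
            "\u1F10\u03C0\u03AD\u03B2\u03B1\u03BB\u03B5",               // epebale
            "\u1F10\u03C0\u03B5\u03B3\u03AF\u03BD\u03B5\u03C4\u03BF",   // epeginetο
            "\u1F14\u03BB\u03B1\u03B2\u03B5");                          // elabe
        // The epsilon-initial words now group together before xi and omicron.
        System.out.println(sort(words));
    }
}
```

Under this approximation the ε-initial words (εἶχεν, ἔλαβε, ἐπέβαλε, ἐπεγίνετο) are grouped before ξυνεχὴς and οὐ, instead of being pushed to the end of the list.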

A future evolution of the TXM Concordance sort would be to sort numerically by word property value codes (and not by strings). CQP word property indexes are already alphabetically sorted at corpus import, so a concordance sort would only need an integer sort.
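The idea can be sketched as follows: collation is applied once, when the lexicon is built, to give each distinct word form an integer rank, and sorting concordance contexts then only compares integers. This is a minimal illustration under assumptions. It does not use the CQP or TXM APIs, all names are hypothetical, and @Arrays.compare@ requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, independent of the CQP and TXM APIs.
class RankCodeSortSketch {
    // Done once (for example at import): rank each distinct form by the chosen collation.
    static Map<String, Integer> buildRanks(Collection<String> forms, Comparator<String> collation) {
        List<String> distinct = new ArrayList<>(new LinkedHashSet<>(forms));
        distinct.sort(collation);
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < distinct.size(); i++) {
            ranks.put(distinct.get(i), i);
        }
        return ranks;
    }

    // Replaces each word of a context by its rank code.
    static int[] encode(List<String> context, Map<String, Integer> ranks) {
        return context.stream().mapToInt(ranks::get).toArray();
    }

    // Sorts encoded contexts lexicographically, using integer comparisons only.
    static List<int[]> sortContexts(List<int[]> contexts) {
        List<int[]> copy = new ArrayList<>(contexts);
        copy.sort((a, b) -> Arrays.compare(a, b));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Integer> ranks = buildRanks(List.of("b", "a", "c", "a"), Comparator.naturalOrder());
        List<int[]> contexts = List.of(
            encode(List.of("c"), ranks),
            encode(List.of("a", "b"), ranks),
            encode(List.of("a"), ranks));
        for (int[] codes : sortContexts(contexts)) {
            System.out.println(Arrays.toString(codes));
        }
    }
}
```

With this design, the choice of collation matters only when the ranks are built; the concordance sort itself never compares strings.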

*See also*

* "Unicode collation FAQ":http://www.unicode.org/faq/collation.html
* "Unicode Collation Algorithm Demo in Java":http://www.unicode.org/reports/tr10/Sample
* "Oracle Java Comparing Strings - i18n/text/collationintro":http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html
* "ICU - International Components for Unicode":http://site.icu-project.org
* "Wikipedia Unicode collation algorithm":http://en.wikipedia.org/wiki/Unicode_collation_algorithm
