Bug #917: RCP: 0.7.6beta, Concordance alphabetical sort - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #917

Updated by Serge Heiden about 11 years ago

*Issue* In an Ancient Greek corpus, the alphabetical sort of the right context doesn't follow the [Ancient] Greek collation rules defined by the Unicode Consortium for that writing system.

The output of sorting the right context of a Concordance of [word="πυρετὸς"], with left context set to 0 and right context set to 1, is currently (selected lines):
<pre>
Epid_V πυρετὸς αὖθις
Epid_V πυρετὸς βληχρός
Epid_V πυρετὸς δὲ
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς ξυνεχὴς
Epid_V πυρετὸς οὐ
Epid_V πυρετὸς οὐκ
Epid_V πυρετὸς παρείπετο
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>

However, these lines:
<pre>
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
</pre>
should be immediately followed by these lines:
<pre>
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
</pre>
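
The observed order is consistent with a comparison by UTF-16 code unit, which is what Java's @String.compareTo@ performs: plain ε is U+03B5, while ἐ (U+1F10) and ἔ (U+1F14) belong to the Greek Extended block and therefore come after every basic Greek letter, including π (U+03C0). The sketch below reproduces this effect on the words δὲ, εἶχε, οὐ and ἔλαβε. It assumes the corpus text uses precomposed characters; the class and method names are illustrative and are not TXM code. The words are written as Unicode escapes so the example does not depend on source-file encoding.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: not TXM code.
class CodeUnitOrderDemo {

    // Sorts with String's natural order, i.e. by UTF-16 code unit values.
    static List<String> sortByCodeUnit(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(Comparator.naturalOrder());
        return sorted;
    }

    public static void main(String[] args) {
        String elabe = "\u1F14\u03BB\u03B1\u03B2\u03B5"; // elabe, starts with epsilon + psili + oxia
        String ou    = "\u03BF\u1F50";                   // ou
        String eiche = "\u03B5\u1F36\u03C7\u03B5";       // eiche, starts with a plain epsilon
        String de    = "\u03B4\u1F72";                   // de
        // Print each word with the code point of its first character.
        for (String w : sortByCodeUnit(List.of(elabe, ou, eiche, de))) {
            System.out.printf("%s (first char U+%04X)%n", w, (int) w.charAt(0));
        }
    }
}
```

With this ordering, ἔλαβε ends up after οὐ because its first character is U+1F14, which matches the position of the ἐ/ἔ lines in the listing above.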

*Origin* Setting the 'lang' property of the corpus to 'grc' (Ancient Greek) or to 'el' (Modern Greek) in the binary corpus 'import.xml' doesn't change the behavior of the Java collation rules in TXM.

*Solution* Currently no solution.

*Status* Currently TXM sorts Concordance contexts by word property strings, so we need to check how the Java collation system handles the 'grc' and 'el' languages.
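
One way to run that check is a small probe using the JDK's @java.text.Collator@. This is only a sketch: the class and method names are hypothetical, and whether the JDK ships suitable rules for polytonic Greek under 'el' or 'grc' is exactly what the probe is meant to find out, so no expected output is given. Canonical decomposition is enabled on the assumption that precomposed letters need to be split into base letter and combining marks for the rules to apply.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a probe for the JDK collator, not TXM code.
class GreekCollatorProbe {

    // Sorts with the JDK collator obtained for the given BCP 47 language tag.
    static List<String> sortWithCollator(List<String> words, String languageTag) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        // Decompose precomposed letters into base letter + combining marks before comparing.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // Set explicitly so the result does not depend on locale defaults;
        // at TERTIARY strength accent differences still act as tie-breakers.
        collator.setStrength(Collator.TERTIARY);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = List.of(
                "\u03BF\u1F50",                         // ou
                "\u1F14\u03BB\u03B1\u03B2\u03B5",       // elabe
                "\u1F10\u03C0\u03B5\u1FD6\u03C7\u03B5", // epeiche
                "\u03B5\u1F36\u03C7\u03B5",             // eiche
                "\u03B4\u1F72");                        // de
        // Compare what the JDK does for Modern Greek and for Ancient Greek.
        for (String tag : new String[] {"el", "grc"}) {
            System.out.println(tag + ": " + sortWithCollator(words, tag));
        }
    }
}
```

If the JDK result is not satisfactory for either tag, the ICU collator listed under See also would be the next candidate to evaluate.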

The following word list should be correctly sorted:
<pre>
αὖθις
βληχρός
δὲ
εἶχε
εἶχεν
ἐπέβαλε
ἐπέλαβε
ἐπέλαβεν
ἐπεγίνετο
ἐπεῖχε
ἔλαβε
ξυνεχὴς
οὐ
οὐκ
παρείπετο
</pre>
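
As a deterministic reference for checking how such a list groups, the sketch below sorts on base letters only: each word is decomposed (NFD) and its combining marks are stripped before comparison, with the original string as tie-breaker. This is an accent-insensitive approximation and not the Unicode Collation Algorithm; it ignores case and leaves the final-sigma distinction to code-point order. The class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: an accent-insensitive approximation, not the Unicode Collation Algorithm.
class BaseLetterSort {

    // Canonical decomposition (NFD), then removal of all combining marks,
    // so that U+1F10 and U+1F14 both reduce to a plain epsilon (U+03B5).
    static String baseLetters(String word) {
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Primary key: base letters. Tie-breaker: the original string, to get a total order.
    static List<String> sortByBaseLetters(List<String> words) {
        Comparator<String> byBaseLetters = Comparator.comparing(BaseLetterSort::baseLetters);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(byBaseLetters.thenComparing(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> words = List.of(
                "\u03BF\u1F50",                         // ou
                "\u1F14\u03BB\u03B1\u03B2\u03B5",       // elabe
                "\u1F10\u03C0\u03B5\u1FD6\u03C7\u03B5", // epeiche
                "\u03B5\u1F36\u03C7\u03B5",             // eiche
                "\u03B4\u1F72");                        // de
        sortByBaseLetters(words).forEach(System.out::println);
    }
}
```

Applied to the list above, this key places all the ε-initial words between δὲ and ξυνεχὴς, as the issue requests. Within that group the order follows base letters, so ἔλαβε comes before ἐπέβαλε, which differs from the order in which the words are listed above.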

A future evolution of the TXM Concordance sort would be to sort numerically by the integer codes of word property values (and not by their strings). CQP word property indexes are already sorted alphabetically at corpus import, so a concordance sort should only need an integer sort. For this, Concordances should be stored as integers and not as strings.
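
The idea can be sketched as follows, assuming a lexicon that is already in the desired collation order so that each value's position can serve as its integer code. The @Line@ type, @codesFor@ and @sortByCode@ are hypothetical names for illustration; they are not part of the TXM or CQP API, and how CQP actually exposes its sorted index is not covered here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical types, not the TXM or CQP API.
class IntegerKeySortSketch {

    // A concordance line reduced to its display text and the integer code of its sort word.
    static final class Line {
        final String text;
        final int sortCode;

        Line(String text, int sortCode) {
            this.text = text;
            this.sortCode = sortCode;
        }
    }

    // Assumption: the lexicon is already in collation order, so index == rank.
    static Map<String, Integer> codesFor(List<String> sortedLexicon) {
        Map<String, Integer> codes = new HashMap<>();
        for (int rank = 0; rank < sortedLexicon.size(); rank++) {
            codes.put(sortedLexicon.get(rank), rank);
        }
        return codes;
    }

    // Sorting lines only compares integers; no string collation happens here.
    static List<Line> sortByCode(List<Line> lines) {
        List<Line> sorted = new ArrayList<>(lines);
        sorted.sort(Comparator.comparingInt(line -> line.sortCode));
        return sorted;
    }

    public static void main(String[] args) {
        // Lexicon given in the order assumed to be produced at corpus import.
        List<String> lexicon = List.of(
                "\u03B4\u1F72",                   // de
                "\u03B5\u1F36\u03C7\u03B5",       // eiche
                "\u1F14\u03BB\u03B1\u03B2\u03B5", // elabe
                "\u03BF\u1F50");                  // ou
        Map<String, Integer> codes = codesFor(lexicon);
        List<Line> lines = new ArrayList<>();
        for (String w : List.of(lexicon.get(3), lexicon.get(2), lexicon.get(0), lexicon.get(1))) {
            lines.add(new Line(w, codes.get(w)));
        }
        sortByCode(lines).forEach(line -> System.out.println(line.sortCode + " " + line.text));
    }
}
```

The collation cost is then paid once, when the lexicon is ordered, and every later concordance sort reduces to integer comparisons.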

*See also*
* "TLG Technical Note 002: Greek Sort Order":https://www.tlg.uci.edu/help/Doc002.html
* "Unicode collation FAQ":http://www.unicode.org/faq/collation.html
* "Unicode collation algorithm Demo, in Java":http://www.unicode.org/reports/tr10/Sample
* "Oracle Java Comparing Strings - i18n/text/collationintro":http://docs.oracle.com/javase/tutorial/i18n/text/collationintro.html
* "ICU - International Components for Unicode":http://site.icu-project.org
* "Wikipedia Greek diacritics":http://en.wikipedia.org/wiki/Greek_diacritics
* "Wikipedia Unicode collation algorithm":http://en.wikipedia.org/wiki/Unicode_collation_algorithm
