Bug #917
RCP: 0.7.6beta, Concordance alphabetical sort
Status: | New | Start date: | 07/07/2014 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % Done: | 0% |
Category: | Toolbox | Spent time: | - |
Target version: | TXM X.X | | |
Description
Issue
In an Ancient Greek corpus, the alphabetical sort of the right context doesn't follow the (Ancient) Greek collation rules defined by the Unicode Consortium for that writing system.
The output of sorting the right context of a Concordance of [word="πυρετὸς"] with left context size set to 0 and right context size set to 1 is currently (selected lines):
Epid_V πυρετὸς αὖθις
Epid_V πυρετὸς βληχρός
Epid_V πυρετὸς δὲ
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς ξυνεχὴς
Epid_V πυρετὸς οὐ
Epid_V πυρετὸς οὐκ
Epid_V πυρετὸς παρείπετο
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
But the lines:
Epid_V πυρετὸς εἶχε
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
Epid_V πυρετὸς εἶχεν
should be immediately followed by these lines:
Epid_V πυρετὸς ἐπέβαλε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβε
Epid_V πυρετὸς ἐπέλαβεν
Epid_V πυρετὸς ἐπεγίνετο
Epid_V πυρετὸς ἐπεῖχε
Epid_V πυρετὸς ἔλαβε
Origin
Setting the 'lang' property of the corpus to 'grc' (Ancient Greek) or to 'el' (Modern Greek) in the 'import.xml' file of the binary corpus doesn't change the behavior of the Java collation rules in TXM.
Solution
Use the "el" language code.
Status
TXM currently sorts Concordance contexts by comparing word property strings, so we need to check how the Java collation system behaves for the 'grc' and 'el' languages.
The following word list should be correctly sorted:
αὖθις βληχρός δὲ εἶχε εἶχεν ἐπέβαλε ἐπέλαβε ἐπέλαβεν ἐπεγίνετο ἐπεῖχε ἔλαβε ξυνεχὴς οὐ οὐκ παρείπετο
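The order currently observed is consistent with a comparison that treats each precomposed character as an opaque code point: ἐ is U+1F10 in the Greek Extended block, well above ε (U+03B5) and π (U+03C0). The Java sketch below illustrates this with a plain String sort. It assumes the text is in NFC (precomposed) form, and it is only an illustration, not TXM's actual sort code; the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CodePointOrderSketch {

    /** Sorts space-separated words by plain String order (UTF-16 code units), without collation rules. */
    static List<String> sortByCodeUnits(String str) {
        // Assumption: corpus text is precomposed (NFC). Normalizing here makes the
        // sketch behave the same whatever form the input string is in.
        String nfc = Normalizer.normalize(str, Normalizer.Form.NFC);
        List<String> words = new ArrayList<>(Arrays.asList(nfc.trim().split("\\s+")));
        words.sort(Comparator.naturalOrder());
        return words;
    }

    /** Formats the first code point of a word (after NFC), e.g. "U+1F10". */
    static String firstCodePoint(String word) {
        String nfc = Normalizer.normalize(word, Normalizer.Form.NFC);
        return String.format("U+%04X", nfc.codePointAt(0));
    }

    public static void main(String[] args) {
        String str = "αὖθις βληχρός δὲ εἶχε εἶχεν ἐπέβαλε ἐπέλαβε ἐπέλαβεν "
                + "ἐπεγίνετο ἐπεῖχε ἔλαβε ξυνεχὴς οὐ οὐκ παρείπετο";
        // Print each word with the code point of its first character.
        for (String word : sortByCodeUnits(str)) {
            System.out.println(firstCodePoint(word) + " " + word);
        }
    }
}
```

Printing the first code point next to each word makes it easy to see which words start in the Greek Extended block and therefore land after the basic Greek letters.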
Testing macro
We created the CollatorTesterMacro.groovy macro to test the different options of the Collator Java class with a list of words: http://sourceforge.net/projects/txm/files/software/TXM%20macros/CollatorTesterMacro.groovy/download
- str = a string of the words to sort, separated by space characters
- locale: see http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
- strength (the exact assignment of strengths is locale dependent. For example, in Czech, "e" and "f" are considered primary differences, while "e" and "ě" are secondary differences, "e" and "E" are tertiary differences and "e" and "e" are identical):
  - 0 = PRIMARY (a common example is for different base letters ("a" vs "b") to be considered a PRIMARY difference)
  - 1 = SECONDARY (a common example is for different accented forms of the same base letter ("a" vs "ä") to be considered a SECONDARY difference)
  - 2 = TERTIARY (a common example is for case differences ("a" vs "A") to be considered a TERTIARY difference)
  - 3 = IDENTICAL (a common example is for control characters ("\u0001" vs "\u0002") to be considered equal at the PRIMARY, SECONDARY, and TERTIARY levels but different at the IDENTICAL level. Additionally, differences between pre-composed accents such as "\u00C0" (A-grave) and combining accents such as "A\u0300" (A, combining-grave) will be considered significant at the IDENTICAL level if decomposition is set to NO_DECOMPOSITION)
- decomposition mode:
  - 0 = NO_DECOMPOSITION (accented characters will not be decomposed for collation)
  - 1 = CANONICAL_DECOMPOSITION (characters that are canonical variants according to the Unicode standard will be decomposed for collation. This should be used to get correct collation of accented characters)
  - 2 = FULL_DECOMPOSITION (both Unicode canonical variants and Unicode compatibility variants will be decomposed for collation. This causes not only accented characters to be collated, but also characters that have special formats to be collated with their normal form. For example, the half-width and full-width ASCII and Katakana characters are then collated together)
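The parameters above map directly onto the standard java.text.Collator API. The following Java sketch shows one way such a test could be written; it is not the code of CollatorTesterMacro.groovy, and the class and method names are hypothetical. The resulting order may depend on the collation data shipped with the Java runtime, so no expected output is shown.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollatorSortSketch {

    /**
     * Sorts space-separated words with a Collator configured from the three parameters.
     * strength: Collator.PRIMARY .. Collator.IDENTICAL
     * decomposition: Collator.NO_DECOMPOSITION, CANONICAL_DECOMPOSITION or FULL_DECOMPOSITION
     */
    static List<String> sortWords(String str, String languageTag, int strength, int decomposition) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        collator.setStrength(strength);           // throws IllegalArgumentException on an invalid value
        collator.setDecomposition(decomposition); // same for an invalid decomposition mode
        List<String> words = new ArrayList<>(Arrays.asList(str.trim().split("\\s+")));
        words.sort(collator);                     // Collator is a Comparator<Object>
        return words;
    }

    public static void main(String[] args) {
        String str = "αὖθις βληχρός δὲ εἶχε εἶχεν ἐπέβαλε ἐπέλαβε ἐπέλαβεν "
                + "ἐπεγίνετο ἐπεῖχε ἔλαβε ξυνεχὴς οὐ οὐκ παρείπετο";
        // Compare a few (locale, strength, decomposition) combinations side by side.
        System.out.println(sortWords(str, "el", Collator.TERTIARY, Collator.FULL_DECOMPOSITION));
        System.out.println(sortWords(str, "fr", Collator.TERTIARY, Collator.FULL_DECOMPOSITION));
        System.out.println(sortWords(str, "fr", Collator.TERTIARY, Collator.NO_DECOMPOSITION));
    }
}
```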
Results: (locale, strength, decomposition)
'el', '2', '2' (any value of strength or decomposition gives a good sort)
αὖθις βληχρός δὲ εἶχε εἶχεν ἔλαβε ἐπέβαλε ἐπεγίνετο ἐπεῖχε ἐπέλαβε ἐπέλαβεν ξυνεχὴς οὐ οὐκ παρείπετο
'fr', '2', '2' (good sort)
αὖθις βληχρός δὲ εἶχε εἶχεν ἔλαβε ἐπέβαλε ἐπεγίνετο ἐπεῖχε ἐπέλαβε ἐπέλαβεν ξυνεχὴς οὐ οὐκ παρείπετο
'fr', '2', '0' (bad sort when decomposition is set to '0')
αὖθις βληχρός δὲ εἶχε εἶχεν ξυνεχὴς οὐ οὐκ παρείπετο ἐπέβαλε ἐπέλαβε ἐπέλαβεν ἐπεγίνετο ἐπεῖχε ἔλαβε
Conclusion
'locale' and 'decomposition' are the most important parameters.
Recommendation
Use <locale>, 2, 2 (that is, strength TERTIARY and decomposition FULL_DECOMPOSITION)
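A minimal Java sketch of how this recommendation might be applied, combined with the "el" fallback for 'grc' corpora given under Solution. The helper names are hypothetical and this is not TXM code; it only uses the standard java.text.Collator and java.util.Locale classes.

```java
import java.text.Collator;
import java.util.Locale;

class CollationLocaleSketch {

    /** Maps a corpus 'lang' value to the locale used for collation. */
    static Locale collationLocale(String corpusLang) {
        if (corpusLang == null || corpusLang.isEmpty()) {
            return Locale.ROOT; // assumption: no language set, use the root locale
        }
        if ("grc".equals(corpusLang)) {
            // Per the Solution above: use the "el" language code for Ancient Greek corpora.
            return Locale.forLanguageTag("el");
        }
        return Locale.forLanguageTag(corpusLang);
    }

    /** Builds a collator with the recommended settings: <locale>, 2, 2. */
    static Collator recommendedCollator(String corpusLang) {
        Collator collator = Collator.getInstance(collationLocale(corpusLang));
        collator.setStrength(Collator.TERTIARY);                // 2
        collator.setDecomposition(Collator.FULL_DECOMPOSITION); // 2
        return collator;
    }

    public static void main(String[] args) {
        Collator collator = recommendedCollator("grc");
        // A negative value means the first word sorts before the second.
        System.out.println(collator.compare("ἐπέβαλε", "ξυνεχὴς"));
    }
}
```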
Evolution
A future evolution of the TXM Concordance sort would be to sort numerically by the integer codes of word property values rather than by their strings. CQP word property indexes are already sorted alphabetically at corpus import, so a concordance sort would only need an integer sort. For this, Concordances should be stored as integers and not as strings.
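The idea can be sketched in Java as follows. This is only an illustration under stated assumptions: the lexicon is collated once, each word gets an integer rank, and contexts are then compared as integer arrays. The rank map stands in for the CQP property index; it does not use the CQP or TXM APIs, and all class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RankSortSketch {

    /** Collates the lexicon once and assigns each word its rank (equal words share a rank). */
    static Map<String, Integer> buildRanks(Collection<String> lexicon, Collator collator) {
        List<String> sorted = new ArrayList<>(new LinkedHashSet<>(lexicon)); // drop duplicates
        sorted.sort(collator);
        Map<String, Integer> ranks = new HashMap<>();
        int rank = -1;
        String previous = null;
        for (String word : sorted) {
            if (previous == null || collator.compare(previous, word) != 0) {
                rank++;
            }
            ranks.put(word, rank);
            previous = word;
        }
        return ranks;
    }

    /** Encodes a context (space-separated words) as an array of ranks. */
    static int[] encode(String context, Map<String, Integer> ranks) {
        String[] words = context.trim().split("\\s+");
        int[] codes = new int[words.length];
        for (int i = 0; i < words.length; i++) {
            Integer rank = ranks.get(words[i]);
            if (rank == null) {
                throw new IllegalArgumentException("Word not in lexicon: " + words[i]);
            }
            codes[i] = rank;
        }
        return codes;
    }

    /** Sorts encoded contexts with integer comparisons only (Arrays.compare needs Java 9+). */
    static void sortContexts(List<int[]> contexts) {
        contexts.sort((a, b) -> Arrays.compare(a, b));
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("el"));
        collator.setDecomposition(Collator.FULL_DECOMPOSITION);
        Map<String, Integer> ranks = buildRanks(Arrays.asList("οὐ", "εἶχε", "ἐπέβαλε", "δὲ"), collator);
        List<int[]> contexts = new ArrayList<>();
        contexts.add(encode("οὐ δὲ", ranks));
        contexts.add(encode("ἐπέβαλε", ranks));
        contexts.add(encode("εἶχε οὐ", ranks));
        sortContexts(contexts);
        for (int[] codes : contexts) {
            System.out.println(Arrays.toString(codes));
        }
    }
}
```

With this design the collation cost is paid once per lexicon, and each concordance sort afterwards is a plain lexicographic comparison of integer arrays.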
History
#1 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#2 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#3 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#4 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#5 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#6 Updated by Matthieu Decorde over 9 years ago
First, there is a bug that throws an IllegalArgumentException (or a NullPointerException, depending on the Java version) while sorting the right context column.
Second, after fixing this bug, the sort seems correct when the "el" locale is set:
mini a αὖθις a βληχρός a δὲ a εἶχε a εἶχεν a βληχρός a
mini a βληχρός a δὲ a εἶχε a εἶχεν a βληχρός a ἐπέβαλε a
mini a βληχρός a ἐπέβαλε a ἐπεῖχε a ἔλαβε a ξυνεχὴς a οὐ a
mini a δὲ a εἶχε a εἶχεν a βληχρός a ἐπέβαλε a ἐπεῖχε a
mini a εἶχε a εἶχεν a βληχρός a ἐπέβαλε a ἐπεῖχε a ἔλαβε a
mini a εἶχεν a βληχρός a ἐπέβαλε a ἐπεῖχε a ἔλαβε a ξυνεχὴς a
mini a ἔλαβε a ξυνεχὴς a οὐ a παρείπετο a ἐπέλαβε a οὐκ a
mini a ἐπέβαλε a ἐπεῖχε a ἔλαβε a ξυνεχὴς a οὐ a παρείπετο a
mini a ἐπεγίνετο
mini a ἐπεῖχε a ἔλαβε a ξυνεχὴς a οὐ a παρείπετο a ἐπέλαβε a
mini a ἐπέλαβε a οὐκ a ἐπέλαβεν a ἐπεγίνετο
mini a ἐπέλαβεν a ἐπεγίνετο
mini a ξυνεχὴς a οὐ a παρείπετο a ἐπέλαβε a οὐκ a ἐπέλαβεν a
mini a οὐ a παρείπετο a ἐπέλαβε a οὐκ a ἐπέλαβεν a ἐπεγίνετο
But the "grc" locale does not produce the right sort ("ἐ" words are separated from the "ε" words)
#7 Updated by Matthieu Decorde over 9 years ago
- Description updated (diff)
#8 Updated by Matthieu Decorde over 9 years ago
- Description updated (diff)
#9 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#10 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#11 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#12 Updated by Matthieu Decorde over 9 years ago
- Description updated (diff)
#13 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#14 Updated by Serge Heiden over 9 years ago
- Description updated (diff)
#15 Updated by Matthieu Decorde about 9 years ago
- Target version changed from TXM 0.7.6 to TXM X.X