Bug #2525
RCP: 0.7.9, English tokenisation: the apostrophe is not handled correctly when the word is in upper case or when it is followed by punctuation
Status: | New | Start date: | 25/03/2019 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % Done: | 0% |
Category: | TAL | Time spent: | - |
Target version: | - | | |
Description
Examples:
"Bloom's," is tokenized "Bloom's / ,"
"Bloom's)" is tokenized "Bloom's / )"
"Bloom's:" is tokenized "Bloom's / :"
but every other occurrence of "Bloom's" is tokenized "Bloom / 's".
In addition, "BLOOM'S" is tokenized as a single word, as are "BOYLAN'S", "DUBLIN'S", "LENEHAN'S", "MARION'S", "THAT'S" and "VIRAG'S". I count 2502 occurrences of the token "'s" in my corpus, and none of "'S".
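To make the expected behaviour concrete, here is a minimal Python sketch of a rule that splits the clitic the same way in every context. It is not the TXM tokenizer: the pattern, the `tokenize` name and the choice to accept the typographic apostrophe are assumptions made for illustration, and the sketch only deals with the 's clitic.

```python
import re

# Hypothetical rule (not TXM's actual implementation): split a trailing
# 's / 'S clitic from its host word, whatever the case of the word and
# whatever punctuation follows it.
TOKEN_RE = re.compile(
    r"""
    [^\W\d_]+(?=['’][sS]\b)   # host word immediately followed by the clitic
  | ['’][sS]\b                # the clitic itself
  | \w+                       # any other word
  | [^\w\s]                   # a single punctuation character
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)
```

With such a rule, "Bloom's," would come out as "Bloom / 's / ," and "BLOOM'S" as "BLOOM / 'S", which is consistent with how the other occurrences of "Bloom's" are already tokenized.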
Corpus used: Ulysses by James Joyce, Project Gutenberg, UTF-8 edition
http://www.gutenberg.org/files/4300/4300-0.txt
Useful queries to see the problem: INDEX on
[word="..+'.+"][]
[word="..+'.+"]@[]
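For readers who want to check an exported token list outside the tool, the first query can be approximated in Python. This is only a sketch: it assumes the query's regular expression has to match the whole token, and the `unsplit_bigrams` helper and its input format (a plain list of token strings) are hypothetical.

```python
import re
from collections import Counter

# Same expression as in the queries above: at least two characters,
# a straight apostrophe, then at least one more character.
SUSPECT = re.compile(r"..+'.+")

def unsplit_bigrams(tokens):
    """Count (token, next token) pairs whose first token still contains
    an apostrophe, i.e. where the clitic was not split off."""
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        # fullmatch mirrors the assumed whole-token matching of the query
        if SUSPECT.fullmatch(current):
            counts[(current, following)] += 1
    return counts
```

Sorting the result with `counts.most_common()` gives a frequency list comparable to an index: correctly split tokens such as "'s" never match, so only the problematic forms remain.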