Bug #2525
RCP: 0.7.9, English tokenisation: apostrophes are not handled correctly when the word is written in capitals or is followed by punctuation
Status: | New | Start date: | 03/25/2019
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 0%
Category: | TAL | Spent time: | -
Target version: | - | |
Description
Examples:
"Bloom's," is tokenized as "Bloom's / ,"
"Bloom's)" is tokenized as "Bloom's / )"
"Bloom's:" is tokenized as "Bloom's / :"
but every other occurrence of "Bloom's" is tokenized as "Bloom / 's".
"BLOOM'S" is also tokenized as a single word, as are "BOYLAN'S", "DUBLIN'S", "LENEHAN'S", "MARION'S", "THAT'S" and "VIRAG'S". I count 2502 occurrences of "'s" in my corpus, and no occurrences of "'S".
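To make the expected behaviour concrete, here is a minimal Python sketch of a tokenizer that splits the "'s" clitic regardless of case and of the punctuation that follows. It is only an illustration: the pattern names and the `tokenize` function are hypothetical, this is not TXM's actual tokenizer, and it deliberately handles nothing but the "'s" / "'S" case from the examples above.

```python
import re

# Hypothetical rules for illustration only, not TXM's tokenizer.
CLITIC = r"['’][sS]\b"   # "'s" or "'S" (straight or curly apostrophe) ending a word
WORD = r"\w+"            # a run of letters, digits or underscores
OTHER = r"[^\w\s]"       # any single punctuation character

TOKEN_RE = re.compile(f"{CLITIC}|{WORD}|{OTHER}")


def tokenize(text):
    """Split text into tokens, detaching the 's clitic in any case."""
    return TOKEN_RE.findall(text)


for sample in ["Bloom's,", "Bloom's)", "Bloom's:", "BLOOM'S", "THAT'S"]:
    print(sample, "->", " / ".join(tokenize(sample)))
```

With these rules "Bloom's," comes out as "Bloom / 's / ," and "BLOOM'S" as "BLOOM / 'S", which is the consistent result I would expect for every occurrence.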
Corpus used: Ulysses by James Joyce, Project Gutenberg, UTF-8 edition
http://www.gutenberg.org/files/4300/4300-0.txt
Useful queries to see the problem: INDEX on
[word="..+'.+"][]
[word="..+'.+"]@[]
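For readers without the corpus at hand, the short Python sketch below shows which word forms the regular expression in these queries selects. It assumes that the regular expression has to match the whole word form, which is how I understand CQL to work, so `re.fullmatch` is used; the `is_unsplit` helper and the token list are made up for illustration.

```python
import re

# The regular expression used in the word="..." constraint of the queries above.
UNSPLIT = re.compile(r"..+'.+")


def is_unsplit(token):
    """True if the whole token has 2+ characters, an apostrophe, then 1+ characters."""
    return UNSPLIT.fullmatch(token) is not None


# Illustrative tokens: correctly split forms and the problematic ones.
tokens = ["Bloom", "'s", "Bloom's", "BLOOM'S", "THAT'S", ","]
suspects = [t for t in tokens if is_unsplit(t)]
print(suspects)
```

Correctly split tokens such as "Bloom" and "'s" are not selected, so whatever the index returns is a word in which the apostrophe was left unsplit, and the trailing `[]` shows the token that follows it.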