Bug #2525

RCP: 0.7.9, English tokenisation: the apostrophe is not handled correctly when the word is in upper case or when it is followed by a punctuation mark

Added by Benedicte Pincemin 9 months ago.

Status: New
Start date: 03/25/2019
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: TAL
Spent time: -
Target version: -

Description

Examples:
"Bloom's," is tokenized "Bloom's / ,"
"Bloom's)" is tokenized "Bloom's / )"
"Bloom's:" is tokenized "Bloom's / :"
but every other occurrence of "Bloom's" is tokenized "Bloom / 's".
In addition, "BLOOM'S" is tokenized as a single word, as are "BOYLAN'S", "DUBLIN'S", "LENEHAN'S", "MARION'S", "THAT'S" and "VIRAG'S". I count 2502 occurrences of the token "'s" in my corpus, and no occurrences of "'S".
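To make the expected behaviour concrete, here is a minimal Python sketch of a tokenizer that splits the clitic consistently. It is only an illustration of the expected output, not TXM's tokenizer: the names `tokenize`, `PIECES` and `CLITIC` are hypothetical, and accepting both the straight and the curly apostrophe is an assumption.

```python
import re

# Hypothetical reference behaviour, NOT the TXM implementation.
# A piece is either a run of word characters/apostrophes or one punctuation mark.
PIECES = re.compile(r"[\w'’]+|[^\w\s]")
# A piece ending in 's or 'S (case-insensitive) is split before the apostrophe.
CLITIC = re.compile(r"^(.+)(['’]s)$", re.IGNORECASE)

def tokenize(text):
    """Split text into tokens, always separating a trailing 's / 'S clitic."""
    tokens = []
    for piece in PIECES.findall(text):
        match = CLITIC.match(piece)
        if match:
            # "Bloom's" -> "Bloom", "'s"; "BLOOM'S" -> "BLOOM", "'S"
            tokens.extend([match.group(1), match.group(2)])
        else:
            tokens.append(piece)
    return tokens

if __name__ == "__main__":
    for sample in ["Bloom's,", "Bloom's)", "Bloom's:", "BLOOM'S"]:
        print(sample, "->", " / ".join(tokenize(sample)))
```

With this sketch, the clitic is separated whether or not a punctuation mark follows and whatever the case of the word, which is the behaviour I would expect from the English tokenisation.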

Corpus used: Ulysses by James Joyce, from Project Gutenberg, UTF-8 edition
http://www.gutenberg.org/files/4300/4300-0.txt

Useful queries to see the problem (run INDEX on each):
[word="..+'.+"][]
[word="..+'.+"]@[]
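For readers without the corpus at hand, the first query can be approximated outside TXM with the following Python sketch. It assumes that the regular expression must match the whole word form (the usual CQP convention) and that the corpus is available as a plain list of tokens; `index_apostrophe_bigrams` is a hypothetical helper, not a TXM function.

```python
import re
from collections import Counter

# Same pattern as in the query: at least two characters, an apostrophe,
# then at least one more character.
APOSTROPHE_WORD = re.compile(r"..+'.+")

def index_apostrophe_bigrams(tokens):
    """Count (word containing an apostrophe, following token) pairs."""
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        # fullmatch mimics the assumption that the pattern covers the whole token
        if APOSTROPHE_WORD.fullmatch(current):
            counts[(current, following)] += 1
    return counts

if __name__ == "__main__":
    sample = ["Bloom's", ",", "said", "BLOOM'S", "hat", "Bloom's", ","]
    for pair, frequency in index_apostrophe_bigrams(sample).most_common():
        print(frequency, pair)
```

Any pair listed here is a word form that still contains the apostrophe, that is, a token in which the clitic was not split off.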
