Bug #2525: RCP: 0.7.9, English tokenisation: apostrophe is not managed well when the word is capitalized or when it is followed by a punctuation - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #2525

RCP: 0.7.9, English tokenisation: apostrophe is not managed well when the word is capitalized or when it is followed by a punctuation

Added by Benedicte Pincemin more than 6 years ago.

Status: New

Start date: 25/03/2019

Priority: Normal

Due date:

Assignee:

% Done:

Category: TAL

Spent time:

Target version:

Description

Examples:
"Bloom's," is tokenized as "Bloom's / ,"
"Bloom's)" is tokenized as "Bloom's / )"
"Bloom's:" is tokenized as "Bloom's / :"
whereas every other occurrence of "Bloom's" is tokenized as "Bloom / 's".
In addition, "BLOOM'S" is tokenized as a single word, as are "BOYLAN'S", "DUBLIN'S", "LENEHAN'S", "MARION'S", "THAT'S" and "VIRAG'S". I count 2502 occurrences of the token "'s" in my corpus, and no occurrences of "'S".
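
To make the expected behaviour concrete, here is a minimal Python sketch of a tokenization rule that splits the clitic regardless of case and of the punctuation that follows. It is not TXM's tokenizer or its rule set: the names TOKEN_RE and tokenize are hypothetical, and the pattern assumes ASCII letters and the straight apostrophe (') only.

```python
import re

# Hypothetical pattern for illustration only (not TXM's actual rules).
# Alternatives are tried in order:
#   1. the clitic 's or 'S, when it is not followed by another letter
#   2. a run of letters that may contain internal apostrophes (o'clock)
#      but stops before a trailing clitic 's / 'S
#   3. any other single non-space character (punctuation, digits, ...)
TOKEN_RE = re.compile(
    r"'s(?![A-Za-z])"
    r"|[A-Za-z]+(?:'(?!s(?![A-Za-z]))[A-Za-z]+)*"
    r"|\S",
    re.IGNORECASE,
)


def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for sample in ["Bloom's,", "Bloom's)", "Bloom's:", "BLOOM'S", "THAT'S"]:
        print(sample, "->", " / ".join(tokenize(sample)))
```

With this sketch, "Bloom's," gives "Bloom / 's / ," and "BLOOM'S" gives "BLOOM / 'S", which is the segmentation the report expects in both problem cases.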

Corpus used: Ulysses by James Joyce, Project Gutenberg, UTF-8 edition
http://www.gutenberg.org/files/4300/4300-0.txt

Useful queries to see the problem (run INDEX on each):
[word="..+'.+"][]
[word="..+'.+"]@[]
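
For readers without TXM at hand, the first query can be approximated in Python on an already tokenized text. This is only a sketch under assumptions: index_unsplit is a hypothetical helper, the input is a plain list of word forms, and the `@` target marker of the second query is not modelled. Like CQP, it matches the regular expression against the whole word form.

```python
import re
from collections import Counter

# Same regular expression as in the query: at least two characters,
# an apostrophe, then at least one more character.
UNSPLIT = re.compile(r"..+'.+")


def index_unsplit(tokens):
    """Count (word, next word) pairs whose first word still contains
    an internal apostrophe, i.e. was not split by the tokenizer."""
    counts = Counter()
    for current, following in zip(tokens, tokens[1:]):
        # fullmatch mirrors CQP, where the pattern must cover the whole form
        if UNSPLIT.fullmatch(current):
            counts[(current, following)] += 1
    return counts


if __name__ == "__main__":
    sample = ["Bloom's", ",", "said", "Bloom", "'s", "BLOOM'S", "hat"]
    for pair, freq in index_unsplit(sample).most_common():
        print(freq, pair)
```

A correctly split "'s" token is not counted, because the pattern requires at least two characters before the apostrophe.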
