Bug #1592
Updated by Matthieu Decorde about 9 years ago
Two main bugs were found:
"Bar-le-Duc" is tokenized as "Bar" "-l":
"-Duc" is dropped and "-le" is truncated.
"Qu'est-ce" is tokenized as "Qu'" "-ce":
"est" is dropped.
"mont-d'or" is tokenized as "mont-d'" "or" instead of "mont-d'or".
h3. Solution
There was an error in the French clitic regular expression, and the characters surrounding the clitics were not fully processed (no iteration).
There was also an error in the elision regular expression: the "X'" elision must be at the beginning of the token.
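The two rules above can be illustrated with a minimal Python sketch. This is not the actual tokenizer code: the regular expressions, the word lists (deliberately incomplete) and the names @ELISION@, @CLITIC@, @split_word@ and @tokenize@ are assumptions made for this example only. It anchors the elision at the start of the token and peels clitics off the end of the token in a loop.

```python
import re

# Hypothetical, incomplete word lists: illustration only, not the real tokenizer rules.
# Elision "X'" is accepted only at the very beginning of a token.
ELISION = re.compile(r"^(?:[cdjlmnst]|qu|jusqu|lorsqu|puisqu)['’]", re.IGNORECASE)
# A clitic is accepted only at the very end of a token.
CLITIC = re.compile(
    r"-(?:t-)?(?:je|tu|il|elle|on|nous|vous|ils|elles|ce|le|la|les|lui|leur|moi|toi|y|en)$",
    re.IGNORECASE,
)
TRAILING_PUNCT = re.compile(r"^(.*?)([.,;:!?]*)$")


def split_word(word):
    """Split one whitespace-free word into elision, core and clitic tokens."""
    prefix = []
    # Peel leading elisions; stop if nothing would remain after the apostrophe.
    while True:
        m = ELISION.match(word)
        if not m or m.end() == len(word):
            break
        prefix.append(m.group(0))
        word = word[m.end():]
    suffix = []
    # Iterate so that stacked clitics ("donne-le-moi") are all processed.
    while True:
        m = CLITIC.search(word)
        if not m or m.start() == 0:
            break
        suffix.insert(0, m.group(0))
        word = word[:m.start()]
    return prefix + [word] + suffix


def tokenize(text):
    """Whitespace split, then trailing punctuation, elision and clitic handling."""
    tokens = []
    for chunk in text.split():
        word, punct = TRAILING_PUNCT.match(chunk).groups()
        if word:
            tokens.extend(split_word(word))
        tokens.extend(punct)  # one token per punctuation character
    return tokens


if __name__ == "__main__":
    for sample in ("Bar-le-Duc", "Qu'est-ce", "mont-d'or", "donne-le-moi"):
        print(sample, "->", tokenize(sample))
```

In this sketch "Bar-le-Duc" stays whole because "-le" is not at the end of the token, and "mont-d'or" stays whole because "d'" is not at the beginning.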
h3. Validation test
Import from the clipboard, with the language set to FR, using the following content:
<pre>
le mot Bar-le-Duc et le mot mont-d'or
l'apostrophe ne pose pas d'problème !
</pre>
"Bar-le-Duc" and "mont-d'or" should be words.
The word lexicon should be:
<pre>
le
mot
Bar-le-Duc
et
le
mot
mont-d'or
l'
apostrophe
ne
pose
pas
d'
problème
!
</pre>