Bug #1592

Updated by Serge Heiden almost 4 years ago

Two main bugs have been found:

"Bar-le-Duc" is tokenized as "Bar" "-l":
"-Duc" is dropped and "-le" is truncated to "-l".

"Qu'est-ce" is tokenized as "Qu'" "-ce":
"est" is dropped.

"mont-d'or" is tokenized as "mont-d'" "or" instead of "mont-d'or".

h3. Solution

There was an error in the French clitic regular expression, and the characters neighboring the clitics were not fully processed (no iteration).
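
As an illustration only, the iteration idea can be sketched in Python. This is not the actual tokenizer code: the clitic list, the @ENCLITIC_RE@ pattern and the @split_enclitics@ function are hypothetical names, and the list is deliberately incomplete. The sketch assumes a clitic is only split off when it ends the token, and the token is re-examined until no trailing clitic remains.

```python
import re

# Hypothetical, non-exhaustive list of French enclitics (an assumption for
# this sketch, not the tokenizer's real list).
ENCLITICS = [
    "-t-elles", "-t-elle", "-t-ils", "-t-il", "-t-on",
    "-elles", "-elle", "-ils", "-il", "-nous", "-vous",
    "-leur", "-les", "-lui", "-moi", "-toi",
    "-je", "-tu", "-on", "-ce", "-le", "-la", "-y", "-en",
]

# A clitic is only recognized when it ends the token ("$"); the lazy prefix
# keeps at least one character in front of it.
ENCLITIC_RE = re.compile(
    r"^(.+?)(" + "|".join(re.escape(c) for c in ENCLITICS) + r")$",
    re.IGNORECASE,
)


def split_enclitics(token):
    """Strip trailing clitics one by one until none is left."""
    clitics = []
    while True:
        match = ENCLITIC_RE.match(token)
        if match is None:
            break
        token = match.group(1)               # what remains in front of the clitic
        clitics.insert(0, match.group(2))    # keep clitics in reading order
    return [token] + clitics


print(split_enclitics("Bar-le-Duc"))    # ['Bar-le-Duc']
print(split_enclitics("est-ce"))        # ['est', '-ce']
print(split_enclitics("donne-le-moi"))  # ['donne', '-le', '-moi']
```

Because "-le" in "Bar-le-Duc" is not at the end of the token, nothing is split and no character is lost. A list-based approach like this still splits lexicalized compounds such as "rendez-vous", which a real tokenizer would have to handle separately.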

There was also an error in the elision regular expression: the "X'" elision must be at the beginning of the token.
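
A minimal Python sketch of the anchoring constraint, under stated assumptions: the list of elided forms, the @ELISION_RE@ pattern and the @split_elision@ function are hypothetical and do not reproduce the tokenizer's real expression. The point is only that the "X'" prefix is matched with @^@, so an apostrophe inside a token no longer triggers a split.

```python
import re

# Hypothetical list of elided forms (assumption for this sketch). The "^"
# anchor means the elision is only recognized at the beginning of the token.
ELISION_RE = re.compile(
    r"^((?:lorsqu|puisqu|quoiqu|jusqu|qu|[cdjlmnst])['’])(.+)$",
    re.IGNORECASE,
)


def split_elision(token):
    """Split a leading elided form from the rest of the token, if any."""
    match = ELISION_RE.match(token)
    if match is None:
        return [token]
    return [match.group(1), match.group(2)]


print(split_elision("Qu'est-ce"))    # ["Qu'", 'est-ce']
print(split_elision("mont-d'or"))    # ["mont-d'or"]
print(split_elision("aujourd'hui"))  # ["aujourd'hui"]
```

Without the anchor, a search for the same pattern would find "d'" in the middle of "mont-d'or" and produce the faulty "mont-d'" "or" split described above.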

h3. Validation test

Using the clipboard import, import the following content in FR:
<pre>
Bar-le-Duc est tokenisé : Bar -l
-Duc est supprimé et -le est tronqué comme -l
mont-d'or est tokenisé mont-d' au lieu de mont-d'or
</pre>
"Bar-le-Duc" and "mont-d'or" should each be tokenized as a single word.
