Bug #1947: RCP: 0.7.8, English instead of French tokenisation - Plateforme TXM - Forge du Centre Blaise Pascal

Updated by Matthieu Decorde almost 9 years ago

- import corpus A with the XML/w module, using 'en' tokenization and TreeTagger annotation
- corpus A is OK
- import corpus B with the TXT+CSV module, using 'fr' tokenization and TreeTagger annotation
- corpus B is tokenized with the English rules instead of the French ones

Corpus A file.xml
<pre>
This corpus works fine. That's all.
</pre>

Corpus B file.txt
<pre>
C'est une phrase qui ne s'tokenize pas bien.
</pre>

h3. Solution 1

MD: Strings in which a punctuation mark is followed by a clitic are not tokenized correctly.

Fix the tokenizer's punctuation regular expression so that punctuation is split off before the clitic rules (which match on the apostrophe) are applied.
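
The ordering proposed in Solution 1 can be sketched as follows. This is only an illustration in Python, not the TXM tokenizer: the patterns @PUNCT@ and @FR_PROCLITIC@ and the function @tokenize_fr@ are hypothetical names with deliberately simplified rules. It shows punctuation being isolated first, so that a clitic standing right after a punctuation mark is still found at the start of its chunk.

```python
import re

# Hypothetical, simplified patterns for illustration; not TXM's actual rules.
PUNCT = re.compile(r'([.,;:!?()"«»])')
# Elided French proclitics such as c', d', j', l', m', n', s', t', qu'
FR_PROCLITIC = re.compile(r"^((?:[cdjlmnst]|qu)')(.+)$", re.IGNORECASE)


def tokenize_fr(text):
    """Split on whitespace, then punctuation, then elided clitics."""
    tokens = []
    for chunk in text.split():
        # Step 1: isolate punctuation marks before looking at clitics.
        for piece in PUNCT.sub(r" \1 ", chunk).split():
            # Step 2: peel elided clitics off the front of the piece.
            match = FR_PROCLITIC.match(piece)
            while match:
                tokens.append(match.group(1))
                piece = match.group(2)
                match = FR_PROCLITIC.match(piece)
            tokens.append(piece)
    return tokens


print(tokenize_fr("C'est une phrase qui ne s'tokenize pas bien."))
print(tokenize_fr("(l'ami)"))
```

In this sketch, the clitic pattern is anchored at the start of each piece, so it would miss @l'@ in @(l'ami)@ if the parenthesis were still attached. Running the punctuation step first removes that dependency.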
