Bug #1947
RCP: 0.7.8, English instead of French tokenization
Status: | New | Start date: | 11/28/2016
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 80%
Category: | Import | Spent time: | -
Target version: | TXM 0.7.8 | |
Description
- import corpus A with the XML/w module, using 'en' tokenization and TreeTagger annotation
- corpus A is OK
- import corpus B with the TXT+CSV module, using 'fr' tokenization and TreeTagger annotation
- corpus B is tokenized with the English rules instead of the French rules
Corpus A file.xml
This corpus works fine. That's all.
Corpus B file.txt
C'est une phrase qui ne s'tokenize pas bien.
Solution 1
MD: Strings in which a punctuation mark is followed by a clitic are not tokenized correctly.
Fix the tokenizer's punctuation regular expression so that punctuation is processed before the clitic rule is applied.
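
The ordering issue described in Solution 1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the patterns `LEADING_PUNCT` and `CLITIC` and the helper names are hypothetical and are not the actual TXM tokenizer expressions, which this ticket does not show.

```python
import re

# Hypothetical patterns (assumptions, not the real TXM rules).
LEADING_PUNCT = re.compile(r'^(["(\[«]+)(.+)$')
CLITIC = re.compile(r"^((?:[cdjlmnst]|qu)')(.+)$", re.IGNORECASE)


def split_clitic(token):
    """Split a leading elided clitic (e.g. "C'") from the rest of the token."""
    m = CLITIC.match(token)
    return [m.group(1), m.group(2)] if m else [token]


def split_punct(token):
    """Split leading punctuation from the rest of the token."""
    m = LEADING_PUNCT.match(token)
    return [m.group(1), m.group(2)] if m else [token]


def tokenize(token, punct_first):
    """Apply the two splitting steps in the requested order."""
    steps = (split_punct, split_clitic) if punct_first else (split_clitic, split_punct)
    parts = [token]
    for step in steps:
        parts = [piece for part in parts for piece in step(part)]
    return parts


clitic_first = tokenize('"C\'est', punct_first=False)
punct_first = tokenize('"C\'est', punct_first=True)
```

Under these assumptions, running the clitic step first leaves `C'est` unsplit because the leading quote hides the clitic from the anchored pattern, while running the punctuation step first lets the clitic be detected.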
Solution 2
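The problem was not caused by the language parameter but by the rule that tokenizes clitics.
The corpus content was actually closer to:
C' est une phrase qui ne s' tokenize pas bien.
and the tokenizer rule was "${cliticrule}.+", which requires at least one character after the clitic, so a clitic that is followed directly by a space (such as "C' ") is presumably not matched.
The solution is to change the rule to "${cliticrule}.*", which also accepts a clitic with nothing after it.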
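
The difference between the two rules can be checked with a short sketch. The value of `CLITIC_RULE` below is a hypothetical stand-in for `${cliticrule}` (the real expression is not given in this ticket), and matching whole tokens with `fullmatch` is likewise an assumption about how the rule is applied.

```python
import re

# Hypothetical stand-in for ${cliticrule}: a few French elided clitics
# followed by an apostrophe. The real TXM rule is not shown in this ticket.
CLITIC_RULE = r"(?:[cdjlmnst]|qu)'"

old_rule = re.compile(CLITIC_RULE + ".+", re.IGNORECASE)  # "${cliticrule}.+"
new_rule = re.compile(CLITIC_RULE + ".*", re.IGNORECASE)  # "${cliticrule}.*"


def matches(rule, token):
    """Return True when the rule matches the whole token."""
    return rule.fullmatch(token) is not None


tokens = ["C'est", "C'", "s'"]
# For each token: (matched by old rule, matched by new rule)
results = {t: (matches(old_rule, t), matches(new_rule, t)) for t in tokens}
```

With this stand-in, `C'est` is matched by both rules, while the bare clitics `C'` and `s'` are matched only by the `.*` variant, which is consistent with the fix described above.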
History
#1 Updated by Matthieu Decorde almost 7 years ago
- Description updated (diff)
- % Done changed from 0 to 20
#2 Updated by Matthieu Decorde almost 7 years ago
- Description updated (diff)
- % Done changed from 20 to 80