Bug #1947

RCP: 0.7.8, English instead of French tokenisation

Added by Matthieu Decorde over 2 years ago. Updated over 2 years ago.

Status: New
Start date: 11/28/2016
Priority: Normal
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.7.8

Description

- Import corpus A with the XML/w module, using 'en' tokenization and TreeTagger annotation
- Corpus A is OK
- Import corpus B with the TXT+CSV module, using 'fr' tokenization and TreeTagger annotation
- Corpus B gets an English tokenization instead of a French tokenization

Corpus A file.xml

This corpus works fine. That's all.

Corpus B file.txt

C'est une phrase qui ne s'tokenize pas bien.

Solution 1

MD: Strings in which a punctuation mark is followed by a clitic are not tokenized correctly.

Fix the tokenizer's punctuation regular expression so that ordinary punctuation is processed before the clitic punctuation.
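
The ordering described above can be illustrated with a minimal Python sketch. The patterns LEADING_PUNCT and CLITIC and the function tokenize_word are hypothetical stand-ins, since the actual TXM tokenizer rules are not shown in this ticket. The sketch only shows why stripping leading punctuation first lets a start-anchored clitic rule match.

```python
import re

# Hypothetical patterns; the real TXM tokenizer rules are not shown in this ticket.
LEADING_PUNCT = re.compile(r'^(["(«\[]+)(.*)$')
CLITIC = re.compile(r"^((?:[cdjlmnst]|qu)')(.+)$", re.IGNORECASE)

def tokenize_word(word):
    """Split one whitespace-delimited word: punctuation first, then the clitic."""
    tokens = []
    m = LEADING_PUNCT.match(word)
    if m:
        tokens.extend(m.group(1))  # one token per leading punctuation character
        word = m.group(2)
    m = CLITIC.match(word)
    if m:
        tokens.append(m.group(1))  # the clitic, apostrophe included
        word = m.group(2)
    if word:
        tokens.append(word)
    return tokens
```

If the clitic rule ran first on a word such as "(l'arbre", it would not match at all, because the word starts with a parenthesis rather than with the clitic.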

Solution 2

The problem was caused not by the language parameter but by the rule used to tokenize clitics.

The corpus content was actually closer to:

C' est une phrase qui ne s' tokenize pas bien.

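and the tokenizer rule was "${cliticrule}.+", which requires at least one character after the clitic.

The solution is to change the rule to "${cliticrule}.*", so that a clitic with nothing after it also matches.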
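
The difference between the two rules can be checked with a small Python sketch. CLITIC_RULE is a hypothetical stand-in for ${cliticrule}, whose real definition is not given in this ticket, and the sketch assumes the rule is applied to each whitespace-delimited token as a whole.

```python
import re

# Hypothetical stand-in for ${cliticrule}; the real TXM pattern is not shown here.
CLITIC_RULE = r"(?:[cdjlmnst]|qu)'"

OLD_RULE = re.compile(rf"({CLITIC_RULE})(.+)", re.IGNORECASE)  # "${cliticrule}.+"
NEW_RULE = re.compile(rf"({CLITIC_RULE})(.*)", re.IGNORECASE)  # "${cliticrule}.*"

def split_clitic(token, rule):
    """Return the clitic and the remainder, or None if the rule does not match."""
    m = rule.fullmatch(token)
    if m is None:
        return None
    clitic, rest = m.groups()
    return [clitic] + ([rest] if rest else [])
```

Under these assumptions the old rule accepts "C'est" but rejects the isolated token "C'", because ".+" needs at least one character after the apostrophe. The new rule accepts both.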

History

#1 Updated by Matthieu Decorde over 2 years ago

  • Description updated (diff)
  • % Done changed from 0 to 20

#2 Updated by Matthieu Decorde over 2 years ago

  • Description updated (diff)
  • % Done changed from 20 to 80
