Bug #1947: RCP: 0.7.8, English instead of French tokenisation - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #1947

RCP: 0.7.8, English instead of French tokenisation

Added by Matthieu Decorde almost 9 years ago. Updated almost 2 years ago.

Status: Closed
Start date: 28/11/2016
Priority: Normal
Due date:
Assignee:
% Done: 100%
Category: Import
Spent time:
Target version: TXM 0.7.8

Description

- Import corpus A with the XML/w module, using 'en' tokenization and TreeTagger annotation.
- Corpus A is OK.
- Import corpus B with the TXT+CSV module, using 'fr' tokenization and TreeTagger annotation.
- Corpus B gets an English tokenization instead of a French one.

Corpus A file.xml

This corpus works fine. That's all.

Corpus B file.txt

C'est une phrase qui ne s'tokenize pas bien.

Solution 1

MD: Strings in which a punctuation mark is followed by a clitic are not tokenized correctly.

Fix the tokenizer's punctuation regular expression so that punctuation is processed before the clitic rule is applied.
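
As a rough illustration of the ordering described in Solution 1, the Python sketch below strips leading and trailing punctuation from a whitespace-delimited chunk before trying a clitic rule. The patterns `PUNCT` and `CLITIC`, the function `tokenize_chunk`, and the sample input "(C'est" are assumptions made for this example; the actual TXM tokenizer expressions are not shown in this ticket.

```python
import re

# Hypothetical patterns, NOT the real TXM tokenizer expressions.
# PUNCT peels optional leading/trailing punctuation off a chunk.
PUNCT = re.compile(r"^([(\[«]+)?(.*?)([)\]».,;:!?]+)?$")
# CLITIC splits a French elided proclitic (c', d', j', l', qu', ...) from what follows.
CLITIC = re.compile(r"^((?:[cdjlmnst]|qu)')(.+)$", re.IGNORECASE)


def tokenize_chunk(chunk):
    """Tokenize one whitespace-delimited chunk: punctuation first, clitics second."""
    lead, core, trail = PUNCT.match(chunk).groups()
    tokens = list(lead or "")          # each leading punctuation mark is a token
    m = CLITIC.match(core)
    if m:
        tokens += list(m.groups())     # clitic + remainder
    elif core:
        tokens.append(core)
    tokens += list(trail or "")        # each trailing punctuation mark is a token
    return tokens


print(tokenize_chunk("(C'est"))
print(tokenize_chunk("bien."))
```

Because the punctuation is removed first, the clitic rule sees "C'est" rather than "(C'est" and can anchor at the start of the word.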

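Solution 2

The problem was caused not by the language parameter but by the rule used to tokenize clitics.

The corpus content was actually closer to:

C' est une phrase qui ne s' tokenize pas bien.

and the tokenizer rule was "${cliticrule}.+", which requires at least one character after the clitic, so a clitic standing alone (such as "C'" or "s'") presumably did not match.

The solution is to change the rule to "${cliticrule}.*", which also accepts a clitic with nothing after it.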
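
The difference between the two quantifiers can be checked with a small Python sketch. The `CLITIC` pattern below is a hypothetical stand-in for `${cliticrule}`, whose real definition is not given in this ticket, and `split_clitic` is an illustrative helper, not a TXM function.

```python
import re

# Hypothetical stand-in for ${cliticrule}: a few French elided proclitics.
CLITIC = r"(?:[cdjlmnst]|qu)'"

OLD_RULE = re.compile(rf"({CLITIC})(.+)", re.IGNORECASE)  # "${cliticrule}.+"
NEW_RULE = re.compile(rf"({CLITIC})(.*)", re.IGNORECASE)  # "${cliticrule}.*"


def split_clitic(chunk, rule):
    """Return the tokens for one chunk, or None if the rule does not apply."""
    m = rule.fullmatch(chunk)
    if m is None:
        return None
    clitic, rest = m.groups()
    return [clitic] + ([rest] if rest else [])


for chunk in ["C'est", "C'", "s'"]:
    print(chunk, split_clitic(chunk, OLD_RULE), split_clitic(chunk, NEW_RULE))
```

With `.+`, a chunk that consists of the clitic alone is not matched by the rule at all, so it falls through to whatever rules come next; with `.*`, the same chunk is recognised as a clitic with an empty remainder.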

History

#1 Updated by Matthieu Decorde almost 9 years ago

Description updated (diff)
% Done changed from 0 to 20

#2 Updated by Matthieu Decorde almost 9 years ago

Description updated (diff)
% Done changed from 20 to 80

#3 Updated by Sebastien Jacquot almost 2 years ago

Status changed from New to Closed

#4 Updated by Sebastien Jacquot almost 2 years ago

% Done changed from 80 to 100
