Bug #1592

TBX: X.X, Tokenizer: words preceding and following clitics are lost

Added by Matthieu Decorde about 8 years ago. Updated over 7 years ago.

Status:New Start date:11/13/2015
Priority:High Due date:
Assignee:- % Done:80%

Category:Import Spent time: -
Target version:TXM 0.7.8

Description

Two main bugs have been found:

"Bar-le-Duc" is tokenized as "Bar" "-l":
"-Duc" is dropped and "-le" is truncated.

"Qu'est-ce" is tokenized as "Qu'" "-ce":
"est" is dropped.

"mont-d'or" is tokenized as "mont-d'" "or" instead of "mont-d'or".

Solution

There was an error in the French clitic regular expression, and the characters neighboring the clitics were not fully processed (no iteration).

There was also an error in the elision regular expression: the "X'" elision must be at the beginning of the token.

Use the TreeTagger pclitic tokenization rules and replace the "'" quote with a regular expression matching multiple quote characters. (We can't use the Unicode category, since the quote's category is 'Po', which also contains other word-separating characters: http://www.fileformat.info/info/unicode/category/Po/list.htm)
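The approach described above can be sketched in Python. This is only an illustration under stated assumptions, not the TXM implementation: the clitic lists are an approximation modeled on TreeTagger's French tokenizer rules, the set of apostrophe variants in QUOTE is an assumed list, and the names tokenize and split_chunk are hypothetical. The sketch shows the two points of the fix: the elision is anchored at the beginning of the token, and clitics are stripped in a loop until none remain.

```python
import re

# Apostrophe variants accepted in place of the plain "'" (assumed list).
QUOTE = "['\u2019\u2018\u02BC]"

# Elisions ("X'"), anchored at the START of the token and followed by
# at least one more character. Approximation of TreeTagger's French rules.
PCLITIC = re.compile(
    rf"^((?:[dcjlmnst]|qu|jusqu|lorsqu){QUOTE})(?=.)", re.IGNORECASE
)

# Enclitics, anchored at the END of the token and preceded by at least
# one character. Approximation of TreeTagger's French rules.
FCLITIC = re.compile(
    r"(?<=.)(-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?"
    r"|-leur|-lui|-m\u00EAmes?|-moi|-nous|-on|-toi|-tu|-vous|-en|-y"
    r"|-ci|-l\u00E0)$",
    re.IGNORECASE,
)

LEADING = '"([{\u00AB'           # punctuation peeled from the left
TRAILING = '".,;:!?)]}\u00BB'    # punctuation peeled from the right


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    prefix, suffix = [], []
    # Peel punctuation off both ends, one character at a time.
    while len(chunk) > 1 and chunk[0] in LEADING:
        prefix.append(chunk[0])
        chunk = chunk[1:]
    while len(chunk) > 1 and chunk[-1] in TRAILING:
        suffix.insert(0, chunk[-1])
        chunk = chunk[:-1]
    # Iterate: strip elisions from the beginning until none is left.
    while True:
        m = PCLITIC.match(chunk)
        if not m:
            break
        prefix.append(m.group(1))
        chunk = chunk[m.end():]
    # Iterate: strip enclitics from the end until none is left.
    while True:
        m = FCLITIC.search(chunk)
        if not m:
            break
        suffix.insert(0, m.group(1))
        chunk = chunk[:m.start()]
    return prefix + [chunk] + suffix


def tokenize(text):
    """Tokenize a line of French text on whitespace, then on clitics."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens
```

Because the elision pattern only matches at the start of a token, "aujourd'hui" and "Mont-d'or" are left whole, and because the enclitic pattern only matches at the end, "Bar-le-Duc" is not split at "-le".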

Validation test

Import the following content through the clipboard, with FR as the language:

Bar-le-Duc n'est pas le Mont-d'or
La rue de la Goutte-d'Or ou la rue de la Chaussée-d'Antin.
le mot est-il ?
c'est comme-ci ou comme-là
qu'il faut c'est-à-dire
mot-composé-de-tirets
l'apostrophe ne pose pas d'problème !
mot composé assemblée_générale
c'est "la fin" pour aujourd'hui.

The internal view should be composed of 4 pages:

Bar-le-Duc
n'
est
pas
le
Mont-d'or
La
rue
de
la
Goutte-d'Or
ou
la
rue
de
la
Chaussée-d'Antin
.

le
mot
est
-il
?

c'
est
comme
-ci
ou
comme
-là
qu'
il
faut
c'
est-à-dire
mot-composé-de-tirets
l'
apostrophe
ne
pose
pas
d'
problème
!

mot
composé
assemblée_générale
c'
est
" 
la
fin
" 
pour
aujourd'hui
.

History

#1 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#3 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)

#4 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#5 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#6 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#7 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#8 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#9 Updated by Matthieu Decorde over 7 years ago

  • Description updated (diff)

#10 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)

#11 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)
