Bug #2280
RCP: 0.7.8, missing TreeTagger French tokenisation rules
Status: | New | Start date: | 11/14/2017
---|---|---|---
Priority: | High | Due date: |
Assignee: | - | % Done: | 80%
Category: | Import | Spent time: | -
Target version: | TXM 0.8.2 | |
Description
/** The TT enclitics. */
public static String FClitic_en = "'(s|re|ve|d|m|em|ll)|n['‘’]t";
public static String PClitic_fr = '[dcjlmnstDCJLNMST][\'‘’]|[Qq]u[\'‘’]|[Jj]usqu[\'‘’]|[Ll]orsqu[\'‘’]|[Pp]uisqu[\'‘’]|[Qq]uoiqu[\'‘’]';
public static String FClitic_fr = '-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-mêmes?|-m[\'‘’]|-moi|-nous|-on|-toi|-tu|-t[\'‘’]|-vous|-en|-y|-ci|-là';
public static String PClitic_it = '[dD][ae]ll[\'‘’]|[nN]ell[\'‘’]|[Aa]ll[\'‘’]|[lLDd][\'‘’]|[Ss]ull[\'‘’]|[Qq]uest[\'‘’]|[Uu]n[\'‘’]|[Ss]enz[\'‘’]|[Tt]utt[\'‘’]';
public static String FClitic_gl = '-la|-las|-lo|-los|-nos';
BP 2019-04-08 - Contribution to the diagnosis
(i) PClitic_fr should also handle "y'" and "Y'" (especially in speech transcriptions). Cf. the INDEX of .'.+ in the LEMAN corpus (Fmin=2):
y'a 127
y'en 30
Y'a 21
Y'en 8
y'avait 4
y'aura 3
y'ait 2
See also the Montpellier team's experiments on the Rivesaltes corpus (Matrice project, Copil of April 5th 2019).
(ii) The handling of the French euphonic "t" ("t euphonique") is also unclear. Here are examples taken from the VOEUX corpus:
0013 L'année 1971 n'en [a-t]_ADJ [-elle]_PRO:PER pas apporté quelques preuves ?
0014 l'année de la sagesse. [Puisse-t-]_NOM elle, Français, Françaises, être pour chacun et
0021 de fête. 1980 nous [apportera-t]_VER:simp [-il]_PRO:PER la paix ou la guerre ?
0035 qu'on nous annonce [amorcera-]_NOM [t]_VER:simp [-elle]_PRO:PER la décrue du chômage ?
=> Part (ii) is dealt with in ticket #3090: https://forge.cbp.ens-lyon.fr/redmine/issues/3090
Solution
Replace:
public static String PClitic_fr = '[dcjlmnstDCJLNMST][\'‘’]|[Qq]u[\'‘’]|[Jj]usqu[\'‘’]|[Ll]orsqu[\'‘’]|[Pp]uisqu[\'‘’]|[Qq]uoiqu[\'‘’]';
with:
public static String PClitic_fr = '[dcjlmnstyDCJLNMSTY][\'‘’]|[Qq]u[\'‘’]|[Jj]usqu[\'‘’]|[Ll]orsqu[\'‘’]|[Pp]uisqu[\'‘’]|[Qq]uoiqu[\'‘’]';
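The effect of this change can be checked in isolation with a minimal sketch, written here in plain Java rather than in the tokenizer's own code. The two patterns are copied from this ticket, with the typographic apostrophes written as Unicode escapes. The class name `PCliticFrCheck`, the helper `leadingClitic` and the choice to anchor the pattern at the start of the word are assumptions made for illustration; how the TXM tokenizer actually applies `PClitic_fr` is not shown in this ticket.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical test harness, not part of TXM.
class PCliticFrCheck {
    // Apostrophe class used in the ticket: ASCII ', U+2018 and U+2019.
    static final String APOS = "['\u2018\u2019]";

    // Pattern before the change (without y/Y), as quoted in the ticket.
    static final String OLD_PCLITIC_FR =
        "[dcjlmnstDCJLNMST]" + APOS + "|[Qq]u" + APOS + "|[Jj]usqu" + APOS
        + "|[Ll]orsqu" + APOS + "|[Pp]uisqu" + APOS + "|[Qq]uoiqu" + APOS;

    // Pattern after the change (y and Y added to the first character class).
    static final String NEW_PCLITIC_FR =
        "[dcjlmnstyDCJLNMSTY]" + APOS + "|[Qq]u" + APOS + "|[Jj]usqu" + APOS
        + "|[Ll]orsqu" + APOS + "|[Pp]uisqu" + APOS + "|[Qq]uoiqu" + APOS;

    // Assumption: a proclitic is looked for at the very start of the word.
    // Returns the matched proclitic, or null when the word has none.
    static String leadingClitic(String pattern, String word) {
        Matcher m = Pattern.compile("^(?:" + pattern + ")").matcher(word);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Forms taken from the LEMAN index above, plus two control words.
        String[] words = {"y'a", "y'en", "Y'a", "Y'en", "y'avait", "l'ann\u00e9e", "qu'on"};
        for (String w : words) {
            System.out.println(w
                + "  old=" + leadingClitic(OLD_PCLITIC_FR, w)
                + "  new=" + leadingClitic(NEW_PCLITIC_FR, w));
        }
    }
}
```

Under these assumptions, the old pattern finds no proclitic in the "y'" forms, the new one isolates "y'" or "Y'", and words such as "l'année" and "qu'on" are treated the same by both patterns.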
Associated revisions
refs #2280
History
#1 Updated by Sebastien Jacquot about 5 years ago
- Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
#2 Updated by Matthieu Decorde over 4 years ago
- Target version changed from TXM 0.8.0 to TXM X.X
#3 Updated by Benedicte Pincemin over 4 years ago
- Description updated (diff)
#4 Updated by Matthieu Decorde over 4 years ago
- Priority changed from Normal to High
- Target version changed from TXM X.X to TXM 0.8.2
#5 Updated by Benedicte Pincemin over 4 years ago
- Description updated (diff)
- Priority changed from High to Normal
- Target version changed from TXM 0.8.2 to TXM X.X
#6 Updated by Benedicte Pincemin over 4 years ago
- Priority changed from Normal to High
- Target version changed from TXM X.X to TXM 0.8.2
#7 Updated by Benedicte Pincemin over 4 years ago
- Description updated (diff)
#8 Updated by Matthieu Decorde over 2 years ago
- Description updated (diff)
- % Done changed from 0 to 80
#9 Updated by Benedicte Pincemin about 2 years ago
- Description updated (diff)