Bug #2280

RCP: 0.7.8, missing TreeTagger French tokenisation rules

Added by Matthieu Decorde almost 2 years ago. Updated 5 months ago.

Status: New
Start date: 11/14/2017
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM 0.8.1

Description

/** The TT enclitics. */
public static String FClitic_en = "'(s|re|ve|d|m|em|ll)|n['‘’]t";
public static String PClitic_fr = '[dcjlmnstDCJLNMST][\'‘’]|[Qq]u[\'‘’]|[Jj]usqu[\'‘’]|[Ll]orsqu[\'‘’]|[Pp]uisqu[\'‘’]|[Qq]uoiqu[\'‘’]';
public static String FClitic_fr = '-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-mêmes?|-m[\'‘’]|-moi|-nous|-on|-toi|-tu|-t[\'‘’]|-vous|-en|-y|-ci|-là';
public static String PClitic_it = '[dD][ae]ll[\'‘’]|[nN]ell[\'‘’]|[Aa]ll[\'‘’]|[lLDd][\'‘’]|[Ss]ull[\'‘’]|[Qq]uest[\'‘’]|[Uu]n[\'‘’]|[Ss]enz[\'‘’]|[Tt]utt[\'‘’]';
public static String FClitic_gl = '-la|-las|-lo|-los|-nos';

BP 2019-04-08 - Contribution to the diagnosis
(i) PClitic_fr should also handle "y'" and "Y'" (especially for speech transcriptions). Cf. the INDEX of .'.+ in the LEMAN corpus (Fmin=2):
y'a 127
y'en 30
Y'a 21
Y'en 8
y'avait 4
y'aura 3
y'ait 2
See also the Montpellier team's experiments on the Rivesaltes corpus (Matrice project, Copil of April 5th, 2019).
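To make point (i) concrete, here is a minimal Java sketch comparing the PClitic_fr pattern quoted in the description with a hypothetical variant that also accepts "y'" and "Y'". The class name, the helper `proclitic` and the constant `PCLITIC_FR_Y` are illustrative assumptions, not TXM code. The sketch also assumes the pattern is applied anchored at the start of the token and followed by at least one more character; this ticket does not show how TXM actually applies it. The curly apostrophes are written as \u2018 and \u2019 escapes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProcliticSketch {
    // PClitic_fr as quoted in this ticket (\u2018 and \u2019 are the curly apostrophes).
    static final String PCLITIC_FR =
        "[dcjlmnstDCJLNMST]['\u2018\u2019]|[Qq]u['\u2018\u2019]|[Jj]usqu['\u2018\u2019]"
        + "|[Ll]orsqu['\u2018\u2019]|[Pp]uisqu['\u2018\u2019]|[Qq]uoiqu['\u2018\u2019]";

    // Hypothetical extension (not in TXM): also accept y' and Y'.
    static final String PCLITIC_FR_Y = "[yY]['\u2018\u2019]|" + PCLITIC_FR;

    // Returns the proclitic found at the start of the token, or "" if there is none.
    // Assumption: the proclitic must be followed by at least one more character.
    static String proclitic(String pattern, String token) {
        Matcher m = Pattern.compile("^(" + pattern + ")(.)").matcher(token);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        for (String token : new String[] {"y'a", "Y'en", "y'avait", "n'en"}) {
            System.out.println(token
                + " | current: [" + proclitic(PCLITIC_FR, token) + "]"
                + " | extended: [" + proclitic(PCLITIC_FR_Y, token) + "]");
        }
    }
}
```

Under these assumptions the current pattern leaves "y'a", "Y'en" and "y'avait" unsplit, while the extended one isolates "y'" or "Y'"; "n'en" is split the same way by both.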
(ii) The handling of the French euphonic "t" ("t euphonique") is also unclear. Here are examples taken from VOEUX:
0013 L'année 1971 n'en [a-t]_ADJ [-elle]_PRO:PER pas apporté quelques preuves ?
0014 l'année de la sagesse. [Puisse-t-]_NOM elle, Français, Françaises, être pour chacun et
0021 de fête. 1980 nous [apportera-t]_VER:simp [-il]_PRO:PER la paix ou la guerre ?
0035 qu'on nous annonce [amorcera-]_NOM [t]_VER:simp [-elle]_PRO:PER la décrue du chômage ?
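As an aid to diagnosing point (ii), the Java sketch below shows one way a segmentation such as [a-t] [-elle] can arise from the FClitic_fr pattern quoted in the description. This ticket does not show how TXM applies the pattern, so the sketch is a hypothesis to check against the tokenizer code, not a statement of the actual cause. The class name and the helper `split` are illustrative assumptions. With a greedy stem group, the longest stem wins and only "-elle" or "-il" is detached; with a reluctant stem group, the shortest stem wins and the "-t-elle" or "-t-il" alternatives apply.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EncliticSketch {
    // FClitic_fr as quoted in this ticket (accents and curly apostrophes as escapes).
    static final String FCLITIC_FR =
        "-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-la|-les?|-leur|-lui|-m\u00eames?"
        + "|-m['\u2018\u2019]|-moi|-nous|-on|-toi|-tu|-t['\u2018\u2019]|-vous|-en|-y|-ci|-l\u00e0";

    // Splits a token into {stem, enclitic}, or returns {token} if no enclitic matches.
    // stemGroup is "(.+)" for a greedy stem or "(.+?)" for a reluctant one.
    static String[] split(String stemGroup, String token) {
        Matcher m = Pattern.compile(stemGroup + "(" + FCLITIC_FR + ")").matcher(token);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : new String[] {token};
    }

    public static void main(String[] args) {
        for (String token : new String[] {"a-t-elle", "apportera-t-il"}) {
            System.out.println(token
                + " | greedy: " + String.join(" + ", split("(.+)", token))
                + " | reluctant: " + String.join(" + ", split("(.+?)", token)));
        }
    }
}
```

In this sketch the greedy variant yields "a-t" + "-elle" and "apportera-t" + "-il", which is the shape seen in examples 0013 and 0021, while the reluctant variant yields "a" + "-t-elle" and "apportera" + "-t-il". Examples 0014 and 0035 involve different splits and are not covered by it.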

History

#1 Updated by Sebastien Jacquot about 1 year ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#2 Updated by Matthieu Decorde 5 months ago

  • Target version changed from TXM 0.8.0 to TXM X.X

#3 Updated by Benedicte Pincemin 5 months ago

  • Description updated (diff)

#4 Updated by Matthieu Decorde 5 months ago

  • Priority changed from Normal to High
  • Target version changed from TXM X.X to TXM 0.8.1

#5 Updated by Benedicte Pincemin 5 months ago

  • Description updated (diff)
  • Priority changed from High to Normal
  • Target version changed from TXM 0.8.1 to TXM X.X

#6 Updated by Benedicte Pincemin 5 months ago

  • Priority changed from Normal to High
  • Target version changed from TXM X.X to TXM 0.8.1

#7 Updated by Benedicte Pincemin 5 months ago

  • Description updated (diff)
