Bug #1292

RCP: 0.7.7, in some cases, TreeTagger provides an incorrect lemma

Added by Sebastien Jacquot over 4 years ago. Updated 7 months ago.

Status:New Start date:03/31/2015
Priority:Normal Due date:
Assignee:- % Done:0%

Category:Import Spent time: -
Target version:TXM X.X

Description

In some cases, still to be identified, TreeTagger does not assign the right lemma to a token.
For example, in French, the .tt files can contain the lemmas "l'", "le" and "la" for the tokens "l'", "le" and "la", rather than the lemma "le" for all three tokens.
This leads to incomprehensible results in frlemma property queries, indexes, concordances, etc.

This behavior is easy to reproduce when TreeTagger is configured with a language different from that of the corpus.
I guess this behavior may also occur when using a very poor fr.par file (without lemmas or with only a few lemmas)?

Possible reasons for this behavior:

  • wrong language chosen in the corpus import form
  • wrong .par file name, e.g. an en.par file renamed to fr.par

To inform or warn the user, first proposals:

In principle, our language guesser could flag its surprise at the TT language model chosen, [...]
Another approach could be:
1) compute the ratio (number of lemma occurrences whose form differs from the graphical form) / (number of lemma occurrences whose form is identical to the graphical form), and when this ratio is below a threshold,
2) run TT without the 'use the form as the lemma by default' option, and if the ratio (number of unknown lemmas) / (number of known lemmas) is below a threshold, raise a diagnostic questioning the chosen language model
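
The two ratios in this proposal could be computed roughly as follows. This is only a sketch, not TXM code: it assumes the tagger output is available as tab-separated token / POS / lemma lines (the usual TreeTagger layout; the actual .tt layout may differ), that the "form as lemma by default" option is presumably TreeTagger's -no-unknown flag, and that unknown lemmas are written as "<unknown>" when that flag is off. The function names and the threshold value are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# Lemma assumed to be emitted for unknown words when the
# "form as lemma by default" option is disabled (assumption).
UNKNOWN = "<unknown>"


def parse_tt(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (token, lemma) pairs from tab-separated token/POS/lemma lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:  # skip anything that is not a 3-column line
            yield parts[0], parts[2]


def differing_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Step 1: (lemma != token occurrences) / (lemma == token occurrences)."""
    differing = identical = 0
    for token, lemma in pairs:
        if lemma == token:
            identical += 1
        else:
            differing += 1
    return differing / identical if identical else float("inf")


def unknown_ratio(pairs: Iterable[Tuple[str, str]]) -> float:
    """Step 2: (unknown lemma occurrences) / (known lemma occurrences)."""
    unknown = known = 0
    for _, lemma in pairs:
        if lemma == UNKNOWN:
            unknown += 1
        else:
            known += 1
    return unknown / known if known else float("inf")


# Hypothetical threshold for step 1; a real value would need tuning on corpora.
STEP1_THRESHOLD = 0.5

sample = [
    "l'\tDET:ART\tl'",
    "le\tDET:ART\tle",
    "la\tDET:ART\tla",
    "chats\tNOM\tchat",
]
ratio = differing_ratio(parse_tt(sample))
needs_second_pass = ratio < STEP1_THRESHOLD  # would trigger step 2
```

On the sample above, only one lemma out of four differs from its token, so the step 1 ratio is 1/3 and the second pass would be triggered with the illustrative threshold. The step 2 decision is left to the caller, using unknown_ratio on the output of the second TT run.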

History

#1 Updated by Sebastien Jacquot over 4 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde over 4 years ago

  • Subject changed from RCP: 0.7.7, in some cases, TreeTagger does not tag right lemma to RCP: 0.7.7, in some cases, TreeTagger provides an incorrect lemma

#3 Updated by Matthieu Decorde about 4 years ago

  • Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)

#4 Updated by Sebastien Jacquot over 1 year ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#5 Updated by Matthieu Decorde 7 months ago

  • Target version changed from TXM 0.8.0 to TXM X.X
