Bug #1292
RCP: 0.7.7, in some cases, TreeTagger provides an incorrect lemma
Status: | New | Start date: | 03/31/2015
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 0%
Category: | Import | Spent time: | -
Target version: | TXM X.X | |
Description
In some cases that remain to be identified, TreeTagger does not assign the right lemma to a token.
For example, in French, the .tt files can contain the lemmas "l'", "le" and "la" for the tokens "l'", "le" and "la", rather than the lemma "le" for all three tokens.
This leads to incomprehensible results in Queries, Indexes, Concordances, etc. that use the frlemma property.
This behavior can easily be reproduced by configuring TreeTagger with a language different from that of the corpus.
I guess this behavior may also occur when using a very poor fr.par file (one with no lemmas or only a few lemmas)?
Possible causes of this behavior:
- wrong language chosen in the corpus import form
- wrong .par file name, e.g. an en.par file renamed to fr.par
First proposals for informing or warning the user:
In principle, our language guesser could express its surprise at the choice of TT language model to use, [...] Another way to proceed could be: 1) compute the ratio (number of occurrences of lemmas whose form differs from the graphical form) / (number of occurrences of lemmas whose form is identical to the graphical form), and when this ratio is below a threshold, 2) apply TT without the option 'put the form in the lemma by default', and if the ratio (number of unknown lemmas) / (number of known lemmas) is below a threshold, trigger a "surprise" diagnostic about the chosen language model.
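Step 1 of this proposal could be sketched as follows. This is only an illustration, not TXM code: it assumes the tagged output is available as tab-separated `form`, `POS`, `lemma` lines (the usual TreeTagger output layout; the actual .tt file layout in TXM may differ), and the function names, the POS tags in the sample and the threshold value of 0.05 are all hypothetical.

```python
def parse_tagged_lines(lines):
    """Yield (form, lemma) pairs from tab-separated 'form<TAB>pos<TAB>lemma' lines.

    Assumption: three columns per token; other lines are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        form, _pos, lemma = parts
        yield form, lemma


def lemma_form_ratio(pairs):
    """Return (#lemmas differing from the form) / (#lemmas identical to the form).

    Returns infinity when no lemma is identical to its form, including
    when the input is empty.
    """
    differing = 0
    identical = 0
    for form, lemma in pairs:
        if form == lemma:
            identical += 1
        else:
            differing += 1
    if identical == 0:
        return float("inf")
    return differing / identical


def looks_suspicious(lines, threshold=0.05):
    """Step 1 only: flag output where almost every lemma just copies the form.

    The threshold is a hypothetical value and would need tuning on real corpora.
    """
    return lemma_form_ratio(parse_tagged_lines(lines)) < threshold


if __name__ == "__main__":
    # Illustrative sample reproducing the symptom described above.
    sample = ["l'\tDET\tl'", "le\tDET\tle", "la\tDET\tla"]
    print(looks_suspicious(sample))
```

Step 2 (re-running TreeTagger without the default-lemma option and counting unknown lemmas) is not covered here, since it depends on how TXM invokes TreeTagger.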
History
#1 Updated by Sebastien Jacquot about 8 years ago
- Description updated (diff)
#2 Updated by Matthieu Decorde about 8 years ago
- Subject changed from RCP: 0.7.7, in some cases, TreeTagger does not tag right lemma to RCP: 0.7.7, in some cases, TreeTagger provides an incorrect lemma
#3 Updated by Matthieu Decorde over 7 years ago
- Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)
#4 Updated by Sebastien Jacquot almost 5 years ago
- Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
#5 Updated by Matthieu Decorde about 4 years ago
- Target version changed from TXM 0.8.0 to TXM X.X