Feature #1807

RCP: X.X, TreeTagger extension, training fails with incomplete lexicon

Added by Matthieu Decorde over 9 years ago. Updated about 9 years ago.

Status: New Start date: 08/06/2016
Priority: Normal Due date:
Assignee: - % Done: 0%
Category: Commands Spent time: -
Target version: TXM Dictionaries X.X

Description

We want to train the fro2.par parameter file on the BFMGOLD2 corpus, using the frolex-tt.tsv lexicon.

TreeTagger fails with the following message:

TRAIN : BFMGOLD2 with /home/alavrent/LemmatisationTXM/frolex-tt.tsv to create fro2.par with properties [pos, frolemma]
TT SRC file: /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt
Warning, lexicon errors (20) found with words:
Quil=[]
tres=[]
Amen=[]
dune=[ADVint]
o=[]
fors=[]
Dum=[ADVint]
buver=[]
»=[]
Dumne=[ADVint]
... errors display is trucated, see /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/errors.txt
Adding words to a temporary lexicon: /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix
Running 

ERROR: Missing lemma in line 579
Process exited abnormally with code 1 at Friday, 15 July 2016
/home/alavrent/Software/TreeTagger/bin/train-tree-tagger -quiet -st PONfrt -utf8 /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/openclasses.txt /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt /home/alavrent/fro2.par 
Done: /home/alavrent/fro2.par

Diagnostics

  • What is the 'frolemma' parameter used for? Is the training process supposed to read a lemma property from the gold corpus?
  • The lexicon errors come from upstream conversion errors, which are themselves caused by incorrect pos values in the source. We must therefore display conversion errors during the conversion process, for example "unknown from tag value: 'PONffbl'" or "unknown from tag value: 'PONfbl '".
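
The second point could be sketched as follows. This is only an illustration in Python, not the TXM conversion code: the function name `convert_tags`, the `(form, tag)` row shape and the contents of `tag_map` are all assumptions made for the example. The idea is to collect one message per unknown source tag instead of silently passing it on to the lexicon step.

```python
def convert_tags(rows, tag_map):
    """Convert (form, tag) pairs and collect unknown source tags.

    `rows` and `tag_map` are hypothetical stand-ins for the real
    conversion input and tag table.
    """
    converted, errors = [], []
    for line_no, (form, tag) in enumerate(rows, start=1):
        target = tag_map.get(tag)
        if target is None:
            # repr() keeps the quotes, so a trailing space stays visible
            errors.append(
                f"line {line_no}: unknown from tag value: {tag!r} (form {form!r})"
            )
            continue
        converted.append((form, target))
    return converted, errors


# Hypothetical sample data: two of the three tags are malformed.
rows = [("Quil", "PONffbl"), (".", "PONfrt"), (",", "PONfbl ")]
tag_map = {"PONfrt": "PONfrt", "PONfbl": "PONfbl"}
converted, errors = convert_tags(rows, tag_map)
for message in errors:
    print(message)
```

Quoting the tag with `repr()` makes a value such as 'PONfbl ' distinguishable from 'PONfbl' in the report.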

Solution

Update frolex-1.0.tsv to frolex-2.0.tsv using BFMGOLD form+pos[+F].
Add a lexicon check step before training TreeTagger.
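
A minimal sketch of such a check step, in Python. It assumes the lexicon layout described in the TreeTagger documentation: one word form per line, followed by tag/lemma pairs, each tag preceded by a tab and each lemma preceded by a blank or tab. The function name `check_lexicon_lines` and the sample entries are hypothetical, and whether this covers every case behind "Missing lemma in line 579" is an assumption.

```python
def check_lexicon_lines(lines):
    """Return (line_number, problem) pairs for malformed lexicon lines."""
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((line_no, "empty line"))
            continue
        # The word form ends at the first tab; the rest holds tag/lemma pairs.
        form, sep, rest = line.partition("\t")
        fields = rest.split()
        if not sep or not fields:
            problems.append((line_no, "no tag/lemma pair"))
        elif len(fields) % 2 != 0:
            # An odd number of fields means some tag has no lemma.
            problems.append((line_no, "missing lemma"))
    return problems


# Hypothetical sample entries, not taken from frolex-tt.tsv.
sample = [
    "dune\tADVint\n",
    "tres\tADVgen tres\n",
    "Quil\n",
    "o\tPRE o\tCONcoo o\n",
]
for line_no, problem in check_lexicon_lines(sample):
    print(f"line {line_no}: {problem}")
```

Run on the temporary .fix lexicon before calling train-tree-tagger, a check of this kind would report all offending lines at once rather than stopping at the first one.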

History

#1 Updated by Matthieu Decorde over 9 years ago

  • Description updated (diff)

#2 Updated by Alexey Lavrentev about 9 years ago

  • Description updated (diff)
