Feature #1807
RCP: X.X, TreeTagger extension, training fails with incomplete lexicon
Status: | New | Start date: | 08/06/2016 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % Done: | 0% |
Category: | Commands | Spent time: | - |
Target version: | TXM Dictionaries X.X | | |
Description
We want to train fro2.par with BFMGOLD2 and frolex-tt.tsv.
TreeTagger fails with the following message:
TRAIN : BFMGOLD2 with /home/alavrent/LemmatisationTXM/frolex-tt.tsv to create fro2.par with properties [pos, frolemma]
TT SRC file: /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt
Warning, lexicon errors (20) found with words: Quil=[] tres=[] Amen=[] dune=[ADVint] o=[] fors=[] Dum=[ADVint] buver=[] »=[] Dumne=[ADVint] ...
errors display is trucated, see /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/errors.txt
Adding words to a temporary lexicon: /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix
Running
ERROR: Missing lemma in line 579
Process exited abnormally with code 1 at Friday, 15 July 2016
/home/alavrent/Software/TreeTagger/bin/train-tree-tagger -quiet -st PONfrt -utf8 /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/openclasses.txt /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt /home/alavrent/fro2.par
Done: /home/alavrent/fro2.par
Diagnostics
- What is the 'frolemma' parameter used for? Is the learning process supposed to access some lemma property in the gold corpus?
- Lexicon errors come from upstream conversion errors, which in turn come from incorrect pos values in the source. We must therefore display conversion errors during the conversion process, for example "unknown from tag value: 'PONffbl'" or "unknown from tag value: 'PONfbl '" (note the trailing space in the second value).
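A minimal sketch of how such conversion errors could be reported, not TXM's actual conversion code. The `KNOWN_TAGS` set and the `check_tags` helper are hypothetical names introduced for illustration, and the real tagset is not listed in this ticket. Tags are compared verbatim, so a misspelling or a stray trailing space is reported instead of being passed on silently.

```python
# Hypothetical set of accepted source tags; the real tagset is not given
# in this ticket and would have to come from the conversion table.
KNOWN_TAGS = {"PONfbl", "PONfrt", "ADVint"}


def check_tags(tokens, known_tags=KNOWN_TAGS):
    """Split (form, tag) pairs into accepted tokens and error messages.

    Tags are compared verbatim: 'PONfbl ' (trailing space) is not
    'PONfbl' and is therefore reported as unknown.
    """
    accepted, errors = [], []
    for position, (form, tag) in enumerate(tokens, start=1):
        if tag in known_tags:
            accepted.append((form, tag))
        else:
            errors.append(
                f"token {position} ({form!r}): unknown from tag value: {tag!r}"
            )
    return accepted, errors


# Invented sample tokens reproducing the two bad values quoted above.
tokens = [(".", "PONfrt"), (",", "PONffbl"), (";", "PONfbl ")]
accepted, errors = check_tags(tokens)
for message in errors:
    print(message)
```

Printing the value with `repr` keeps the quotes around the tag, which makes whitespace problems such as `'PONfbl '` visible in the output.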
Solution
Update frolex-1.0.tsv to frolex-2.0.tsv using BFMGOLD form+pos[+F].
Add a lexicon check step before training TreeTagger.
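The lexicon check step could look roughly like the sketch below. It assumes the lexicon layout described in the TreeTagger documentation, with one word form per line followed by a tab and whitespace-separated tag/lemma pairs, and it assumes lemmas contain no spaces. The function name `check_lexicon_lines` and the sample entries are invented for illustration.

```python
def check_lexicon_lines(lines):
    """Return (line_number, message) pairs for malformed lexicon lines.

    Assumed layout: form<TAB>tag lemma[<whitespace>tag lemma ...].
    A tag without a following lemma leaves an odd number of fields.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        form, separator, rest = line.partition("\t")
        if not separator:
            problems.append((number, "no tab after word form"))
            continue
        fields = rest.split()
        if not fields:
            problems.append((number, "no tag-lemma pair"))
        elif len(fields) % 2 != 0:
            problems.append(
                (number, "missing lemma (odd number of tag/lemma fields)")
            )
    return problems


# Invented sample entries, not taken from frolex-tt.tsv.
sample = [
    "fors\tPRE fors\tADVgen fors\n",  # well formed: two tag/lemma pairs
    "dune\tADVint\n",                 # tag without a lemma
    "Quil\n",                         # word form only
]
problems = check_lexicon_lines(sample)
for number, message in problems:
    print(f"line {number}: {message}")
```

Running a check of this kind on the temporary `.fix` lexicon before calling train-tree-tagger would let the tool report every malformed line at once, rather than stopping at the first one as in the "Missing lemma in line 579" failure.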
History
#1 Updated by Matthieu Decorde over 9 years ago
- Description updated
#2 Updated by Alexey Lavrentev about 9 years ago
- Description updated