Feature #1807
RCP: X.X, TreeTagger extension, train failed with incomplete lexicon
Status: | New | Start date: | 06/08/2016
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 0%
Category: | Commands | Spent time: | -
Target version: | TXM Palafra 3.0 | |
Description
We want to train fro2.par with the BFMGOLD2 corpus and the frolex-tt.tsv lexicon.
TreeTagger fails with the following message:
TRAIN : BFMGOLD2 with /home/alavrent/LemmatisationTXM/frolex-tt.tsv to create fro2.par with properties [pos, frolemma]
TT SRC file: /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt
Warning, lexicon errors (20) found with words:
Quil=[] tres=[] Amen=[] dune=[ADVint] o=[] fors=[] Dum=[ADVint] buver=[] »=[] Dumne=[ADVint] ...
errors display is trucated, see /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/errors.txt
Adding words to a temporary lexicon: /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix
Running
ERROR: Missing lemma in line 579
Process exited abnormally with code 1 at Friday, 15 July 2016
/home/alavrent/Software/TreeTagger/bin/train-tree-tagger -quiet -st PONfrt -utf8 /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/openclasses.txt /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt /home/alavrent/fro2.par
Done: /home/alavrent/fro2.par
Diagnostics
- What is the 'frolemma' parameter used for? Is the training process supposed to read a lemma property from the gold corpus?
- The lexicon errors come from upstream conversion errors, which in turn come from incorrect pos values in the source. We must display conversion errors during the conversion process, for example "unknown from tag value: 'PONffbl'" or "unknown from tag value: 'PONfbl '".
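A minimal sketch of what reporting conversion errors could look like, assuming the conversion is a lookup of each source pos value in a mapping table. The function name `convert_tags`, the `(form, tag)` input shape and the `tag_map` argument are hypothetical and are not taken from the TXM code; only the message wording follows the examples quoted above.

```python
def convert_tags(tokens, tag_map):
    """Convert (form, tag) pairs and collect unknown source tags.

    Hypothetical helper: `tokens` is a list of (form, tag) pairs and
    `tag_map` maps source tags to target tags. Neither comes from TXM.
    """
    converted = []
    errors = []
    for index, (form, tag) in enumerate(tokens, start=1):
        if tag in tag_map:
            converted.append((form, tag_map[tag]))
        else:
            # repr() keeps stray whitespace visible, e.g. 'PONfbl '
            errors.append(
                f"token {index} ({form!r}): unknown from tag value: {tag!r}"
            )
            # Keep the token, but mark its tag as unresolved.
            converted.append((form, None))
    return converted, errors
```

Printing the returned `errors` list as soon as the conversion ends would surface typos such as a doubled letter or a trailing space before they reach the lexicon.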
Solution
Update frolex-1.0.tsv to frolex-2.0.tsv using BFMGOLD form+pos[+F].
Add a lexicon check step before training TreeTagger.
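The lexicon check step could be sketched as follows. This assumes the lexicon layout described in the TreeTagger training documentation: one word form per line, a tab, then whitespace-separated tag-lemma pairs. It also assumes that lemmas contain no whitespace. The function name `check_lexicon_lines` and the problem labels are hypothetical.

```python
def check_lexicon_lines(lines):
    """Return (line_number, problem) pairs for malformed lexicon lines.

    Assumed format: FORM<TAB>TAG LEMMA [TAG LEMMA ...]
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        form, sep, rest = line.partition("\t")
        if not sep:
            problems.append((number, "no tab after word form"))
            continue
        if not form:
            problems.append((number, "empty word form"))
            continue
        fields = rest.split()
        if not fields:
            problems.append((number, "no tag-lemma pair"))
        elif len(fields) % 2 != 0:
            # An odd number of fields means some tag has no lemma.
            problems.append((number, "tag without lemma"))
    return problems
```

Running such a check on the generated .fix lexicon before calling train-tree-tagger would report every offending line at once, whereas the trainer stops at the first one ("Missing lemma in line 579").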
History
#1 Updated by Matthieu Decorde over 7 years ago
- Description updated
#2 Updated by Alexey Lavrentev about 7 years ago
- Description updated