Feature #1807

Updated by Alexey Lavrentev about 4 years ago

We want to train fro2.par with the BFMGOLD2 / BFMGOLD corpus and the frolex-tt.tsv / frolex-1.0.tsv lexicon.

TreeTagger fails with the following message:
<pre>
TRAIN : BFMGOLD2 with /home/alavrent/LemmatisationTXM/frolex-tt.tsv to create fro2.par with properties [pos, frolemma]
TT SRC file: /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt
Warning,
Your lexicon errors (20) found with words: sucks.
Quil=[]
tres=[]
Amen=[]
dune=[ADVint]
o=[]
fors=[]
Dum=[ADVint]
buver=[]
»=[]
Dumne=[ADVint]
... errors display is trucated, see /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/errors.txt
Adding words to a temporary lexicon: /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix
Running

ERROR: Missing lemma in line 579
Process exited abnormally with code 1 at Friday, 15 July 2016
/home/alavrent/Software/TreeTagger/bin/train-tree-tagger -quiet -st PONfrt -utf8 /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/openclasses.txt /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt /home/alavrent/fro2.par
Done: /home/alavrent/fro2.par
</pre>

h3. Diagnostics

* what is the 'frolemma' parameter used for? Is the learning process supposed to access some lemma property in the gold corpus?
* lexicon errors come from upstream conversion errors, which in turn come from incorrect pos values -> we must display conversion errors during the conversion process, for example: "unknown from tag value: 'PONffbl'", "unknown from tag value: 'PONfbl '"
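
To illustrate the second point, here is a minimal sketch of reporting unknown pos values while converting, written in Python rather than in the actual TXM code. The function names and the @KNOWN_TAGS@ set are hypothetical: the set holds only a few values that appear in this ticket, and the real tagset would have to be loaded from the project's tag list. Tags are compared verbatim, so a value with a trailing space such as 'PONfbl ' is reported.

```python
# Hypothetical tagset: a few values seen in this ticket, not the real list.
KNOWN_TAGS = {"ADVint", "PONfbl", "PONfrt"}


def find_unknown_tags(rows, known_tags=KNOWN_TAGS):
    """Return (line_number, form, tag) for every tag not in known_tags.

    rows is an iterable of (form, tag) pairs in corpus order.
    Tags are compared as-is, so stray whitespace makes a tag unknown.
    """
    problems = []
    for line_number, (form, tag) in enumerate(rows, start=1):
        if tag not in known_tags:
            problems.append((line_number, form, tag))
    return problems


def format_problem(problem):
    """Build a message; repr() makes trailing spaces in the tag visible."""
    line_number, form, tag = problem
    return f"line {line_number}: unknown tag value {tag!r} for form {form!r}"
```

Printing @format_problem(p)@ for each reported problem during conversion would surface the bad values before they reach the lexicon.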


h3. Solution

Update frolex-1.0.tsv to frolex-2.0.tsv using BFMGOLD form+pos[+F].
Add a lexicon check step before training TreeTagger.
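
A possible shape for the lexicon check step, as a hedged sketch only. It assumes the lexicon layout described in the TreeTagger documentation (one word form per line, a tab, then whitespace-separated tag-lemma pairs) and that lemmas contain no spaces. The function name @check_lexicon_lines@ and the message strings are hypothetical and not part of TXM or TreeTagger.

```python
def check_lexicon_lines(lines):
    """Return (line_number, message) for each malformed lexicon line.

    Assumed layout: form <TAB> tag lemma [<TAB or space> tag lemma ...]
    """
    problems = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((line_number, "empty line"))
            continue
        # Split the word form from the tag-lemma part at the first tab.
        form, sep, rest = line.partition("\t")
        if not sep or not form:
            problems.append((line_number, "missing form or tab separator"))
            continue
        fields = rest.split()
        if not fields:
            problems.append((line_number, "no tag-lemma pair"))
        elif len(fields) % 2 != 0:
            # An odd number of fields means some tag has no lemma.
            problems.append((line_number, "tag without lemma"))
    return problems
```

Run on the lexicon file (for example with @check_lexicon_lines(open(path, encoding="utf-8"))@) before calling train-tree-tagger, this would list the offending line numbers so that training can be skipped when the list is not empty.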
