Feature #1807

RCP: X.X, TreeTagger extension, training fails with an incomplete lexicon

Added by Matthieu Decorde almost 3 years ago. Updated almost 3 years ago.

Status: New
Start date: 06/08/2016
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Commands
Spent time: -
Target version: TXM Palafra 3.0

Description

We want to train fro2.par with BFMGOLD2 and frolex-tt.tsv.

TreeTagger fails with the following message:

TRAIN : BFMGOLD2 with /home/alavrent/LemmatisationTXM/frolex-tt.tsv to create fro2.par with properties [pos, frolemma]
TT SRC file: /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt
Warning, lexicon errors (20) found with words:
Quil=[]
tres=[]
Amen=[]
dune=[ADVint]
o=[]
fors=[]
Dum=[ADVint]
buver=[]
»=[]
Dumne=[ADVint]
... errors display is trucated, see /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/errors.txt
Adding words to a temporary lexicon: /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix
Running 

ERROR: Missing lemma in line 579
Process exited abnormally with code 1 at Friday, 15 July 2016
/home/alavrent/Software/TreeTagger/bin/train-tree-tagger -quiet -st PONfrt -utf8 /home/alavrent/LemmatisationTXM/frolex-tt.tsv.fix /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/openclasses.txt /home/alavrent/TXM/corpora/BFMGOLD2/treetagger/BFMGOLD2.tt /home/alavrent/fro2.par 
Done: /home/alavrent/fro2.par

Diagnostics

  • What is the 'frolemma' parameter used for? Is the training process supposed to read a lemma property from the gold corpus?
  • The lexicon errors come from upstream conversion errors, which in turn come from incorrect pos values. We must display conversion errors during the conversion process, for example "unknown from tag value: 'PONffbl'" or "unknown from tag value: 'PONfbl '".
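
The second point could be sketched roughly as follows. This is only an illustration in Python, not the TXM conversion code: the function name find_unknown_tags, the KNOWN_TAGS set and the sample rows are all hypothetical, and the real tagset would have to come from the conversion tables. Matching is exact, so a value with a trailing space such as 'PONfbl ' is reported as well.

```python
def find_unknown_tags(rows, known_tags):
    """Return (line_number, form, tag) for every tag that is not in known_tags.

    rows is an iterable of (form, tag) pairs; comparison is exact, so stray
    whitespace or a doubled letter in a tag counts as an unknown value.
    """
    problems = []
    for line_number, (form, tag) in enumerate(rows, start=1):
        if tag not in known_tags:
            problems.append((line_number, form, tag))
    return problems


# Hypothetical subset of the tagset, for illustration only.
KNOWN_TAGS = {"PONfbl", "PONfrt", "ADVint"}

# Hypothetical (form, tag) pairs standing in for the converted corpus.
rows = [("dune", "ADVint"), (".", "PONffbl"), (",", "PONfbl ")]

for line_number, form, tag in find_unknown_tags(rows, KNOWN_TAGS):
    print(f"line {line_number}: unknown tag value: {tag!r} (form {form!r})")
```

Printing the tag with its quotes makes whitespace problems visible, which a plain message would hide.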

Solution

Update frolex-1.0.tsv to frolex-2.0.tsv using BFMGOLD form+pos[+F].
Add a lexicon check step before training TreeTagger.
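
A minimal sketch of what such a lexicon check could look like, written in Python for illustration only. It assumes the lexicon follows the TreeTagger fullform layout (a word form, a tab, then whitespace-separated tag/lemma pairs) and that lemmas contain no spaces; the function names and sample lines are hypothetical. An odd number of tokens after the form is reported as a missing lemma, which is the kind of line that makes train-tree-tagger stop with "Missing lemma in line N".

```python
def check_lexicon_lines(lines):
    """Return (line_number, message) for every malformed lexicon line.

    Assumed layout: form <TAB> tag lemma [tag lemma ...]
    """
    errors = []
    for line_number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            errors.append((line_number, "empty line"))
            continue
        form, sep, rest = line.partition("\t")
        if not sep or not form:
            errors.append((line_number, "missing form or tab separator"))
            continue
        tokens = rest.split()
        if not tokens:
            errors.append((line_number, "no tag-lemma pair"))
        elif len(tokens) % 2 != 0:
            # A tag without its lemma (or a lemma without its tag).
            errors.append((line_number, "missing lemma"))
    return errors


def check_lexicon_file(path):
    """Run the check on a UTF-8 lexicon file."""
    with open(path, encoding="utf-8") as handle:
        return check_lexicon_lines(handle)


# Hypothetical sample lines, for illustration only.
sample = ["dune\tADVint dune\n", "tres\tADVgen\n", "Amen\n"]
for line_number, message in check_lexicon_lines(sample):
    print(f"line {line_number}: {message}")
```

Running this kind of check before training would let all faulty lines be listed at once instead of stopping at the first one.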

History

#1 Updated by Matthieu Decorde almost 3 years ago

  • Description updated (diff)

#2 Updated by Alexey Lavrentev almost 3 years ago

  • Description updated (diff)
