Feature #3520

Import, TreeTagger, upgrade TreeTagger options

Ajouté par Serge Heiden il y a presque 2 ans.

Statut:New Début:29/11/2023
Priorité:Normal Echéance:
Assigné à:- % réalisé:

0%

Catégorie:TAL Temps passé: -
Version cible:TXM 0.8.4

Description

Currently, TXM tokenises itself and calls tree-tagger directly. The TXM tokenizer separates clitics as the standard TreeTagger one, depending on language.

But TreeTagger parameter files often also depend on two additionnal lexicons related to tokenization to work properly (for example for Spoken French, Old French or Spanish):
  • abbreviations: a list of abbreviations for a language
  • mwls: a list of multi-tokens words for a language (called 'multi-words')

The full TreeTagger - Perl based - workflow is, for example for Spanish:

utf8-tokenize.perl -a spanish-abbreviations $* |
mwl-lookup.perl -f spanish-mwls |
tree-tagger -token -lemma -sgml spanish.par

Tools used

  • utf8-tokenize.perl: standard TreeTagger tokenizer
  • mwl-lookup.perl: merges multi-token words

New lexicons used

  • spanish-abbreviations:
    Ref.
    Vol.
    etc.
    App.
    Rec.
    
  • spanish-mwls:
    A diferencia de
    A diferencia del
    A fin de
    A lo largo de
    A medida que
    A menudo
    A partir de
    A pesar de
    ...
    

To be compatible with certain parameter files (Spoken French, Old French, Spanish...) we need to implement the abbreviations and mwls algorithms and use their lexicons.

Solution 1

  • add management between language names and abbreviations and mwls lexicons
  • add abbreviations processing to TXM tokenizer
  • implement mwl-lookup.perl multi-token processing

Solution 2

  • add management between language names and abbreviations and mwls lexicons
  • add a Perl processor
  • call utf8-tokenize.perl and mwl-lookup.perl directly

Formats disponibles : Atom PDF