Feature #449

TBX: x.x, Tokenizers strategy and components

Added by Matthieu Decorde almost 10 years ago. Updated almost 10 years ago.

Status: New
Start date: 11/13/2013
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

A) Add a mechanism (plugin-based or not) for selecting among different tokenizers.
B) Finish some of the existing tokenizers (TreeTagger Groovy tokenizer...)
C) Adapt the existing tokenizers to the selection mechanism.
D) Develop several tokenization strategies. Two simple strategies can be developed:
a) priority given to delimitation by separator characters (the TXM 0.7.2 strategy)
b) priority given to delimitation by constituent characters (the Weblex strategy)
[even though each strategy is somewhat mixed in practice]
E) Incorporate new components (Unitex, the tagging environment developed in Perl for TXM...)
F) Make it possible to connect linguistic resources to tokenizers (fixed expressions,
set phrases, lists of abbreviations containing periods, clitics, etc.)
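
To make points A and D more concrete, here is a minimal Python sketch of a name-based tokenizer registry with the two strategies side by side. It is only an illustration under stated assumptions, not the TXM 0.7.2 or Weblex implementation: the registry, the function names, the separator set and the definition of constituent characters are all hypothetical choices made for this example.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]

# Hypothetical registry mapping a strategy name to a tokenizer function.
TOKENIZERS: Dict[str, Tokenizer] = {}


def register(name: str) -> Callable[[Tokenizer], Tokenizer]:
    """Decorator that makes a tokenizer selectable by name."""
    def decorator(func: Tokenizer) -> Tokenizer:
        TOKENIZERS[name] = func
        return func
    return decorator


# Assumed separator set, for illustration only.
PUNCTUATION = ".,;:!?()\"'’-"
_P = re.escape(PUNCTUATION)
# A token is a run of non-separators, or a single punctuation separator.
_BY_SEPARATORS = re.compile(f"[^\\s{_P}]+|[{_P}]")

# Assumed constituents: letters and digits, optionally joined by an
# apostrophe or hyphen inside a word. Any other non-space character
# becomes a one-character token.
_BY_CONSTITUENTS = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*|\S")


@register("separators")
def tokenize_by_separators(text: str) -> List[str]:
    """Strategy a): separator characters always cut, whitespace is dropped."""
    return _BY_SEPARATORS.findall(text)


@register("constituents")
def tokenize_by_constituents(text: str) -> List[str]:
    """Strategy b): maximal runs of constituent characters stay together."""
    return _BY_CONSTITUENTS.findall(text)


def tokenize(text: str, strategy: str) -> List[str]:
    """Look up the requested strategy in the registry and apply it."""
    return TOKENIZERS[strategy](text)
```

With this sketch, the same input is segmented differently depending on the selected strategy:

```python
tokenize("arc-en-ciel, aujourd'hui", "separators")
# ['arc', '-', 'en', '-', 'ciel', ',', 'aujourd', "'", 'hui']
tokenize("arc-en-ciel, aujourd'hui", "constituents")
# ['arc-en-ciel', ',', "aujourd'hui"]
```

Neither result is right for every case (the second strategy would also keep a clitic such as "l'" attached to the following word), which is where the linguistic resources of point F would come in.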

History

#1 Updated by Serge Heiden almost 10 years ago

  • Subject changed from RCP: x.x, be able to plug another Tokenizer to TBX: x.x, Tokenizers strategy and components
  • Description updated
  • Priority changed from Normal to High
