Feature #449
TBX: x.x, Tokenizers strategy and components
Status: | New | Start date: | 11/13/2013
---|---|---|---
Priority: | High | Due date: |
Assignee: | - | % Done: | 0%
Category: | Import | Spent time: | -
Target version: | TXM X.X | |
Description
A) add a system (plugin or not) to be able to use different Tokenizers.
B) finish some existing tokenizers (TreeTagger Groovy tokenizer...)
C) adapt existing tokenizers for the selection system
D) develop various strategies in tokenizers:
Two simple tokenizer strategies can be developed:
a) priority given to delimiting tokens by separator characters (the TXM 0.7.2 strategy)
b) priority given to delimiting tokens by constituent characters (the Weblex strategy)
[even though in practice the strategies are somewhat mixed]
E) incorporate new components (Unitex, the tagging environment developed in Perl for TXM...)
F) be able to connect linguistic resources to tokenizers (frozen expressions,
multiword expressions, lists of abbreviations containing periods, clitics, etc.)
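
To make the difference between strategies a) and b) in item D concrete, here is a minimal Python sketch. It is not TXM or Weblex code: the character classes, the function names and the `TOKENIZERS` registry (a stand-in for the selection system of item A) are all assumptions made for illustration.

```python
import re

# Hypothetical character classes: the ticket does not define them.
SEPARATORS = r"\s.,;:!?()'"   # characters that end a token under strategy (a)
CONSTITUENTS = r"\w-"         # characters that may belong to a token under strategy (b)


def tokenize_by_separators(text):
    """Strategy (a): split on runs of separator characters.
    Any character not listed as a separator stays inside a token."""
    return [t for t in re.split(f"[{SEPARATORS}]+", text) if t]


def tokenize_by_constituents(text):
    """Strategy (b): keep maximal runs of constituent characters.
    Any character not listed as a constituent ends the token and is dropped."""
    return re.findall(f"[{CONSTITUENTS}]+", text)


# Hypothetical registry standing in for a tokenizer selection system (item A).
TOKENIZERS = {
    "separators": tokenize_by_separators,
    "constituents": tokenize_by_constituents,
}


def tokenize(text, strategy="separators"):
    """Look up the requested strategy by name and apply it."""
    return TOKENIZERS[strategy](text)
```

With these assumed classes, the two strategies differ only on characters that appear in neither list: for `"l'arc-en-ciel coûte 3€"`, strategy a) keeps `3€` as one token, while strategy b) yields `3` and drops the `€`.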
History
#1 Updated by Serge Heiden almost 10 years ago
- Subject changed from "RCP: x.x, be able to plug another Tokenizer" to "TBX: x.x, Tokenizers strategy and components"
- Description updated (diff)
- Priority changed from Normal to High