Feature #217
RCP: x.x, Manage enclitics in the tokenizer
Status: | Closed | Start date: | 07/08/2013 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % Done: | 100% |
Category: | Import | Spent time: | - |
Target version: | TXM 0.7.7 | | |
Description
The SimpleTokenizerXML does not use language-specific rules to tokenize clitics.
Solution 1
Use the TreeTagger clitic tokenizer rules for the fr, en and it languages, as defined in the "Gestion de la langue" section of https://groupes.renater.fr/wiki/txm-info/public/composant_de_tokenisation#solution_1_simpletokenizerxml
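As a rough illustration of what language-specific clitic rules could look like, here is a minimal Python sketch. The rule sets below are simplified assumptions written for this example; they are not copied from TreeTagger or from the TXM sources, and the names `CLITIC_RULES` and `split_clitics` are hypothetical. The actual SimpleTokenizerXML implementation may work differently.

```python
import re

# Illustrative, simplified rule sets (assumptions, not the TreeTagger or TXM rules).
# "proclitic": pattern tried at the start of a token (e.g. French "l'").
# "enclitic": pattern tried at the end of a token (e.g. English "n't").
CLITIC_RULES = {
    "fr": {
        "proclitic": r"(?:[dcjlmnst]'|qu'|jusqu'|lorsqu')",
        "enclitic": r"(?:-t-elles?|-t-ils?|-t-on|-ce|-elles?|-ils?|-je|-tu|-nous|-vous|-on)",
    },
    "en": {
        "proclitic": None,
        "enclitic": r"(?:'(?:s|re|ve|d|m|ll)|n't)",
    },
    "it": {
        "proclitic": r"(?:[dn]ell'|all'|sull'|[ld]'|un')",
        "enclitic": None,
    },
}


def split_clitics(token, lang):
    """Split one token into [proclitic?, core, enclitic?] using the rules for lang.

    Tokens in a language without rules are returned unchanged.
    """
    rules = CLITIC_RULES.get(lang)
    if rules is None:
        return [token]

    prefix, suffix = [], []

    # Detach at most one leading clitic, keeping a non-empty remainder.
    if rules["proclitic"]:
        m = re.match(rf"^({rules['proclitic']})(.+)$", token, re.IGNORECASE)
        if m:
            prefix.append(m.group(1))
            token = m.group(2)

    # Detach at most one trailing clitic, keeping a non-empty remainder.
    if rules["enclitic"]:
        m = re.match(rf"^(.+?)({rules['enclitic']})$", token, re.IGNORECASE)
        if m:
            suffix.append(m.group(2))
            token = m.group(1)

    return prefix + [token] + suffix


for word, lang in [("l'homme", "fr"), ("a-t-il", "fr"), ("don't", "en"), ("dell'arte", "it")]:
    print(lang, word, split_clitics(word, lang))
```

Keeping the rules in a per-language table means a language without an entry falls back to the current behaviour (no clitic splitting), which matches the ticket's scope of fr, en and it only.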
Solution 2
If TreeTagger lemmatization is not used, use another tokenizer, to be chosen from the existing solutions listed at https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#solution.
History
#1 Updated by Matthieu Decorde about 6 years ago
- Tracker changed from Task to Feature
- Description updated (diff)
- Target version changed from TXM X.X to TXM 0.7.7
- % Done changed from 0 to 80
#2 Updated by Matthieu Decorde about 6 years ago
- Description updated (diff)
#3 Updated by Serge Heiden about 6 years ago
- Description updated (diff)
#4 Updated by Matthieu Decorde almost 2 years ago
- % Done changed from 80 to 100
#5 Updated by Matthieu Decorde almost 2 years ago
- Status changed from New to Closed