Feature #1442
TBX: x.x, Chinese tokenizer and tagger with ZPar (ZH language)
Status: | New | Start date: | 24/08/2015
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 0%
Category: | Import | Spent time: | -
Target version: | TXM 0.X.X | |
Description
The ZPar software provides good tokenization and tagging results for Simplified Chinese. Integrate ZPar into TXM.
(See https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_le_chinois for the state of the art in Chinese NLP.)
To integrate this technology into TXM we must:
- make the tokenizer algorithm (and any NLP-related component) depend on the chosen source language. Currently there is no relation between the tokenizer and the language chosen for a corpus (except perhaps a locale value used to interpret Unicode character classes in the regular expressions of the tokenizer parameters). The logic must become "choose the best, or a particular, NLP technology for a given language, including tokenization". This logic can initially be restricted to the TXT import module (unless XML-aware NLP components can be used in the XML/w+CSV import module).
- build a component to associate source languages and NLP components
- link that component to the string collation policy given by the locale parameter (the language names used for string collation may differ from those used for NLP components)
- implement LT ZPar workflow in Java/Groovy
## in particular, include the one-word-per-line intermediate step and make the workflow re-runnable (re-entrant) so that word corrections (segmentation or tagging) are taken into account: the user must be able to correct the Chinese NLP component's results before finalizing the import process
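The requirements above could be sketched roughly as follows. This is only an illustration under stated assumptions, not TXM or ZPar code: every name in it (`NlpRegistrySketch`, `NlpComponent`, `LanguageProfile`, `Token`, the tab-separated line format) is hypothetical. It shows a registry that picks an NLP component per source language with a fallback, keeps the collation locale as a separate setting, and round-trips tokens through a one-word-per-line text so that corrected lines can be fed back into the import.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; none of these types exist in TXM or ZPar. */
final class NlpRegistrySketch {

    /** A word form and its tag, i.e. one line of the intermediate file. */
    static final class Token {
        final String form;
        final String tag;
        Token(String form, String tag) { this.form = form; this.tag = tag; }
    }

    /** A language-specific tokenizer and tagger, e.g. a wrapper around ZPar. */
    interface NlpComponent {
        List<Token> analyze(String text);
    }

    /** The NLP component and the collation locale are stored separately per language. */
    static final class LanguageProfile {
        final NlpComponent nlp;
        final Locale collation;
        LanguageProfile(NlpComponent nlp, Locale collation) {
            this.nlp = nlp;
            this.collation = collation;
        }
    }

    private final Map<String, LanguageProfile> profiles = new HashMap<>();
    private final LanguageProfile fallback;

    NlpRegistrySketch(LanguageProfile fallback) { this.fallback = fallback; }

    void register(String language, LanguageProfile profile) {
        profiles.put(language.toLowerCase(Locale.ROOT), profile);
    }

    /** Returns the profile registered for the language, else the fallback. */
    LanguageProfile forLanguage(String language) {
        return profiles.getOrDefault(language.toLowerCase(Locale.ROOT), fallback);
    }

    /** Placeholder analysis: split on whitespace and leave the tags empty. */
    static List<Token> whitespaceAnalyze(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String form : text.trim().split("\\s+")) {
            if (!form.isEmpty()) tokens.add(new Token(form, ""));
        }
        return tokens;
    }

    /** Writes one "form<TAB>tag" line per token so the user can edit the result. */
    static String toWordPerLine(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            sb.append(t.form).append('\t').append(t.tag).append('\n');
        }
        return sb.toString();
    }

    /** Reads the (possibly hand-corrected) lines back; this is the re-entry point. */
    static List<Token> fromWordPerLine(String content) {
        List<Token> tokens = new ArrayList<>();
        for (String line : content.split("\\r?\\n")) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            String[] cols = line.split("\t", 2);
            tokens.add(new Token(cols[0], cols.length > 1 ? cols[1] : ""));
        }
        return tokens;
    }
}
```

In such a design, the import module would call `forLanguage(...)` once per corpus, write the analysis with `toWordPerLine`, let the user correct the file, and restart from `fromWordPerLine` instead of re-running the NLP component.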
History
#1 Updated by Serge Heiden about 10 years ago
- Description updated
#2 Updated by Matthieu Decorde about 10 years ago
- Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)
#3 Updated by Sebastien Jacquot over 7 years ago
- Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
#4 Updated by Matthieu Decorde over 6 years ago
- Target version changed from TXM 0.8.0 to TXM 0.X.X