Feature #1442

Updated by Serge Heiden about 4 years ago

The ZPar software provides good tokenization and tagging results for Simplified Chinese. Integrate ZPar into TXM.

(See https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_le_chinois for Chinese NLP state of the art).

To integrate that technology into TXM we must:
# make the tokenizer algorithm (and any NLP-related component) depend on the source language choice. Currently there is no relation between the tokenizer and the language chosen for a corpus (except perhaps a locale value used to interpret Unicode character classes in the regular expressions of the tokenizer parameters). The logic must become "choose the best, or a particular, NLP technology for a given language, including tokenization". This logic can be restricted to the TXT import module at first (unless XML-aware NLP components can be used in the XML/w+CSV import module).
# build a component to associate source languages and NLP components
# link that component to the string collation policy given by the locale parameter (the language names used for string collation may differ from the language names used for NLP components)
# implement LT ZPar workflow in Java/Groovy
## include in particular the one-word-per-line intermediate step, and be able to re-run the workflow (make it re-entrant) to take word corrections (segmentation or tagging) into account: the user must be able to correct the results of the Chinese NLP component before finalizing the import process
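
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: @NlpComponentRegistry@, @NlpComponent@, @Token@ and @parseWordPerLine@ are hypothetical names that exist neither in TXM nor in ZPar, and the tab-separated "form, tag" layout of the one-word-per-line file is assumed, not taken from the ZPar documentation. The registry resolves a component from the corpus locale (full language tag first, then bare language code, then a default), which keeps the NLP language key separate from the collation locale. The parser reads back a one-word-per-line file, possibly corrected by hand, so that the workflow can be re-run from that point.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: none of these names exist in TXM or ZPar. */
class NlpComponentRegistry {

    /** A language-specific NLP component (tokenizer and tagger); hypothetical interface. */
    interface NlpComponent {
        String name();
    }

    /** One line of the intermediate one-word-per-line file: word form plus tag. */
    static final class Token {
        final String form;
        final String tag;

        Token(String form, String tag) {
            this.form = form;
            this.tag = tag;
        }
    }

    private final Map<String, NlpComponent> byLanguage = new HashMap<>();
    private final NlpComponent fallback;

    NlpComponentRegistry(NlpComponent fallback) {
        this.fallback = fallback;
    }

    /** Registers a component under an NLP language key such as "zh-CN" or "fr". */
    void register(String languageKey, NlpComponent component) {
        byLanguage.put(languageKey, component);
    }

    /** Looks up the full language tag first, then the bare language code, then the default. */
    NlpComponent resolve(Locale corpusLocale) {
        NlpComponent exact = byLanguage.get(corpusLocale.toLanguageTag());
        if (exact != null) {
            return exact;
        }
        return byLanguage.getOrDefault(corpusLocale.getLanguage(), fallback);
    }

    /** Parses one-word-per-line text, assumed to be "form<TAB>tag"; blank lines are skipped. */
    static List<Token> parseWordPerLine(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t", 2);
            tokens.add(new Token(cols[0], cols.length > 1 ? cols[1] : ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        NlpComponentRegistry registry = new NlpComponentRegistry(() -> "default-tokenizer");
        registry.register("zh-CN", () -> "zpar");
        System.out.println(registry.resolve(Locale.SIMPLIFIED_CHINESE).name());
        System.out.println(registry.resolve(Locale.FRENCH).name());
    }
}
```

Keying the registry on language tags rather than on the collation locale object itself is one possible way to let the two naming schemes diverge; a real implementation might need an explicit mapping table between them.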
