Feature #1442

TBX: x.x, Chinese tokenizer and tagger with ZPar (ZH language)

Added by Serge Heiden over 7 years ago. Updated over 3 years ago.

Status: New
Start date: 08/24/2015
Priority: Normal
Due date:
Assignee: -
% Done:
Category: Import
Spent time: -
Target version: TXM X.X


The ZPar software provides good tokenization and tagging results for Simplified Chinese. The goal is to integrate ZPar into TXM.

(See https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_le_chinois for the state of the art in Chinese NLP).

To integrate that technology into TXM we must:
  1. make the tokenizer algorithm (and any other NLP-related component) depend on the source language choice. Currently there is no relation between the tokenizer and the language chosen for a corpus (except perhaps a locale value used to interpret Unicode character classes in the regular expressions of the tokenizer parameters). The logic must become "choose the best, or a particular, NLP technology for a given language, including tokenization". This logic can initially be restricted to the TXT import module (unless XML-aware NLP components can be used in the XML/w+CSV import module).
  2. build a component that associates source languages with NLP components
  3. link that component to the string collation policy given by the locale parameter (the language names used for string collation may differ from the language names used for NLP components)
  4. implement the LT ZPar workflow in Java/Groovy
    1. in particular, include the one-word-per-line intermediate step and make the workflow re-entrant, so that it can be re-run to take word corrections (segmentation or tagging) into account: the user must be able to correct the results of the Chinese NLP component before finalizing the import process
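
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not TXM or ZPar code: the class `NlpRegistrySketch`, its methods, the component labels ("zpar", "default-tokenizer") and the tab-separated one-word-per-line layout are all hypothetical, and ZPar itself is not invoked. The sketch shows (a) a registry that reduces a collation-style locale such as `zh_CN` to a language code before looking up the NLP component, and (b) a one-word-per-line intermediate form that can be written out, corrected by hand, and read back on a second run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: none of these names exist in TXM or ZPar. */
class NlpRegistrySketch {

    /** One token with its part-of-speech tag. */
    static final class TaggedWord {
        final String form;
        final String tag;
        TaggedWord(String form, String tag) { this.form = form; this.tag = tag; }
    }

    // Language code (e.g. "zh") -> label of the NLP component to use.
    private final Map<String, String> componentByLanguage = new LinkedHashMap<>();
    private final String defaultComponent;

    NlpRegistrySketch(String defaultComponent) {
        this.defaultComponent = defaultComponent;
    }

    /** Reduces a locale such as "zh_CN" or "zh-CN" to its language code "zh". */
    static String languageOf(String localeTag) {
        return Locale.forLanguageTag(localeTag.replace('_', '-')).getLanguage();
    }

    void register(String localeTag, String component) {
        componentByLanguage.put(languageOf(localeTag), component);
    }

    /** Returns the component registered for the locale's language, or the default. */
    String componentFor(String localeTag) {
        return componentByLanguage.getOrDefault(languageOf(localeTag), defaultComponent);
    }

    /** Writes the intermediate form: one "form<TAB>tag" line per word. */
    static String toLines(List<TaggedWord> words) {
        StringBuilder sb = new StringBuilder();
        for (TaggedWord w : words) {
            sb.append(w.form).append('\t').append(w.tag).append('\n');
        }
        return sb.toString();
    }

    /** Reads the (possibly hand-corrected) intermediate form back in. */
    static List<TaggedWord> fromLines(String text) {
        List<TaggedWord> words = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // tolerate blank lines left by manual editing
            }
            String[] cols = line.split("\t", 2);
            words.add(new TaggedWord(cols[0], cols.length > 1 ? cols[1] : ""));
        }
        return words;
    }
}
```

With such a layout, a re-run of the import would read the corrected lines through `fromLines` instead of calling the tagger again, so a user who splits or merges lines changes the segmentation, and a user who edits the second column changes the tagging. Keeping the locale-to-language reduction in one place (`languageOf`) is one way to decouple collation locale names from NLP language names.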


#1 Updated by Serge Heiden over 7 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde about 7 years ago

  • Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)

#3 Updated by Sebastien Jacquot over 4 years ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#4 Updated by Matthieu Decorde over 3 years ago

  • Target version changed from TXM 0.8.0 to TXM X.X
