Feature #1442

TBX: x.x, Chinese tokenizer and tagger with ZPar (ZH language)

Added by Serge Heiden about 10 years ago. Updated over 6 years ago.

Status: New  Start date: 24/08/2015
Priority: Normal  Due date:
Assignee: -  % Done: 0%
Category: Import  Spent time: -
Target version: TXM 0.X.X

Description

The ZPar software provides good tokenization and tagging results for Simplified Chinese. Integrate ZPar into TXM.

(See https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#etat_de_l_art_pour_le_chinois for the state of the art in Chinese NLP.)

To integrate this technology into TXM we must:
  1. make the tokenizer algorithm (and any NLP-related component) depend on the chosen source language. Currently there is no relation between the tokenizer and the language chosen for a corpus (except perhaps a locale value used to interpret Unicode character classes in the regular expressions of the tokenizer parameters). The logic must become "choose the best, or a particular, NLP technology for a given language, including tokenization". At first, this logic can be restricted to the TXT import module (unless XML-aware NLP components can be used in the XML/w+CSV import module).
  2. build a component that associates source languages with NLP components
  3. link that component to the string collation policy given by the locale parameter (the language names used for string collation may differ from the language names used for NLP components)
  4. implement the LT ZPar workflow in Java/Groovy
    1. in particular, include the one-word-per-line intermediate step and make the workflow re-entrant, so that it can be re-run to take word corrections (segmentation or tagging) into account: the user must be able to correct the results of the Chinese NLP component before finalizing the import process
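
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not TXM or ZPar code: the class `NlpRegistrySketch`, the component names `"zpar"` and `"regex-tokenizer"`, and the tab-separated "form, TAB, tag" layout of the one-word-per-line file are all hypothetical. The sketch shows (a) choosing an NLP component from the language part of the corpus locale, which keeps the NLP language name separate from the full collation locale, and (b) a parse/serialize pair for the intermediate file, so that a user-corrected file can be read back when the workflow is re-run.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: none of these names exist in TXM or ZPar. */
final class NlpRegistrySketch {

    /** One line of the assumed intermediate file: word form, TAB, tag. */
    static final class Token {
        final String form;
        final String tag;
        Token(String form, String tag) { this.form = form; this.tag = tag; }
    }

    /** Assumed fallback when no language-specific component is registered. */
    static final String DEFAULT_COMPONENT = "regex-tokenizer";

    /** Hypothetical mapping: ISO 639 language code -> NLP component name. */
    private static final Map<String, String> COMPONENTS = new LinkedHashMap<>();
    static {
        COMPONENTS.put("zh", "zpar"); // assumption: Chinese is routed to ZPar
    }

    /**
     * Picks a component from a BCP 47 locale tag such as "zh-CN".
     * Only the language subtag is used; the full tag stays available
     * for the string collation policy.
     */
    static String componentFor(String localeTag) {
        String lang = Locale.forLanguageTag(localeTag).getLanguage();
        return COMPONENTS.getOrDefault(lang, DEFAULT_COMPONENT);
    }

    /** Reads the one-word-per-line text; empty lines are skipped. */
    static List<Token> parse(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.trim().isEmpty()) continue;
            String[] cols = line.split("\t", 2);
            tokens.add(new Token(cols[0], cols.length > 1 ? cols[1] : ""));
        }
        return tokens;
    }

    /** Writes tokens back in the same layout, one word per line. */
    static String serialize(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            sb.append(t.form).append('\t').append(t.tag).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(componentFor("zh-CN"));
        // "w1"/"w2" and "A"/"B" are placeholders, not real ZPar output.
        List<Token> tokens = parse("w1\tA\nw2\tB\n");
        System.out.print(serialize(tokens));
    }
}
```

In such a design the re-entrant run would start from `parse` on the corrected file instead of calling the NLP component again, so manual fixes to segmentation or tagging survive until the import is finalized.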

History

#1 Updated by Serge Heiden about 10 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde about 10 years ago

  • Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)

#3 Updated by Sebastien Jacquot over 7 years ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#4 Updated by Matthieu Decorde over 6 years ago

  • Target version changed from TXM 0.8.0 to TXM 0.X.X
