Feature #217

RCP: x.x, Manage enclitics in the tokenizer

Added by Matthieu Decorde almost 6 years ago. Updated over 2 years ago.

Status: New
Start date: 07/08/2013
Priority: Normal
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.7.7

Description

The SimpleTokenizerXML does not use language-specific rules to tokenize clitics.

Solution 1

Use the TreeTagger clitic tokenization rules for the fr, en and it languages, as defined in the "Gestion de la langue" section of https://groupes.renater.fr/wiki/txm-info/public/composant_de_tokenisation#solution_1_simpletokenizerxml
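
To illustrate what language-specific clitic rules could look like, here is a minimal Python sketch. It is not the SimpleTokenizerXML code and not the TreeTagger rule set: the patterns in CLITIC_RULES are short, incomplete examples chosen for illustration, and the names CLITIC_RULES, split_clitics and tokenize are hypothetical. It assumes words are already separated by whitespace and that apostrophes are the straight ASCII character.

```python
import re

# Illustrative and deliberately incomplete clitic patterns per language.
# ASSUMPTION: these are examples for the sketch, not the actual rules
# shipped with TreeTagger or TXM.
CLITIC_RULES = {
    "fr": {
        "proclitic": r"(?:[dcjlmnst]|qu|jusqu|lorsqu)'",
        "enclitic": (
            r"-t-(?:ils?|elles?|on)"
            r"|-(?:je|tu|ils?|elles?|on|nous|vous|ce|moi|toi|lui|leur|les?|la|y|en)"
        ),
    },
    "en": {
        "proclitic": None,
        "enclitic": r"n't|'(?:s|re|ve|d|m|ll)",
    },
    "it": {
        "proclitic": r"(?:[dn]ell|[dl]|all|sull|un|quest)'",
        "enclitic": None,
    },
}


def split_clitics(word, lang):
    """Split one whitespace-delimited word into clitics and host word."""
    rules = CLITIC_RULES.get(lang)
    if rules is None:
        # Unknown language: leave the word untouched.
        return [word]

    head, tail = [], []

    if rules["proclitic"]:
        # Peel proclitics off the front, one at a time.
        pro = re.compile(r"^(" + rules["proclitic"] + r")(.+)$", re.IGNORECASE)
        m = pro.match(word)
        while m:
            head.append(m.group(1))
            word = m.group(2)
            m = pro.match(word)

    if rules["enclitic"]:
        # Peel enclitics off the end, one at a time, keeping their order.
        enc = re.compile(r"^(.+?)(" + rules["enclitic"] + r")$", re.IGNORECASE)
        m = enc.match(word)
        while m:
            tail.insert(0, m.group(2))
            word = m.group(1)
            m = enc.match(word)

    return head + [word] + tail


def tokenize(text, lang):
    """Whitespace tokenization followed by clitic splitting."""
    return [tok for word in text.split() for tok in split_clitics(word, lang)]
```

With these example patterns, tokenize("Qu'a-t-il dit", "fr") gives ["Qu'", "a", "-t-il", "dit"], and split_clitics("don't", "en") gives ["do", "n't"]. Purely pattern-based rules of this kind also split lexicalized compounds such as "rendez-vous", so a real rule set would likely need an exception list.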

Solution 2

Use another tokenizer, to be chosen among the existing solutions listed at https://groupes.renater.fr/wiki/txm-info/public/specs_import_annotation_lexicale_auto#solution, if TreeTagger lemmatization is not used.

History

#1 Updated by Matthieu Decorde over 2 years ago

  • Tracker changed from Task to Feature
  • Description updated (diff)
  • Target version changed from TXM X.X to TXM 0.7.7
  • % Done changed from 0 to 80

#2 Updated by Matthieu Decorde over 2 years ago

  • Description updated (diff)

#3 Updated by Serge Heiden over 2 years ago

  • Description updated (diff)
