Feature #3051

Tokenizer, separate the XML parsing and the String tokenization processes

Added by Matthieu Decorde about 1 month ago. Updated about 1 month ago.

Status: New
Start date: 04/09/2021
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Annotation
Spent time: -
Target version: TXM - Eltec 1.0

Description

The TXM tokenizer class (SimpleXMLTokenizer) must be split into two classes:
  • SimpleXMLTokenizer
  • SimpleStringTokenizer

SimpleXMLTokenizer handles the XML parsing and, by default, uses the SimpleStringTokenizer class to tokenize the text content.

This will make it possible to plug in another StringTokenizer (such as the UDPipe tokenizer).
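
A minimal sketch of the proposed split, to illustrate the delegation. It is not the TXM code: the StringTokenizer interface, the method signatures, the regular expression and the TokenizerSketch wrapper class are all assumptions made for this example. Only the two class names come from this ticket.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class TokenizerSketch {

    /** Hypothetical contract: turns a plain string into a list of tokens. */
    interface StringTokenizer {
        List<String> tokenize(String text);
    }

    /** Illustrative default: runs of letters/digits are words, any other non-space character is its own token. */
    static class SimpleStringTokenizer implements StringTokenizer {
        private static final Pattern TOKEN =
                Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

        @Override
        public List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(text);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens;
        }
    }

    /** Handles the XML side only and delegates text nodes to a StringTokenizer. */
    static class SimpleXMLTokenizer {
        private final StringTokenizer stringTokenizer;

        /** Uses SimpleStringTokenizer by default, as the ticket describes. */
        SimpleXMLTokenizer() {
            this(new SimpleStringTokenizer());
        }

        /** Accepts any other StringTokenizer, e.g. a wrapper around an external tool. */
        SimpleXMLTokenizer(StringTokenizer stringTokenizer) {
            this.stringTokenizer = stringTokenizer;
        }

        List<String> tokenize(String xml) {
            List<String> tokens = new ArrayList<>();
            try {
                XMLInputFactory factory = XMLInputFactory.newInstance();
                // Merge adjacent text events so a word is never cut in two.
                factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
                XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.CHARACTERS) {
                        tokens.addAll(stringTokenizer.tokenize(reader.getText()));
                    }
                }
                reader.close();
            } catch (XMLStreamException e) {
                throw new IllegalArgumentException("Malformed XML input", e);
            }
            return tokens;
        }
    }

    public static void main(String[] args) {
        String xml = "<text><p>Hello, world!</p></text>";

        // Default behaviour: SimpleStringTokenizer does the string work.
        System.out.println(new SimpleXMLTokenizer().tokenize(xml));

        // Swapping in another tokenizer (here a plain whitespace splitter).
        StringTokenizer whitespaceOnly = s -> s.trim().isEmpty()
                ? Collections.<String>emptyList()
                : Arrays.asList(s.trim().split("\\s+"));
        System.out.println(new SimpleXMLTokenizer(whitespaceOnly).tokenize(xml));
    }
}
```

With this shape, a UDPipe-backed tokenizer would only need to implement the (assumed) StringTokenizer interface; the XML handling in SimpleXMLTokenizer would stay unchanged.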


Related issues

related to Feature #3049: UDPipe annotation engine, tokenizer New 04/09/2021

History

#1 Updated by Matthieu Decorde about 1 month ago

  • Subject changed from Tokenizer, separate XML parsing from the String tokenization process to Tokenizer, separate the XML parsing and the String tokenization processes
