Feature #3050

UDPipe annotation engine, tokenizer&sentencer, better integration

Added by Matthieu Decorde about 2 years ago.

Status: New
Start date: 04/09/2021
Priority: Normal
Due date:
Assignee: -
% Done:
Category: Annotation
Spent time: -
Target version: TXM - Eltec 1.0


The UDPipe tokenizer & sentencer processes (and the TXM default tokenization process too) can't work properly if the text to tokenize is too heavily segmented.

For example, the following XML won't be processed correctly:

Some Text to tokenize <hi> and </hi> to sentence.

The processes will receive 2 text segments instead of one.
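
To illustrate the segmentation problem, here is a minimal Python sketch. It is not TXM or UDPipe code: it uses the standard library's `xml.etree.ElementTree` as a stand-in parser, wraps the fragment in a hypothetical `<p>` root so that it parses, and the helper names are made up for the example. The exact number of segments depends on how a given parser reports text nodes. The sketch also shows one possible workaround, which is an assumption and not a decided fix: join the segments into one string before tokenizing and keep character offsets so results can be mapped back to the original segments.

```python
import xml.etree.ElementTree as ET

# The fragment from the issue, wrapped in a hypothetical <p> root so it parses.
XML = "<p>Some Text to tokenize <hi> and </hi> to sentence.</p>"


def text_segments(xml_string):
    """Return the text chunks separated by element boundaries, in document order."""
    root = ET.fromstring(xml_string)
    return list(root.itertext())


def merged_text_with_offsets(xml_string):
    """Join the chunks into one string and record each chunk's (start, end) span."""
    segments = text_segments(xml_string)
    offsets = []
    position = 0
    for segment in segments:
        offsets.append((position, position + len(segment)))
        position += len(segment)
    return "".join(segments), offsets


segments = text_segments(XML)
full_text, offsets = merged_text_with_offsets(XML)

print(segments)   # several chunks: a tokenizer fed one chunk at a time lacks context
print(full_text)  # the single string a tokenizer & sentencer would ideally receive
print(offsets)    # spans that map positions in full_text back to each chunk
```

A tokenizer or sentence splitter run on `full_text` sees the whole sentence, and the `offsets` list makes it possible to decide afterwards which original segment each token came from.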
