Feature #3050

UDPipe annotation engine, tokenizer&sentencer, better integration

Added by Matthieu Decorde about 1 month ago.

Status: New
Start date: 04/09/2021
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Annotation
Spent time: -
Target version: TXM - Eltec 1.0

Description

The UDPipe tokenizer and sentence splitter (and the TXM default tokenization process too) can't work properly if the text to tokenize is split into too many segments.

For example, the following XML won't be processed correctly:

<text>
Some Text to tokenize <hi> and </hi> to sentence.
</text>

The processes will receive the sentence as separate text segments, cut at the <hi> element boundaries, instead of as one continuous text.
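To illustrate the segmentation, here is a minimal Python sketch using only the standard library. It is not TXM or UDPipe code: the helper names `text_segments` and `merged_text` are hypothetical, and joining the text nodes before tokenization is only one possible approach, shown under the assumption that the inline markup can be set aside during tokenization and re-aligned afterwards.

```python
import xml.etree.ElementTree as ET

# The sample from the description above.
SAMPLE = """<text>
Some Text to tokenize <hi> and </hi> to sentence.
</text>"""


def text_segments(xml_string):
    """Return the non-blank text nodes in document order.

    This mimics what a process sees when it is handed each text
    node separately: the inline <hi> element cuts the sentence.
    """
    root = ET.fromstring(xml_string)
    return [chunk for chunk in root.itertext() if chunk.strip()]


def merged_text(xml_string):
    """Join all text nodes into one string with normalized whitespace.

    Hypothetical workaround: give the tokenizer the whole sentence
    at once instead of one call per text node.
    """
    root = ET.fromstring(xml_string)
    return " ".join("".join(root.itertext()).split())


segments = text_segments(SAMPLE)  # pieces cut at the <hi> boundaries
merged = merged_text(SAMPLE)      # one continuous string

if __name__ == "__main__":
    for piece in segments:
        print(repr(piece))
    print(repr(merged))
```

A sentence splitter run on each piece separately cannot know that the pieces belong to the same sentence, whereas the merged string keeps the sentence intact. Mapping the resulting tokens back onto the original markup is left out of this sketch.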
