Feature #3050
UDPipe annotation engine, tokenizer&sentencer, better integration
Status: | New | Start date: | 09/04/2021
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 0%
Category: | Annotation | Spent time: | -
Target version: | TXM - Eltec 1.0 | |
Description
The UDPipe tokenizer and sentence splitter (as well as TXM's default tokenization process) cannot work properly if the text to tokenize is split into too many segments by inline markup.
For example, the following XML won't be processed correctly:
<text> Some Text to tokenize <hi> and </hi> to sentence. </text>
The processes will receive 2 text segments instead of one.
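The fragmentation can be reproduced outside TXM with a minimal Python sketch. It uses only the standard-library `xml.etree.ElementTree` parser, not TXM or UDPipe code, and the helper names `text_fragments` and `joined_text` are hypothetical. It shows that the inline `<hi>` element splits the character data into several fragments, and that concatenating them before tokenization would restore the single sentence. A real integration would presumably also need to map the tokens back onto the original markup, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# The sample from this ticket.
SAMPLE = "<text> Some Text to tokenize <hi> and </hi> to sentence. </text>"


def text_fragments(xml_string):
    """Return the character-data fragments of the document, in document order.

    Hypothetical helper: this is how a consumer that handles each text node
    separately would see the content.
    """
    root = ET.fromstring(xml_string)
    return list(root.itertext())


def joined_text(xml_string):
    """Concatenate all fragments and collapse runs of whitespace.

    Hypothetical helper: the result is one string that a tokenizer or
    sentence splitter could process as a whole.
    """
    return " ".join("".join(text_fragments(xml_string)).split())


fragments = text_fragments(SAMPLE)
merged = joined_text(SAMPLE)

for fragment in fragments:
    print(repr(fragment))  # each piece a per-node consumer would receive
print(repr(merged))        # the single sentence the tokenizer should see
```

Processing each element of `fragments` on its own means the sentence splitter never sees the whole sentence, whereas `merged` holds it in one piece.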