Feature #3050

UDPipe annotation engine, tokenizer&sentencer, better integration

Added by Matthieu Decorde over 4 years ago.

Status: New
Start date: 09/04/2021
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Annotation
Spent time: -
Target version: TXM - Eltec 1.0

Description

The UDPipe tokenizer & sentencer processes (and TXM's default tokenization process too) can't work properly if the text to tokenize is split into too many segments.

For example, the following XML won't be processed correctly:

<text>
Some Text to tokenize <hi> and </hi> to sentence.
</text>

The processes will receive the sentence as separate text segments, split at the <hi> element, instead of one continuous segment.
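
To illustrate the problem, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is not TXM or UDPipe code: the function names text_segments and merged_text are hypothetical, and the exact number of segments a real importer produces depends on how it reads the XML. With this parser the example above yields three chunks, and joining them before tokenization restores the full sentence.

```python
import xml.etree.ElementTree as ET

# The example from this ticket.
XML = "<text>\nSome Text to tokenize <hi> and </hi> to sentence.\n</text>"


def text_segments(xml_string):
    """Return the text chunks in document order (hypothetical helper).

    Each inline element boundary starts a new chunk, which is roughly
    what a tokenizer fed chunk by chunk would receive.
    """
    root = ET.fromstring(xml_string)
    return list(root.itertext())


def merged_text(xml_string):
    """Join all chunks and normalize whitespace (hypothetical helper).

    This is the single string a tokenizer or sentence splitter would
    need in order to see the sentence as a whole.
    """
    root = ET.fromstring(xml_string)
    return " ".join("".join(root.itertext()).split())


segments = text_segments(XML)
print(len(segments), segments)
print(merged_text(XML))
```

A real integration would also have to map the resulting tokens back onto the original inline elements, which this sketch does not attempt.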
