Feature #3050: UDPipe annotation engine, tokenizer&sentencer, better integration - Plateforme TXM - Forge du Centre Blaise Pascal

Feature #3050

UDPipe annotation engine, tokenizer&sentencer, better integration

Added by Matthieu Decorde more than 4 years ago.

Status: New
Start date: 09/04/2021
Priority: Normal
Due date:
Assignee:
% Done:
Category: Annotation
Spent time:
Target version: TXM - Eltec 1.0

Description

The UDPipe tokenizer and sentence splitter (and the default TXM tokenization process as well) cannot work properly if the text to tokenize is too heavily segmented.

For example, the following XML won't be processed correctly:

<text>
Some Text to tokenize <hi> and </hi> to sentence.
</text>

The processes will receive 2 text segments instead of one.
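
To illustrate the segmentation problem outside TXM, here is a minimal Python sketch using the standard `xml.etree.ElementTree` module. It is not TXM or UDPipe code: the helper names `text_segments` and `merged_text` are hypothetical, and the exact segments TXM passes to its processes may differ from what this parser reports. The sketch shows that inline markup such as `<hi>` splits the sentence into separate text nodes, and that one possible workaround is to concatenate the nodes into a single string while recording each node's character offsets, so that tokens and sentence boundaries found on the full text could later be mapped back onto the XML.

```python
import xml.etree.ElementTree as ET

# The sample from this ticket.
XML = """<text>
Some Text to tokenize <hi> and </hi> to sentence.
</text>"""


def text_segments(xml_string):
    """Return the text nodes of the document, in document order."""
    root = ET.fromstring(xml_string)
    return list(root.itertext())


def merged_text(xml_string):
    """Concatenate all text nodes and record each node's (start, end) offsets.

    The offsets index into the concatenated string, so results computed on
    the full text can be traced back to the node they came from.
    """
    segments = text_segments(xml_string)
    offsets = []
    position = 0
    for segment in segments:
        offsets.append((position, position + len(segment)))
        position += len(segment)
    return "".join(segments), offsets


segments = text_segments(XML)
full_text, offsets = merged_text(XML)

# Each text node on its own is only a sentence fragment.
for segment, (start, end) in zip(segments, offsets):
    print(start, end, repr(segment))

# The concatenated text is what a tokenizer would ideally receive in one call.
print(repr(full_text))
```

Feeding `full_text` to the tokenizer in a single call, instead of one call per text node, would give the sentence splitter the whole sentence; whether this fits the TXM import pipeline is an open design question for this ticket.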
