Bug #3389: import, impossible to tokenize words written with point (.) characters inside - Plateforme TXM - Forge du Centre Blaise Pascal

Bug #3389

Updated by Matthieu Decorde more than 2 years ago

Some transcription conventions use point characters inside words, as in the following TXT input, where words are separated by spaces:

<pre>
ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘
</pre>

A) No combination of XTZ or TXT import module parameter values tokenizes words containing points correctly, even after removing the punctuation regex and removing the point from the sentence segmentation parameters.
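For reference, the expected result can be sketched outside TXM. The snippet below is only an illustration in plain Python, not TXM's tokenizer: it splits the sample line on whitespace alone, which is the behaviour the import parameters are expected to allow.

```python
# Illustrative only: plain whitespace tokenization, not TXM's tokenizer.
line = "ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘"

# str.split() with no argument splits on runs of whitespace only,
# so points and hyphens stay inside the word forms.
tokens = line.split()

for token in tokens:
    print(token)
```

With this input the expected segmentation has 9 word forms, and forms such as "nb.tỉ" and "‘nḫ-mst.pl" remain single tokens.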

Since an import.xml file with example parameters cannot be provided, here is a screenshot of the parameter settings: import-txt-words-no-point.png

Here is the index for the CQL query ".*\..*": import-txt-words-no-point-words-with-points.png
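As a rough check of what this pattern selects, the same regular expression can be applied to a list of word forms in Python. The list below is hypothetical, and the sketch assumes the pattern has to match the whole word form, which is what `re.fullmatch` does here; it does not reproduce the TXM index itself.

```python
import re

# Hypothetical word forms; the last one is an isolated point, as a
# tokenizer that splits on points would produce.
forms = ["ḫr", "nb.tỉ", "‘nḫ-mst.pl", "nswt-bỉtỉ", "."]

# Same pattern as in the query: any characters, a literal point, any characters.
pattern = re.compile(r".*\..*")

# Keep only the forms that match the pattern as a whole.
with_point = [form for form in forms if pattern.fullmatch(form)]

print(with_point)
```

Note that an isolated "." also matches the pattern, so a hit consisting of a bare point in the index does not by itself show that a word with an internal point was kept whole.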

B) Points are always rendered in editions according to the default point formatting rules of the current language.

MD: when words are correctly tokenized, the rendering of points (in Edition and Concordance) is OK.

See edition screenshot: import-txt-words-no-point-edition.png
