Bug #3389
import, impossible to tokenize words written with point (.) characters inside
| Status: | New | Start date: | 05/15/2023 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 90% |
| Category: | Import | Spent time: | - |
| Target version: | TXM 0.8.3 | | |
Description
Given transcription principles that use point characters inside words, for example the following TXT input, where words are separated by spaces:
ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘
A) It is not possible to find XTZ or TXT import module parameter values that correctly tokenize words with points inside, even when removing the punctuation regex and the point from the sentence segmentation parameters (see the tokenization sketch after the screenshots below).
Since an import.xml file cannot be provided to document the example parameters, here is a screenshot of the parameter settings: import-txt-words-no-point.png
Here is the index of the ".*\..*" CQL query: import-txt-words-no-point-words-with-points.png
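To make the expected behaviour concrete, here is a minimal sketch using plain Python regexes (not TXM's actual tokenizer; the regexes are purely illustrative). It contrasts treating "." as punctuation with keeping it inside the word, on the sample line from the description:

```python
import re

# Sample line from the bug description (words separated by spaces).
sample = "ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘"

# Punctuation-like treatment: "." splits the surrounding word.
split_on_point = re.findall(r"[^\s.]+|\.", sample)
# -> ..., 'nḫ-mst', '.', 'pl', ..., 'nb', '.', 'tỉ', ...

# Expected treatment for this transcription convention: tokens are the
# space-separated units, the point stays inside the word.
keep_point = re.findall(r"\S+", sample)
# -> ..., 'nḫ-mst.pl', ..., 'nb.tỉ', ...

print(split_on_point)
print(keep_point)
```

The second result is what the import parameters should produce for this corpus, and what the ".*\..*" CQL index above is meant to check.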
B) Points are always rendered in editions according to the default point formatting rules of the current language.
MD: when words are correctly tokenized, the rendering of points (in Edition and Concordance) is OK.
See edition screenshot: import-txt-words-no-point-edition.png
History
#1 Updated by Matthieu Decorde 4 months ago
- % Done changed from 0 to 80
The import parameters were re-initialized.
#2 Updated by Matthieu Decorde 4 months ago
- Description updated (diff)
#3 Updated by Matthieu Decorde 4 months ago
Index result when removing "." from the tokenizer import parameters
| word | Fréquence |
|---|---|
| ‘ | 3 |
| nḫ-mst.pl | 2 |
| ḥm | 1 |
| Ḫpr-kȝ-R | 1 |
| ḫr | 1 |
| Ḥrw | 1 |
| nb.tỉ | 1 |
| nỉ | 1 |
| nswt-bỉtỉ | 1 |
Edition rendering:
ḫr ḥm nỉ Ḥrw ‘ nḫ-mst.pl nb.tỉ ‘ nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R ‘
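A plausible reading of this result (an assumption, not verified against TXM's code): "." is no longer treated as a separator here, but the "‘" character still is, so it comes out as a standalone token. A minimal Python sketch reproducing the same counts as the index above:

```python
import re
from collections import Counter

# Same sample line as in the description.
sample = "ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘"

# Hypothetical tokenization: "‘" is still a separator and becomes its own
# token, while "." stays inside words.
tokens = re.findall(r"‘|[^\s‘]+", sample)
print(Counter(tokens))
# Counter({'‘': 3, 'nḫ-mst.pl': 2, 'ḫr': 1, 'ḥm': 1, 'nỉ': 1, 'Ḥrw': 1,
#          'nb.tỉ': 1, 'nswt-bỉtỉ': 1, 'Ḫpr-kȝ-R': 1})
```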
#4 Updated by Serge Heiden 4 months ago
- % Done changed from 80 to 90
Correct usage of the token parameters (the 'accented characters' and 'sentence end characters' parameters) has been verified on the test text sample.