Bug #3389

import, impossible to tokenize words written with point (.) characters inside

Added by Serge Heiden 4 months ago. Updated 4 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Import
Target version: TXM 0.8.3
Start date: 05/15/2023
Due date: -
% Done: 90%
Spent time: -

Description

Given transcription conventions that use point (.) characters inside words, consider for example the following TXT input, where words are separated by spaces:

ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘
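
For reference, the expected result here is a plain whitespace split that leaves word-internal points intact. A minimal Java sketch of that target behavior (illustrative only; this is not the TXM tokenizer):

    import java.util.Arrays;
    import java.util.List;

    public class WhitespaceTokenize {
        public static void main(String[] args) {
            String line = "ḫr ḥm nỉ Ḥrw ‘nḫ-mst.pl nb.tỉ ‘nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R‘";
            // Desired behavior: token boundaries on whitespace only,
            // so "." stays inside tokens such as "nb.tỉ"
            List<String> tokens = Arrays.asList(line.split("\\s+"));
            tokens.forEach(System.out::println);
        }
    }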

A) It is not possible to find XTZ or TXT import module parameter values that correctly tokenize words containing points.

This holds even when the punctuation regex and the point are removed from the sentence segmentation parameters.

Since it is not possible to provide an import.xml file for the example parameters, here is a screenshot of the parameter settings: import-txt-words-no-point.png

Here is the index of the ".*\..*" CQL query: import-txt-words-no-point-words-with-points.png
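
For illustration, the splitting visible in that index is what any tokenizer that classifies "." as punctuation produces. A minimal Java sketch of the assumed failure mode (the boundary regex is hypothetical, not the actual TXM parameter value):

    import java.util.Arrays;

    public class PointSplit {
        public static void main(String[] args) {
            // Assumed behavior: a punctuation class containing "."
            // inserts a token boundary before and after each point
            String[] parts = "nb.tỉ".split("(?<=\\.)|(?=\\.)");
            System.out.println(Arrays.toString(parts)); // [nb, ., tỉ]
        }
    }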

B) Points are always rendered in editions according to the default point formatting rules of the current language.

MD: when tokenization is correct, the rendering of points (in Edition and Concordance) is OK.

See edition screenshot: import-txt-words-no-point-edition.png

import-txt-words-no-point.png (13.1 kB) Serge Heiden, 05/15/2023 12:06 pm

import-txt-words-no-point-words-with-points.png (20.7 kB) Serge Heiden, 05/15/2023 12:06 pm

import-txt-words-no-point-edition.png (13.9 kB) Serge Heiden, 05/15/2023 12:06 pm

History

#1 Updated by Matthieu Decorde 4 months ago

  • % Done changed from 0 to 80

The import parameters were re-initialized.

#2 Updated by Matthieu Decorde 4 months ago

  • Description updated (diff)

#3 Updated by Matthieu Decorde 4 months ago

Index result after removing "." from the tokenizer import parameters:

word    Frequency
‘    3
nḫ-mst.pl    2
ḥm    1
Ḫpr-kȝ-R    1
ḫr    1
Ḥrw    1
nb.tỉ    1
nỉ    1
nswt-bỉtỉ    1

Edition rendering:

ḫr ḥm nỉ Ḥrw ‘ nḫ-mst.pl nb.tỉ ‘ nḫ-mst.pl nswt-bỉtỉ Ḫpr-kȝ-R ‘
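
The remaining splits happen on ‘ (U+2018), used here for the ayin consonant but classified by Unicode as punctuation, which presumably keeps it in the tokenizer's punctuation class. A small Java check of that classification (an assumption about the cause, not a confirmed diagnosis):

    public class AyinClass {
        public static void main(String[] args) {
            char ayin = '\u2018'; // ‘, used for ayin in the transcription
            // U+2018 belongs to the Unicode general category Pi
            // (initial quote punctuation), so a \p{P}-style punctuation
            // class still splits it off even after "." is removed
            System.out.println(Character.getType(ayin)
                    == Character.INITIAL_QUOTE_PUNCTUATION); // true
            System.out.println("‘".matches("\\p{P}"));       // true
        }
    }

If that is confirmed, treating ‘ as a word character (or removing it from the punctuation class, as was done for ".") should keep ‘nḫ-mst.pl whole.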

#4 Updated by Serge Heiden 4 months ago

  • % Done changed from 80 to 90

Correct usage of the token parameters (the 'accented characters' and 'sentence end characters' settings) has been verified on the test text sample.
