Bug #2821

Import, broken generated word id

Added by Matthieu Decorde 5 months ago. Updated 4 months ago.

Status: New    Start date: 05/15/2020
Priority: Normal    Due date:
Assignee: -    % Done: 80%
Category: Import    Spent time: -
Target version: TXM 0.8.1

Description

The generated word ids are missing their "_" separators.

Solution

Fix the buildId method in AsciiUtils.convertnonascii() (caused by: #2709).

Replace the Transliterator rules with:

"Any-Latin; NFD; [^\\p{Alnum}\\p{p}] Remove" 

so that punctuation is not removed at this stage (later AsciiUtils methods take care of it).
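
To illustrate why keeping punctuation in the rule matters, here is a minimal sketch using only the JDK. It is not the TXM code and does not use the ICU Transliterator: the class and method names are hypothetical, the "Any-Latin" step is left out, and java.util.regex treats \p{Alnum} and \p{Punct} as ASCII-only by default, which is narrower than the ICU classes. The stricter variant is an assumption used only to show how the "_" could get lost.

```java
import java.text.Normalizer;

final class WordIdCleanupSketch {

    // JDK-only approximation of "NFD; [^\p{Alnum}\p{P}] Remove":
    // decompose accented letters, then drop everything that is neither
    // an ASCII letter/digit nor ASCII punctuation (so "_" survives).
    static String keepAlnumAndPunct(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{Alnum}\\p{Punct}]", "");
    }

    // Hypothetical stricter variant that also drops punctuation,
    // which removes the "_" separators from an id.
    static String keepAlnumOnly(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        String id = "w_d\u00e9but_0"; // "w_début_0"
        System.out.println(keepAlnumAndPunct(id)); // w_debut_0
        System.out.println(keepAlnumOnly(id));     // wdebut0
    }
}
```

With the punctuation class kept in the negated set, the combining accent is removed but the underscores stay, so later steps can still decide what to do with them.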

Associated revisions

Revision 2857
Added by Matthieu Decorde 5 months ago

fix word id generation refs #2821

History

#1 Updated by Matthieu Decorde 5 months ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde 5 months ago

  • Description updated (diff)

#3 Updated by Matthieu Decorde 5 months ago

  • % Done changed from 0 to 80

To be tested in the next setup/update.

#4 Updated by Alexey Lavrentev 4 months ago

The test works as formulated in https://groupes.renater.fr/wiki/txm-users/public/retours_de_bugs_logiciel/txm_0.8.1beta#retours.

However, several problems persist:
  • letters in existing ids are converted to lower case;
  • no check for duplicate ids is run. To test:
    1. Create a file named t1.xml and paste in the following content:
      <text id="T1">
          <w id="w_recup_0">début</w>
          du texte.
          <w id="11">les</w> 
          <w id="w_t1_2">mots</w>
          <w id="w_T1_3">suivant</w>
          <w id="4">du</w>
          <w id="5">textes</w>
          <w id="7">.</w>
          <w id="w_recup_8">fin</w>
          <w id="w_recup_9">.</w>
      </text>
      
    2. Import it with the XML/W+CSV import module
    3. Build a lexicon of the "id" word property
    4. You will get:
      w_t1_2    2
      w_t1_3    2
      w_11    1
      w_4    1
      w_5    1
      w_7    1
      w_recup_0    1
      w_recup_8    1
      w_recup_9    1
      w_sans_titre1_4    1
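
The two problems above interact: lower-casing existing ids can turn distinct ids into duplicates. A minimal sketch of a duplicate check follows; the class and method names are hypothetical and this is not how TXM handles ids, only an illustration of the kind of check that is missing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class DuplicateIdCheckSketch {

    // Returns every id that occurs more than once, with its count,
    // in order of first appearance.
    static Map<String, Integer> duplicates(List<String> ids) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String id : ids) {
            counts.merge(id, 1, Integer::sum);
        }
        counts.values().removeIf(c -> c < 2); // keep only repeated ids
        return counts;
    }

    // Mimics the reported lower-casing of existing ids.
    static List<String> lowerCased(List<String> ids) {
        List<String> out = new ArrayList<>();
        for (String id : ids) {
            out.add(id.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("w_t1_3", "w_T1_3", "w_recup_0");
        System.out.println(duplicates(ids));             // {}
        System.out.println(duplicates(lowerCased(ids))); // {w_t1_3=2}
    }
}
```

Running such a check after id normalization would flag collisions like w_t1_3 before they reach the "id" word property.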
      

#5 Updated by Matthieu Decorde 4 months ago

  • % Done changed from 80 to 60

Need to fix the handling of existing ids that mix lower-case and upper-case characters.

#6 Updated by Matthieu Decorde 4 months ago

  • % Done changed from 60 to 80

See r2904.
