Bug #2373

RCP: 0.7.9, XTZ + CSV import module: error in page indexing if source file name contains the "_" character

Added by Alexey Lavrentev over 1 year ago. Updated 5 months ago.

Status: New
Start date: 04/30/2018
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: -
Spent time: -
Target version: TXM 0.8.1

Description

If the corpus source directory contains files whose names differ only by an underscore-separated suffix, the page index written by the import mixes the pages of those files into a single text's edition, so it contains duplicate entries.

Example

Source files:
  • mytext.xml
  • mytext_a.xml

import.xml in the binary corpus:

            <text name="mytext">
               <source file="/home/user/TXM/corpora/MYCORPUS/txm/MYCORPUS/mytext.xml"
                       type=".xml"/>
               <editions>
                  <edition index="/home/user/TXM/corpora/MYCORPUS/HTML/MYCORPUS/default"
                           mode="xsl"
                           name="default"
                           script="1-default-html.xsl"
                           type="html">
                     <page id="1" wordid="w_0"/>
                     <page id="a_1" wordid="w_0"/>
                     <page id="a_2" wordid="mytext_a_1"/>
                     <page id="2" wordid="mytext_1"/>
                     <page id="3" wordid="mytext_137"/>
                     <page id="a_3" wordid="mytext_a_18"/>
                  </edition>
               </editions>
            </text>

Solution

  • Correct the regexp pattern used to find the pages to index
  • Use a more robust mechanism for page indexing
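
A minimal sketch of the first option, in Python rather than in TXM's own code. It assumes, without confirmation from the TXM sources, that edition pages are stored as files named `<text name>_<page number>.html` and that the current lookup accepts anything after `<text name>_` as a page id. The file list and both function names are hypothetical; the sketch only shows why a loose pattern would pick up the pages of mytext_a for mytext, and how an anchored, numeric-only pattern would avoid it.

```python
import re


def page_ids_loose(text_name, file_names):
    """Loose match: whatever follows '<text_name>_' is taken as the page id."""
    pattern = re.compile(re.escape(text_name) + r"_(.+)\.html$")
    return [m.group(1) for f in file_names if (m := pattern.match(f))]


def page_ids_strict(text_name, file_names):
    """Anchored match: the page id must consist of digits only."""
    pattern = re.compile(re.escape(text_name) + r"_(\d+)\.html")
    return [m.group(1) for f in file_names if (m := pattern.fullmatch(f))]


# Hypothetical page files produced for mytext.xml and mytext_a.xml.
files = [
    "mytext_1.html", "mytext_2.html", "mytext_3.html",
    "mytext_a_1.html", "mytext_a_2.html", "mytext_a_3.html",
]

loose = page_ids_loose("mytext", files)          # also contains a_1, a_2, a_3
strict = page_ids_strict("mytext", files)        # numeric ids only
strict_a = page_ids_strict("mytext_a", files)    # pages of mytext_a only
```

With the loose pattern, the ids found for mytext include a_1, a_2 and a_3, which matches the page ids shown in the import.xml excerpt above. The strict pattern keeps the two texts apart, but only as long as page ids stay purely numeric.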

Temporary workaround

  • Document the restriction on file names
  • Patch the binary corpus with an XSLT stylesheet
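
Until the restriction is documented, a source directory can be checked before import. The sketch below is a hypothetical helper, not part of TXM, and it assumes the restriction can be stated as "no source file name may be another source file name followed by an underscore suffix". It lists the pairs of file names that would collide under that assumption.

```python
from pathlib import PurePath


def conflicting_names(file_names):
    """Return (base, extended) pairs where extended starts with base + '_'."""
    # Compare names without their extension, e.g. 'mytext' and 'mytext_a'.
    stems = sorted({PurePath(f).stem for f in file_names})
    return [(a, b) for a in stems for b in stems if b.startswith(a + "_")]


conflicts = conflicting_names(["mytext.xml", "mytext_a.xml", "other.xml"])
```

For the example above, `conflicts` contains the single pair `("mytext", "mytext_a")`; an empty list means no file name is affected under the assumed rule.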

History

#1 Updated by Alexey Lavrentev over 1 year ago

  • Target version changed from Portal 0.8 to TXM 0.8.0a (split/restructuration)

#2 Updated by Sebastien Jacquot about 1 year ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#3 Updated by Matthieu Decorde 5 months ago

  • Target version changed from TXM 0.8.0 to TXM 0.8.1
