Bug #2373
RCP: 0.7.9, XTZ + CSV import module: error in page indexing if source file name contains the "_" character
Status: | New | Start date: | 04/30/2018
---|---|---|---
Priority: | Normal | Due date: |
Assignee: | - | % Done: | 80%
Category: | Import | Spent time: | -
Target version: | TXM 0.8.2 | |
Description
If the corpus source directory contains files whose names differ only by an underscore-separated suffix, the page index built by the import contains duplicates.
Example
Source files:
- mytext.xml
- mytext_a.xml
import.xml in the binary corpus:
<text name="mytext">
  <source file="/home/user/TXM/corpora/MYCORPUS/txm/MYCORPUS/mytext.xml" type=".xml"/>
  <editions>
    <edition index="/home/user/TXM/corpora/MYCORPUS/HTML/MYCORPUS/default" mode="xsl" name="default" script="1-default-html.xsl" type="html">
      <page id="1" wordid="w_0"/>
      <page id="a_1" wordid="w_0"/>
      <page id="a_2" wordid="mytext_a_1"/>
      <page id="2" wordid="mytext_1"/>
      <page id="3" wordid="mytext_137"/>
      <page id="a_3" wordid="mytext_a_18"/>
    </edition>
  </editions>
</text>
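The symptom can be checked mechanically. The following is a minimal sketch, not part of TXM: it assumes that a healthy page index uses plain integers as page ids, and the helper names `foreign_pages` and `duplicated_word_ids` are hypothetical. The XML fragment is a shortened copy of the excerpt above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the import.xml excerpt (paths and attributes omitted).
IMPORT_XML = """
<text name="mytext">
  <editions>
    <edition name="default">
      <page id="1" wordid="w_0"/>
      <page id="a_1" wordid="w_0"/>
      <page id="a_2" wordid="mytext_a_1"/>
      <page id="2" wordid="mytext_1"/>
      <page id="3" wordid="mytext_137"/>
      <page id="a_3" wordid="mytext_a_18"/>
    </edition>
  </editions>
</text>
"""


def foreign_pages(text_element):
    """Return page ids that are not plain page numbers (assumed to be misindexed)."""
    return [
        page.get("id")
        for page in text_element.iter("page")
        if not page.get("id").isdigit()
    ]


def duplicated_word_ids(text_element):
    """Return word ids that are recorded as the start of more than one page."""
    seen, dupes = set(), []
    for page in text_element.iter("page"):
        word_id = page.get("wordid")
        if word_id in seen and word_id not in dupes:
            dupes.append(word_id)
        seen.add(word_id)
    return dupes


text = ET.fromstring(IMPORT_XML)
print(foreign_pages(text))        # ['a_1', 'a_2', 'a_3']
print(duplicated_word_ids(text))  # ['w_0']
```

In this example the three `a_*` pages appear to belong to mytext_a.xml but are listed under the text `mytext`.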
Solution
- Correct the regexp pattern used when searching for pages to index
- Use a more robust mechanism for page indexing
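The first point can be illustrated with a sketch. The actual pattern used by the import module is not shown in this ticket, so everything below is an assumption: the page file naming scheme (`<text>_<page>.html`), the loose pattern, and the function names are hypothetical and only show how a prefix-based match could pick up the pages of `mytext_a` when indexing `mytext`.

```python
import re

# Hypothetical page files produced for the two texts of the example.
PAGE_FILES = [
    "mytext_1.html", "mytext_2.html", "mytext_3.html",
    "mytext_a_1.html", "mytext_a_2.html", "mytext_a_3.html",
]


def pages_loose(text_name, files):
    """Accept anything after '<text>_': also matches pages of 'mytext_a'."""
    pattern = re.compile(re.escape(text_name) + r"_(.+)\.html")
    found = []
    for name in files:
        match = pattern.fullmatch(name)
        if match:
            found.append(match.group(1))
    return found


def pages_strict(text_name, files):
    """Accept only a numeric page number after '<text>_'."""
    pattern = re.compile(re.escape(text_name) + r"_(\d+)\.html")
    found = []
    for name in files:
        match = pattern.fullmatch(name)
        if match:
            found.append(match.group(1))
    return found


print(pages_loose("mytext", PAGE_FILES))     # ['1', '2', '3', 'a_1', 'a_2', 'a_3']
print(pages_strict("mytext", PAGE_FILES))    # ['1', '2', '3']
print(pages_strict("mytext_a", PAGE_FILES))  # ['1', '2', '3']
```

The loose variant yields the same `a_1`, `a_2`, `a_3` ids seen in the example, which is consistent with the reported symptom but does not prove that this is the pattern at fault.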
Temporary workaround
- Document the restriction on file names
- Patch the binary corpus with an XSLT stylesheet
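Until the fix is released, source directories can be screened for the file names that trigger the problem. This is a minimal sketch under the assumption, taken from the description, that the conflict arises when one file's base name equals another's followed by `_` and a suffix. The function `conflicting_names` is hypothetical and not part of TXM.

```python
from pathlib import PurePath


def conflicting_names(file_names):
    """Return (shorter, longer) base-name pairs where longer == shorter + '_' + suffix."""
    stems = sorted({PurePath(name).stem for name in file_names})
    return [
        (short, long_)
        for short in stems
        for long_ in stems
        if long_.startswith(short + "_")
    ]


print(conflicting_names(["mytext.xml", "mytext_a.xml", "other.xml"]))
# [('mytext', 'mytext_a')]
```

An empty result means that no pair of files in the list matches this naming conflict.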
History
#1 Updated by Alexey Lavrentev about 5 years ago
- Target version changed from Portal 0.8 to TXM 0.8.0a (split/restructuration)
#2 Updated by Sebastien Jacquot almost 5 years ago
- Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
#3 Updated by Matthieu Decorde about 4 years ago
- Target version changed from TXM 0.8.0 to TXM 0.8.2
#4 Updated by Matthieu Decorde almost 3 years ago
- Category set to Import
#5 Updated by Matthieu Decorde almost 2 years ago
- % Done changed from 0 to 80
Fixed with the new TXMResult objects (CorpusBuild, Text, Edition).