Bug #2098

TBX: 0.7.8, XTZ import, <num> and <w> tags indexed even if they are located in an element declared in the 'out-of text-to-edit' plan

Added by Alexey Lavrentev over 2 years ago. Updated 9 months ago.

Status: New
Start date: 10/04/2016
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

To reproduce the bug, take strasbBfm.xml from the BFM repository, import it via XTZ with teiHeader declared as out-of-text-to-edit, and search for [word="[0-9]+"].

Expected behavior:

  1. The <num> element should not be transformed into <w>.
  2. No element placed inside an "out-of-text-to-edit" element should be indexed.

Currently, to implement the "out-of-text-to-edit" plan, the compiler and pager steps rely on the words (<w> elements) identified by the Tokenizer. As a result, if an element in the "out-of-text-to-edit" plan already contains word tags (<w> or <num>), these are indexed by the search engine.
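The effect can be illustrated with a small Python sketch. It is not TXM code: the sample document and the function name index_all_words are hypothetical, and the regular expression only stands in for the [word="[0-9]+"] query. It shows that an indexer which trusts every pre-existing word tag picks up material from teiHeader.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical miniature of a source file: the header already
# contains <num> and <w> tags, as in the reported case.
SAMPLE = """<TEI>
  <teiHeader><date><num>1842</num></date><w>header</w></teiHeader>
  <text><p><w>li</w> <w>rois</w></p></text>
</TEI>"""


def index_all_words(xml_text):
    """Naive indexing: collect every <w> and <num>, wherever it occurs."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter() if el.tag in ("w", "num")]


indexed = index_all_words(SAMPLE)
# Stand-in for the query [word="[0-9]+"]
digits = [form for form in indexed if re.fullmatch(r"[0-9]+", form)]
print(indexed)
print(digits)
```

The header number appears among the matches even though teiHeader is meant to be excluded from the index.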

Solution

The pager and compiler steps must use the "out-of-text-to-edit" plan import parameter instead of relying on the Tokenizer result.
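A minimal sketch of this idea in Python, assuming the import parameter is available as a plain list of element names. The function index_words, its out_of_text_to_edit argument and the sample document are hypothetical and do not reflect the actual TXM compiler or pager code.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a source file with word tags in the header.
SAMPLE = """<TEI>
  <teiHeader><date><num>1842</num></date><w>header</w></teiHeader>
  <text><p><w>li</w> <w>rois</w></p></text>
</TEI>"""


def index_words(xml_text, out_of_text_to_edit):
    """Collect <w> forms, skipping any subtree rooted at an excluded element."""
    excluded = set(out_of_text_to_edit)
    words = []

    def walk(element):
        if element.tag in excluded:
            return  # the whole subtree is out of text: index nothing below it
        if element.tag == "w":
            words.append(element.text)
        for child in element:
            walk(child)

    walk(ET.fromstring(xml_text))
    return words


print(index_words(SAMPLE, ["teiHeader"]))
print(index_words(SAMPLE, []))
```

Here the decision to index depends only on the declared element names, so word tags that already exist inside teiHeader no longer reach the index.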


Related issues

related to Bug #2220: TBX: 0.7.8, XTZ import, "out-of text-to-edit" elements to... New 06/16/2017

History

#1 Updated by Alexey Lavrentev over 2 years ago

  • File deleted (cleve-edition.png)

#2 Updated by Matthieu Decorde over 2 years ago

  • Subject changed from TBX: 0.7.8, XTZ import, <num> and <w> tags indexed in 'out-of text-to-edit' plan to TBX: 0.7.8, XTZ import, <num> and <w> tags indexed even if they are declared in the 'out-of text-to-edit' plan
  • Description updated (diff)
  • Priority changed from Normal to High

#3 Updated by Alexey Lavrentev over 2 years ago

  • Subject changed from TBX: 0.7.8, XTZ import, <num> and <w> tags indexed even if they are declared in the 'out-of text-to-edit' plan to TBX: 0.7.8, XTZ import, <num>, <w> and <author> tags indexed even if they are located in an element declared in the 'out-of text-to-edit' plan

Similar behavior is caused by the <note> element. The text nodes following </note> are tokenized and indexed even if they are inside an element declared as out-of-text-to-edit. See the related ticket.
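The ticket does not say why text after </note> escapes the exclusion. As an illustration only, the Python sketch below shows one way a streaming filter can keep such text excluded: it counts the depth of open elements inside an excluded region instead of toggling a flag, so closing a nested <note> does not end the exclusion. The function indexable_text and the sample are hypothetical, and whitespace splitting stands in for real tokenization.

```python
import xml.parsers.expat

# Hypothetical sample: text follows </note> inside an excluded element.
SAMPLE = (
    "<TEI><teiHeader>before <note>a note</note> after</teiHeader>"
    "<text>body words</text></TEI>"
)


def indexable_text(xml_text, excluded_tags):
    """Return whitespace-split tokens found outside every excluded element."""
    excluded = set(excluded_tags)
    depth = 0  # open elements inside an excluded region, the excluded one included
    chunks = []

    def start(name, attrs):
        nonlocal depth
        if depth or name in excluded:
            depth += 1

    def end(name):
        nonlocal depth
        if depth:
            depth -= 1
        if depth == 0:
            chunks.append(" ")  # element boundary acts as a separator here

    def chars(data):
        if depth == 0:
            chunks.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return "".join(chunks).split()


print(indexable_text(SAMPLE, ["teiHeader"]))
```

With teiHeader excluded, the words before and after the note stay out of the result along with the note itself.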

#4 Updated by Alexey Lavrentev over 2 years ago

  • Subject changed from TBX: 0.7.8, XTZ import, <num>, <w> and <author> tags indexed even if they are located in an element declared in the 'out-of text-to-edit' plan to TBX: 0.7.8, XTZ import, <num> and <w> tags indexed even if they are located in an element declared in the 'out-of text-to-edit' plan

#5 Updated by Sebastien Jacquot over 1 year ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#6 Updated by Matthieu Decorde 9 months ago

  • Target version changed from TXM 0.8.0 to TXM X.X
