Bug #2220

TBX: 0.7.8, XTZ import, "out-of text-to-edit" elements tokenised and indexed if an OTTO element contains sub-elements

Added by Alexey Lavrentev over 2 years ago. Updated 7 months ago.

Status: New
Start date: 06/16/2017
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

If two out-of-text-to-edit (OTTO) elements are declared and one may be nested inside the other in the document, tokenisation and indexing resume after the end of the nested element, even though the position is still inside the OTTO ancestor.

This happens when the nested elements are identical (note // note) or different (teiHeader // note), and also when an OTTO element contains any other element (head // sic or head // hi).

The probable cause is that tokenisation resumes at any end tag encountered inside an OTTO element, rather than only at the end tag that closes the outermost OTTO element.
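
The suspected cause can be illustrated with a small stand-alone sketch. It is not TXM code: the class names, the sample XML and the whitespace "tokeniser" are all hypothetical, and NaiveCollector only models the behaviour guessed at above (a single on/off flag reset by any end tag). DepthCollector shows one possible remedy, a counter that tracks how many elements are open inside the OTTO region.

```python
import xml.sax


class _Collector(xml.sax.ContentHandler):
    """Shared plumbing (hypothetical helper, not part of TXM)."""

    def __init__(self, otto_names):
        super().__init__()
        self.otto_names = set(otto_names)
        self.chunks = []

    def tokens(self):
        # Whitespace splitting stands in for the real tokeniser.
        return "".join(self.chunks).split()


class NaiveCollector(_Collector):
    """Models the suspected bug: any end tag switches tokenisation back on."""

    def __init__(self, otto_names):
        super().__init__(otto_names)
        self.skipping = False

    def startElement(self, name, attrs):
        if name in self.otto_names:
            self.skipping = True

    def endElement(self, name):
        self.skipping = False  # suspected flaw: ignores what is still open

    def characters(self, content):
        if not self.skipping:
            self.chunks.append(content)


class DepthCollector(_Collector):
    """Possible fix: count every element opened inside an OTTO region."""

    def __init__(self, otto_names):
        super().__init__(otto_names)
        self.depth = 0  # 0 means "outside every OTTO element"

    def startElement(self, name, attrs):
        if self.depth > 0 or name in self.otto_names:
            self.depth += 1

    def endElement(self, name):
        if self.depth > 0:
            self.depth -= 1

    def characters(self, content):
        if self.depth == 0:
            self.chunks.append(content)


def collect(handler_cls, xml_text, otto_names):
    """Parse xml_text and return the tokens kept by the given collector."""
    handler = handler_cls(otto_names)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens()


# Hypothetical sample: <head> is declared OTTO and contains <sic>.
SAMPLE = "<text><head>Title <sic>err</sic> rest </head><p>Hello world</p></text>"

if __name__ == "__main__":
    print(collect(NaiveCollector, SAMPLE, {"head"}))  # ['rest', 'Hello', 'world']
    print(collect(DepthCollector, SAMPLE, {"head"}))  # ['Hello', 'world']
```

With the flag-based collector, the text after </sic> leaks into the token list although it is still inside <head>. With the counter, tokenisation resumes only once the outermost OTTO element is closed.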


Related issues

related to Bug #2098: TBX: 0.7.8, XTZ import, <num> and <w> tags indexed even i... New 10/04/2016

History

#1 Updated by Alexey Lavrentev about 2 years ago

  • Subject changed from TBX: 0.7.8, XTZ import, "out-of text-to-edit" elements tokenised and indexed if nested to TBX: 0.7.8, XTZ import, "out-of text-to-edit" elements tokenised and indexed if nested (or if one OTTO element contains another)
  • Description updated (diff)

#2 Updated by Alexey Lavrentev over 1 year ago

  • Subject changed from TBX: 0.7.8, XTZ import, "out-of text-to-edit" elements tokenised and indexed if nested (or if one OTTO element contains another) to TBX: 0.7.8, XTZ import, "out-of text-to-edit" elements tokenised and indexed if an OTTO element contains sub-elements
  • Description updated (diff)
  • Priority changed from Normal to High

The bug persists in TXM 0.7.9.
To reproduce the bug, take the CHARTES_HAIN13 corpus sources from sharedocs/[...]/Cactus/Projets/Textométrie/Corpus/src and import them using the XTZ import module.
With <head> declared as OTTO, the content of <head> that follows <sic> is still tokenised (see w/@id="w_chartes_hain13_1").

#3 Updated by Sebastien Jacquot over 1 year ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#4 Updated by Sebastien Jacquot about 1 year ago

  • Category set to Import

#5 Updated by Alexey Lavrentev about 1 year ago

  • Description updated (diff)

#6 Updated by Matthieu Decorde 7 months ago

  • Target version changed from TXM 0.8.0 to TXM X.X
