Bug #2364

TBX: 0.7.9, build word IDs if not present in w tags for back-to-text when not tokenizing

Added by Serge Heiden about 1 year ago. Updated 3 months ago.

Status: New
Start date: 04/10/2018
Priority: Urgent
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM 0.8.1

Description

Currently, when the 'Tokenization' import option is unchecked, no word ID management is performed. As a result, back-to-text, URS Unit highlighting and other word ID related functionalities do not work with default text editions. This is a problem because word properties can be imported for various reasons, and doing so should not break the back-to-text functionality. The w@id attribute has a special status.

Solution

1) Decide on a w ID management policy (the decision can be exposed as a new import parameter or become a new TXM behavior):
  • a0) foreign IDs (coming from the sources) must be compatible with TXM w ID related functionalities, otherwise the import must abort (all IDs present, right pattern, etc.)
  • a1) foreign IDs can be mixed with TXM-built w IDs, in particular to support back-to-text: add IDs to the w elements that don't have one, and all w ID related functionalities, like back-to-text, must be able to use those IDs
  • or a2) don't mix foreign IDs with TXM-built IDs
    • a2.1) force w IDs to TXM-built IDs
    • a2.2.1) rename foreign IDs to 'txm:host-id' or 'txm-host-id', etc. and build TXM w IDs in the 'id' attribute
    • a2.2.2) build TXM w IDs under an attribute name specific to the corpus, and use that name instead of 'id' in all w ID related functionalities, like back-to-text
    • a2.2.3) use the 'txmid' word property name (and later 'txm:id') to force and use TXM private IDs even when foreign IDs are present and even when not tokenizing
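
To make policy a0 more concrete, here is a minimal Python sketch of a pre-import check that reports missing, malformed or duplicate w IDs. It is an illustration only, not TXM code: the function name check_foreign_ids and the regular expression are hypothetical, and the issue does not state which ID pattern TXM actually expects, so the pattern below is a placeholder assumption.

```python
import re
import xml.etree.ElementTree as ET

# ASSUMPTION: placeholder pattern; the issue does not specify the pattern TXM expects.
ID_PATTERN = re.compile(r"^w_[A-Za-z0-9.\-]+_\d+$")


def check_foreign_ids(xml_text, pattern=ID_PATTERN):
    """Return a list of problems found in w@id; an empty list means policy a0 is satisfied."""
    problems = []
    seen = set()
    position = 0
    for elem in ET.fromstring(xml_text).iter():
        # Match <w> whatever its namespace.
        if not isinstance(elem.tag, str) or elem.tag.rsplit("}", 1)[-1] != "w":
            continue
        position += 1
        wid = elem.get("id")
        if wid is None:
            problems.append(f"word {position}: missing id")
        elif not pattern.match(wid):
            problems.append(f"word {position}: id {wid!r} does not match the expected pattern")
        elif wid in seen:
            problems.append(f"word {position}: duplicate id {wid!r}")
        else:
            seen.add(wid)
    return problems
```

Under a0, an import would abort as soon as this list is non-empty; under a1, the same check could instead be used to decide which w elements need a TXM-built ID.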

2) Whether tokenizing or not, apply the a2.2.3 policy on import (and on load if possible) and in all ID related functionalities.
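
As a rough illustration of the a2.2.3 policy, the following Python sketch writes a private 'txmid' attribute on every w element and leaves any foreign 'id' untouched. It is not the TXM implementation: the function name add_txm_ids and the "w_<text id>_<n>" ID format are assumptions made for this example; only the 'txmid' property name comes from this issue.

```python
import xml.etree.ElementTree as ET


def add_txm_ids(xml_text, text_id, attr="txmid"):
    """Number every <w> element in document order and store the ID in `attr`.

    The foreign 'id' attribute, if present, is left unchanged.
    """
    root = ET.fromstring(xml_text)
    n = 0
    for elem in root.iter():
        # Match <w> whatever its namespace.
        if not isinstance(elem.tag, str) or elem.tag.rsplit("}", 1)[-1] != "w":
            continue
        n += 1
        # ASSUMPTION: hypothetical ID format, chosen only for this sketch.
        elem.set(attr, f"w_{text_id}_{n}")
    return ET.tostring(root, encoding="unicode")


sample = '<text><s><w id="foreign-7">Hello</w> <w>world</w></s></text>'
result = add_txm_ids(sample, "demo")
```

Because the private IDs live in their own attribute, back-to-text and similar functionalities could look up 'txmid' consistently, whether or not the source was tokenized by TXM and whether or not it carried its own IDs.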


Related issues

related to Feature #1636: RCP: X.X, word tag and skip tokenization import parameters New 01/08/2016

History

#1 Updated by Serge Heiden about 1 year ago

  • Subject changed from TBX: 0.7.9, build word IDs if not present in w tags when not tokenizing to TBX: 0.7.9, build word IDs if not present in w tags for back-to-text when not tokenizing

#2 Updated by Serge Heiden about 1 year ago

  • Description updated (diff)
  • Priority changed from Normal to Urgent

#3 Updated by Serge Heiden about 1 year ago

  • Category changed from Edition to Import

#4 Updated by Sebastien Jacquot 12 months ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#5 Updated by Matthieu Decorde 3 months ago

  • Target version changed from TXM 0.8.0 to TXM 0.8.1
