Bug #2258

RCP: 0.7.8, XMLW and XTZ import modules, line breaks trimmed causing tokenization errors

Added by Alexey Lavrentev almost 6 years ago. Updated over 2 years ago.

Status: New
Start date: 10/09/2017
Priority: Urgent
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.8.2

Description

In text nodes, line breaks are trimmed, and hence words on adjacent lines are merged unless there is white space before the line break.
To reproduce the bug, import the following test file and check that "ouoperacion" ("ou" merged with "operacion") appears in the lexicon:

<text>
Tout art et toute doctrine et semblablement tout fait ou
operacion et eleccion appetent et desirent aucun bien. Pour
ce parloient bien les anciens en disant ainsi: " Bien est ce
que toutes choses desirent. " Et semble que il est difference
de fins; car les unes fins sont les operacions, les autres sont
</text>

It looks like the trimming happens before the file is sent to XSL filters, so it is impossible to use XSL to fix the problem.
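
The failure mode can be illustrated outside TXM. The Python sketch below is not TXM's import code: it only assumes that line breaks are dropped with no replacement before tokenization, and it uses a hypothetical whitespace tokenizer (`tokenize_after_trimming`) on the first two lines of the test file.

```python
# Illustration only: assumes line breaks are removed with no replacement
# before tokenization (this is not TXM's actual import code).
SAMPLE = (
    "Tout art et toute doctrine et semblablement tout fait ou\n"
    "operacion et eleccion appetent et desirent aucun bien. Pour\n"
)


def tokenize_after_trimming(text):
    """Drop every line break, then split on whitespace (hypothetical tokenizer)."""
    return text.replace("\n", "").split()


tokens = tokenize_after_trimming(SAMPLE)
# "ou" and "operacion" end up in a single merged token.
print("ouoperacion" in tokens)
```

Because the first line ends with "ou" and no trailing space, the two words collapse into one token, which is what shows up in the lexicon.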

Solution

  1. Replace each line break with a space (ideally only when it is not already preceded or followed by other white space)
  2. Trim the line breaks only after the XSLT filters have been applied
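
A minimal sketch of the first option in Python, assuming the content of a text node is available as a plain string. The name `normalize_line_breaks` is hypothetical and this is not TXM's implementation. It is a slight variant of the rule above: each line break, together with any white space around it, is collapsed into a single space, so no extra space is added where one already exists and blank lines do not merge words either.

```python
import re

# A line break (\r\n, \n or \r) plus any white space directly around it.
_LINE_BREAK_RUN = re.compile(r"[ \t]*(?:\r\n|\n|\r)[ \t\r\n]*")


def normalize_line_breaks(text):
    """Collapse every run containing a line break into exactly one space."""
    return _LINE_BREAK_RUN.sub(" ", text)


sample = "tout fait ou\noperacion et eleccion \n appetent"
# Words on different lines stay separate tokens after normalization.
print(normalize_line_breaks(sample).split())
```

A leading or trailing line break becomes a single space, which a later trim step can remove once the XSLT filters have run.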

History

#1 Updated by Alexey Lavrentev almost 6 years ago

  • Description updated

#2 Updated by Alexey Lavrentev almost 6 years ago

The bug seems to be fixed (TXM 0.7.8.201712011718). Should the progress status be updated?

#3 Updated by Alexey Lavrentev over 5 years ago

The test file works fine, but the problem persists when trying to catch line breaks in XTZ XSL filters.

#4 Updated by Sebastien Jacquot over 5 years ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#5 Updated by Matthieu Decorde over 4 years ago

  • Target version changed from TXM 0.8.0 to TXM 0.8.2

#6 Updated by Matthieu Decorde about 3 years ago

  • Category set to Import

#7 Updated by Matthieu Decorde over 2 years ago

  • % Done changed from 0 to 80
