Bug #1179

TBX: x.x, end of line in word forms encoded with <w> tag

Added by Serge Heiden over 4 years ago. Updated 11 months ago.

Status:Feedback Start date:12/10/2014
Priority:Normal Due date:
Assignee:- % Done:70%

Category:Import Spent time: -
Target version:TXM 0.8.0

Description

In XML sources, when a word is pre-encoded with a <w>...</w> tag
and the word form contains an end of line, the resulting word form is incorrect
because the end of line is simply removed from the graphical form.

For example: <w>parce
que</w> gives the word form 'parceque' instead of 'parce que'.

Solution 1

Replace every newline and tab character with a space character at the tokenization level.

MD: 80% -> 70%; must check whether the Unicode character classes are used.
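
A minimal sketch of what Solution 1 could look like, assuming the word form is available as a Java String at tokenization time. The class name NewlineTabNormalizer and the method normalize are hypothetical and are not taken from the TXM code base. Treating a CR+LF pair as a single line ending is also an assumption of this sketch.

```java
// Hypothetical helper, not part of TXM: illustrates Solution 1 only.
final class NewlineTabNormalizer {

    private NewlineTabNormalizer() {
    }

    /**
     * Replaces each line ending (CR+LF, CR or LF) and each tab
     * in a word form with a single space character.
     */
    static String normalize(String form) {
        // "\r\n" is listed first so that a Windows line ending
        // yields one space instead of two (assumption of this sketch).
        return form.replaceAll("\\r\\n|[\\r\\n\\t]", " ");
    }
}
```

With this sketch, NewlineTabNormalizer.normalize("parce\nque") returns "parce que". Only the three ASCII control characters are handled, so other Unicode separators pass through unchanged, which is the open point raised in the note above.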

Solution 2

Replace every whitespace character, as defined by Java, with a space character.

Java whitespace characters are defined by the Character.isWhitespace method (http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-)
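
A minimal sketch of Solution 2, again assuming the word form is a Java String. The class name WhitespaceNormalizer and the method normalize are hypothetical. Character.isWhitespace(int) and String.codePoints() are standard Java 8 API. Note that Character.isWhitespace deliberately returns false for the non-breaking spaces U+00A0, U+2007 and U+202F, so those are left untouched here.

```java
// Hypothetical helper, not part of TXM: illustrates Solution 2 only.
final class WhitespaceNormalizer {

    private WhitespaceNormalizer() {
    }

    /**
     * Replaces every code point that Character.isWhitespace(int)
     * reports as whitespace with a plain space (U+0020).
     */
    static String normalize(String form) {
        StringBuilder out = new StringBuilder(form.length());
        // Iterate over code points so that supplementary characters
        // (outside the Basic Multilingual Plane) are handled correctly.
        form.codePoints().forEach(cp ->
                out.appendCodePoint(Character.isWhitespace(cp) ? ' ' : cp));
        return out.toString();
    }
}
```

Unlike the newline-and-tab approach, this variant replaces each whitespace code point individually, so a CR+LF pair becomes two spaces. Whether runs should be collapsed into one space is left open in this sketch.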

History

#1 Updated by Serge Heiden over 4 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde over 4 years ago

  • % Done changed from 0 to 80

#3 Updated by Matthieu Decorde over 4 years ago

  • Status changed from New to Feedback

#4 Updated by Serge Heiden over 4 years ago

  • Description updated (diff)

#5 Updated by Matthieu Decorde about 4 years ago

  • Description updated (diff)
  • Target version changed from TXM 0.7.7 to TXM 0.7.8
  • % Done changed from 80 to 70

#6 Updated by Matthieu Decorde over 3 years ago

  • Target version changed from TXM 0.7.8 to TXM 0.8.0a (split/restructuration)

#7 Updated by Sebastien Jacquot 11 months ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
