Bug #1966

RCP: 0.7.8, word segmentation and typography rules broken in XML/w pager

Added by Serge Heiden over 6 years ago. Updated over 6 years ago.

Status: New    Start date: 12/14/2016
Priority: Normal    Due date:
Assignee: -    % Done: 80%
Category: Edition    Spent time: -
Target version: TXM 0.7.8

Description

Since the introduction of the new clitic rules management in the tokenizer, some graphical forms are split into several tokens.

For example, in English ('en'): I don't -> I do n't

The corresponding word properties are (form/pos/lemma): I/PP/I do/VVP/do n't/RB/n't

The XML/w page outputs the following surface (graphical forms): I do n't

The correct surface should be: I don't (as in the source)

Solution

Make the typographic rules of the Page renderer clitic-aware, so that no space is inserted between a word and a following clitic (here, between "do" and "n't").

MD: rules added for the clitics: "'s","'re","'ve","'d","'m","'em","'ll","n't"
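
The rule described above can be illustrated with a minimal sketch. This is not the TXM Page renderer code: the function name render_surface and the constant ENGLISH_CLITICS are hypothetical, and only the clitic list comes from this ticket. The sketch assumes the renderer receives the word forms in order and otherwise separates them with a single space; all other typographic rules (punctuation, quotes) are ignored.

```python
# Hypothetical sketch, not the actual TXM Page renderer.
# Clitic forms taken from the rule list in this ticket.
ENGLISH_CLITICS = {"'s", "'re", "'ve", "'d", "'m", "'em", "'ll", "n't"}


def render_surface(tokens, clitics=ENGLISH_CLITICS):
    """Join word forms with single spaces, except before a clitic token."""
    parts = []
    for i, form in enumerate(tokens):
        # Insert a space before every token except the first one
        # and except tokens that are clitics (case-insensitive match).
        if i > 0 and form.lower() not in clitics:
            parts.append(" ")
        parts.append(form)
    return "".join(parts)


print(render_surface(["I", "do", "n't"]))  # I don't
```

With this rule, the segmented forms I / do / n't are rendered as "I don't" again, while the word properties (form/pos/lemma) of each token stay untouched.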

dont.png (14 kB) Serge Heiden, 12/14/2016 08:57 pm

History

#1 Updated by Serge Heiden over 6 years ago

  • Description updated (diff)

#2 Updated by Serge Heiden over 6 years ago

  • Description updated (diff)

#3 Updated by Matthieu Decorde over 6 years ago

  • Description updated (diff)
  • % Done changed from 0 to 80
