Bug #2059

RCP: 0.7.8, fix pre-encoded word properties in XML/w+CSV

Added by Serge Heiden almost 4 years ago. Updated about 1 month ago.

Status: New
Start date: 03/07/2017
Priority: Urgent
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM 0.8.2

Description

Currently, if a <w> element in an XML source pre-encodes a property that TreeTagger can also produce, the TreeTagger value is still added to the word. The pre-encoded property should be left untouched, because pre-encoding has priority over on-the-fly annotation.
For example, the following XML source:

établissements membres et d’un organisme de recherche associé, l’INSERM.
<w frpos="PUN">■</w>

L’Université Claude Bernard, qui forme chaque année 40 000 étudiants dans les sciences

produces the following TXM text:

établissements membres et d’un organisme de recherche associé, l’INSERM. ■ L’Université Claude Bernard, qui forme chaque année 40 000 étudiants dans les sciences

where the properties of the '■' word are:
  • frpos:PUN
  • n:4516
  • frpos:NOM
  • frlemma:■

instead of the following correct TXM text:

établissements membres et d’un organisme de recherche associé, l’INSERM. ■ L’Université Claude Bernard, qui forme chaque année 40 000 étudiants dans les sciences

where the properties of the '■' word are:
  • frpos:PUN
  • n:4516
  • frlemma:■
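
The expected behaviour can be sketched as a merge in which pre-encoded values win. This is only an illustration of the priority rule, not the TXM import code: the function name `merge_word_properties` and the dict representation of word properties are assumptions, and `n` is assumed to be set on the word before tagging.

```python
def merge_word_properties(pre_encoded, tagger_output):
    """Merge properties so that pre-encoded values take priority.

    Hypothetical helper for illustration; not part of TXM.
    """
    merged = dict(pre_encoded)
    for name, value in tagger_output.items():
        # Add a tagger property only if the source did not already encode it.
        merged.setdefault(name, value)
    return merged


# Properties already on the '■' word (frpos comes from the XML source).
pre_encoded = {"frpos": "PUN", "n": "4516"}
# Properties proposed by the tagger for the same word.
tagger_output = {"frpos": "NOM", "frlemma": "■"}

result = merge_word_properties(pre_encoded, tagger_output)
print(result)
```

With these inputs, `frpos` stays `PUN` and only `frlemma` is taken from the tagger, which matches the expected property list above.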

Solution

Add a new import parameter to enable or disable the existing annotation correction. For details, see https://groupes.renater.fr/wiki/txm-info/public/annotation/tal_treetagger

History

#1 Updated by Matthieu Decorde over 3 years ago

  • Priority changed from Normal to Urgent

#2 Updated by Sebastien Jacquot over 2 years ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#3 Updated by Matthieu Decorde almost 2 years ago

  • Target version changed from TXM 0.8.0 to TXM 0.8.2

#4 Updated by Matthieu Decorde about 1 month ago

  • Description updated
