Bug #2467

RCP: 0.7.9, newline in word property values breaks CQP indexing

Added by Serge Heiden almost 7 years ago. Updated over 6 years ago.

Status: New
Start date: 30/10/2018
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM 0.X.X

Description

Word property values can come from external sources or from XSLT pre-processing results, not only from TXM itself.

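If a word property value contains a newline in the XML-TXM representation, the newline is copied into the CQP representation as well. Since cwb-encode reads its input one token per line, with tab-separated attribute values, the extra line break splits the token and breaks indexing.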

Example

The following XML-TXM, where each 'ref' property value begins with a newline:

<p>
<w id="w_rd0010_1"><txm:form>–</txm:form><txm:ana resp="none" type="#n">1</txm:ana><txm:ana resp="none" type="#ref">
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1</txm:ana><txm:ana resp="#txm" type="#frpos">PUN</txm:ana><txm:ana resp="#txm" type="#frlemma">–</txm:ana></w>
<w id="w_rd0010_2"><txm:form>Duchesse</txm:form><txm:ana resp="none" type="#n">2</txm:ana><txm:ana resp="none" type="#ref">
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2</txm:ana><txm:ana resp="#txm" type="#frpos">NAM</txm:ana><txm:ana resp="#txm" type="#frlemma">Duchesse</txm:ana></w>
<w id="w_rd0010_3"><txm:form>!</txm:form><txm:ana resp="none" type="#n">3</txm:ana><txm:ana resp="none" type="#ref">
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3</txm:ana><txm:ana resp="#txm" type="#frpos">SENT</txm:ana><txm:ana resp="#txm" type="#frlemma">!</txm:ana></w>
</p>

Produces the following CQP representation, in which each word is split across two lines, so cwb-encode cannot index it correctly:

<p n="0">
–       w_rd0010_1      
        filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1  PUN     –       1
Duchesse        w_rd0010_2      
        filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2  NAM     Duchesse        2
!       w_rd0010_3      
        filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3  SENT    !       3
</p>
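
To make the breakage concrete, here is a minimal sketch that scans a verticalized text for rows whose number of tab-separated fields differs from the expected count. The function name `find_malformed_rows` and the sample string are hypothetical and only modeled on the example above; this is not part of TXM or CWB. It assumes that lines starting with `<` are structural tags and skips them.

```python
def find_malformed_rows(vertical_text, expected_columns):
    """Return (line_number, field_count) for each token row with a wrong field count.

    Hypothetical helper: assumes one token per line, tab-separated fields,
    and that lines starting with '<' are structural tags.
    """
    malformed = []
    for line_number, line in enumerate(vertical_text.split("\n"), start=1):
        if not line or line.startswith("<"):
            continue  # skip empty lines and structural tags
        field_count = len(line.split("\t"))
        if field_count != expected_columns:
            malformed.append((line_number, field_count))
    return malformed


# Sample modeled on the report: the 'ref' value starts with a newline,
# so one word ends up on two physical lines.
sample = (
    '<p n="0">\n'
    "Duchesse\tw_rd0010_2\t\n"
    "        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2"
    "\tNAM\tDuchesse\t2\n"
    "</p>"
)

print(find_malformed_rows(sample, expected_columns=6))
```

With six expected fields (form, id, ref, frpos, frlemma, n), both halves of the split word are reported, because neither physical line carries all six fields.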

Solution

Replace every newline character in word property values with a space character, at the very least before the CQP representation is produced.
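
The proposed fix could be sketched as follows. This is an illustration under stated assumptions, not TXM's actual import code: `normalize_property_value` and `to_vertical_row` are hypothetical names, and treating `\r\n` and `\r` as line breaks alongside `\n` is an assumption that goes slightly beyond the report.

```python
import re

# Assumption: any of \r\n, \r or \n counts as one line break.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_property_value(value):
    """Replace every line break in a word property value with one space."""
    return _LINE_BREAK.sub(" ", value)


def to_vertical_row(form, property_values):
    """Build one tab-separated row from a word form and its property values."""
    cells = [form] + [normalize_property_value(v) for v in property_values]
    return "\t".join(cells)


row = to_vertical_row(
    "Duchesse",
    [
        "w_rd0010_2",
        "\n        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2",
        "NAM",
        "Duchesse",
        "2",
    ],
)
print(repr(row))
```

The sketch only replaces line breaks, as the report suggests. The indentation that followed the newline in the 'ref' value is kept, so the value still starts with spaces; trimming it would be a separate decision.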

History

#1 Updated by Serge Heiden almost 7 years ago

  • Subject changed from RCP: 0.7.9, newline in word properties breaks CQP indexing to RCP: 0.7.9, newline in word property values breaks CQP indexing

#2 Updated by Matthieu Decorde over 6 years ago

  • Target version changed from TXM 0.8.0 to TXM 0.X.X
