Bug #2467

RCP: 0.7.9, newline in word property values breaks CQP indexing

Added by Serge Heiden almost 5 years ago. Updated over 4 years ago.

Status: New
Start date: 10/30/2018
Priority: High
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

Word property values can come from external sources or from XSLT pre-processing results, not only from TXM itself.

If word property values contain newlines in the XML-TXM representation, the newlines are copied as-is into the CQP representation, where they split a token line in two and break cwb-encode indexing.

Example

The following XML-TXM, where each 'ref' property value contains one newline:

<p>
<w id="w_rd0010_1"><txm:form>–</txm:form><txm:ana resp="none" type="#n">1</txm:ana><txm:ana resp="none" type="#ref">
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1</txm:ana><txm:ana resp="#txm" type="#frpos">PUN</txm:ana><txm:ana resp="#txm" type="#frlemma">–</txm:ana></w>
<w id="w_rd0010_2"><txm:form>Duchesse</txm:form><txm:ana resp="none" type="#n">2</txm:ana><txm:ana resp="none" type="#ref">
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2</txm:ana><txm:ana resp="#txm" type="#frpos">NAM</txm:ana><txm:ana resp="#txm" type="#frlemma">Duchesse</txm:ana></w>
<w id="w_rd0010_3"><txm:form>!</txm:form><txm:ana resp="none" type="#n">3</txm:ana><txm:ana resp="none" type="#ref">
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3</txm:ana><txm:ana resp="#txm" type="#frpos">SENT</txm:ana><txm:ana resp="#txm" type="#frlemma">!</txm:ana></w>
</p>

Produces the following CQP representation, which cwb-encode cannot index correctly:

<p n="0">
–       w_rd0010_1      
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1  PUN     –       1
Duchesse        w_rd0010_2      
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2  NAM     Duchesse        2
!       w_rd0010_3      
        filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3  SENT    !       3
</p>
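
The breakage can be made visible by counting columns. cwb-encode reads one token per line, with the property values separated by tabs, so a newline inside a value leaves two lines that each have too few columns. The following is a minimal sketch of such a check, not part of TXM: the function name `find_malformed_lines` is hypothetical, the columns are assumed to be tab-separated in the actual file, and any line that starts with `<` and ends with `>` is assumed to be a structure tag.

```python
def find_malformed_lines(cqp_text: str, expected_columns: int) -> list[int]:
    """Return the 1-based numbers of token lines with a wrong column count.

    Hypothetical helper for illustration only. Assumes tab-separated
    columns and treats lines of the form <...> as structure tags.
    """
    malformed = []
    for number, line in enumerate(cqp_text.split("\n"), start=1):
        if not line:
            continue  # ignore empty lines, e.g. after the final newline
        if line.startswith("<") and line.endswith(">"):
            continue  # structure tag such as <p n="0"> or </p>
        if len(line.split("\t")) != expected_columns:
            malformed.append(number)
    return malformed
```

With six expected columns (form, id, ref, frpos, frlemma, n), each token of the example above would be reported twice: once for the line that ends after the id column and once for the line that starts with the rest of the 'ref' value.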

Solution

Replace every newline character in word property values with a space character, at the very least before the CQP representation is produced.
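
The proposed replacement could look like the sketch below. It is an illustration in Python, not the TXM implementation: the names `sanitize_property_value` and `to_cqp_line` are hypothetical, and treating CRLF and a lone CR as line breaks in addition to LF is an assumption that goes slightly beyond what this ticket states.

```python
import re

# One line break in any common convention: CRLF, lone CR, or LF.
# Handling CRLF and CR is an assumption; the ticket only mentions newlines.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def sanitize_property_value(value: str) -> str:
    """Replace every line break in a word property value with one space."""
    return _LINE_BREAK.sub(" ", value)


def to_cqp_line(values: list[str]) -> str:
    """Join sanitized property values into a single tab-separated token line."""
    return "\t".join(sanitize_property_value(v) for v in values)
```

Applying the replacement at the point where the token line is assembled guarantees one line per token regardless of where the value came from (external source, XSLT pre-processing, or TXM itself). The indentation that followed the newline is kept unchanged, so the 'ref' values of the example would start with several spaces.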

History

#1 Updated by Serge Heiden almost 5 years ago

  • Subject changed from RCP: 0.7.9, newline in word properties breaks CQP indexing to RCP: 0.7.9, newline in word property values breaks CQP indexing

#2 Updated by Matthieu Decorde over 4 years ago

  • Target version changed from TXM 0.8.0 to TXM X.X
