Bug #2467
RCP: 0.7.9, newline in word property values breaks CQP indexing
| Status: | New | Start date: | 10/30/2018 |
|---|---|---|---|
| Priority: | High | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Import | Spent time: | - |
| Target version: | TXM X.X | | |
Description
Word property values can come from external sources or from XSLT pre-processing results, not only from TXM itself.
If word property values contain newlines in the XML-TXM representation, the newlines are copied unchanged into the CQP representation, which breaks cwb-encode indexing.
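The failure can be illustrated with a minimal Python sketch. It relies on one fact about the cwb-encode input format: each token occupies one physical line, with its attribute values separated by TAB characters. The helper names `to_vertical_line` and `count_columns_per_line` and the sample values are hypothetical and serve only as illustration; this is not TXM code.

```python
def to_vertical_line(form, properties):
    """Join a word form and its property values into one TAB-separated token line."""
    return "\t".join([form, *properties])


def count_columns_per_line(vertical_text):
    """Return the number of TAB-separated columns on each physical line."""
    return [len(line.split("\t")) for line in vertical_text.split("\n")]


# Same token, once with a clean value and once with a newline inside the value.
clean = to_vertical_line("Duchesse", ["w_rd0010_2", "ref value", "NAM"])
broken = to_vertical_line("Duchesse", ["w_rd0010_2", "ref\nvalue", "NAM"])

print(count_columns_per_line(clean))   # [4]
print(count_columns_per_line(broken))  # [3, 2]
```

With the embedded newline, the single token is read as two lines, neither of which has the expected number of columns.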
Example
The following XML-TXM, where the 'ref' property values contain one newline:
<p> <w id="w_rd0010_1"><txm:form>–</txm:form><txm:ana resp="none" type="#n">1</txm:ana><txm:ana resp="none" type="#ref"> filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1</txm:ana><txm:ana resp="#txm" type="#frpos">PUN</txm:ana><txm:ana resp="#txm" type="#frlemma">–</txm:ana></w> <w id="w_rd0010_2"><txm:form>Duchesse</txm:form><txm:ana resp="none" type="#n">2</txm:ana><txm:ana resp="none" type="#ref"> filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2</txm:ana><txm:ana resp="#txm" type="#frpos">NAM</txm:ana><txm:ana resp="#txm" type="#frlemma">Duchesse</txm:ana></w> <w id="w_rd0010_3"><txm:form>!</txm:form><txm:ana resp="none" type="#n">3</txm:ana><txm:ana resp="none" type="#ref"> filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3</txm:ana><txm:ana resp="#txm" type="#frpos">SENT</txm:ana><txm:ana resp="#txm" type="#frlemma">!</txm:ana></w> </p>
Produces the following CQP, which breaks proper cwb-encoding:
<p n="0"> – w_rd0010_1 filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1 PUN – 1 Duchesse w_rd0010_2 filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2 NAM Duchesse 2 ! w_rd0010_3 filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3 SENT ! 3 </p>
Solution
Replace every newline character in word property values with a space character, at the very least before producing the CQP representation.
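A minimal Python sketch of this replacement follows. The function name `sanitize_property_value` is hypothetical, and treating a Windows-style `\r\n` pair as a single line break (so it yields one space rather than two) is an assumption not stated in this ticket. The actual fix would belong in TXM's import code, which this sketch does not reproduce.

```python
import re

# Assumption: "\r\n", "\r" and "\n" all count as one line break each.
_LINE_BREAKS = re.compile(r"\r\n|\r|\n")


def sanitize_property_value(value):
    """Replace each line break in a word property value with a single space."""
    return _LINE_BREAKS.sub(" ", value)


# Illustrative value only, loosely modelled on the 'ref' property above.
raw = "\nfiledir = some/dir,\nmetadata=, 1"
print(repr(sanitize_property_value(raw)))
```

Applying the function to every property value just before the token line is written guarantees that each token stays on a single line, whatever the origin of the value (external source, XSLT pre-processing, or TXM itself).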
History
#1 Updated by Serge Heiden almost 5 years ago
- Subject changed from RCP: 0.7.9, newline in word properties breaks CQP indexing to RCP: 0.7.9, newline in word property values breaks CQP indexing
#2 Updated by Matthieu Decorde over 4 years ago
- Target version changed from TXM 0.8.0 to TXM X.X