Bug #2467
RCP: 0.7.9, newline in word property values breaks CQP indexing
Status: | New | Start date: | 30/10/2018 |
---|---|---|---|
Priority: | High | Due date: | |
Assignee: | - | % Done: | 0% |
Category: | Import | Spent time: | - |
Target version: | TXM 0.X.X | | |
Description
Word property values can come from external sources or from XSLT pre-processing results, not only from TXM itself.
If word property values contain newlines in the XML-TXM representation, the CQP representation carries them over unchanged and cwb-encode indexing breaks.
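To illustrate the failure mode: cwb-encode reads verticalized input, one token per line, with the property values separated by tabs. The sketch below is only an illustration, not TXM's actual export code. `to_vrt_line` is a hypothetical helper, the property values are made up, and the newline position inside the 'ref' value is an assumption.

```python
def to_vrt_line(form, props):
    """Naively join a word form and its property values with tabs (no sanitizing)."""
    return "\t".join([form, *props])


# Hypothetical token: the 'ref' value (second property) starts with a newline.
props = ["w_rd0010_2", "\nsome ref value, 2", "NAM", "Duchesse", "2"]
line = to_vrt_line("Duchesse", props)

# What should be one physical line is now two, each with too few columns.
physical_lines = line.split("\n")
column_counts = [len(part.split("\t")) for part in physical_lines]
expected_columns = 1 + len(props)

print(len(physical_lines), column_counts, expected_columns)
```

Each fragment ends up with fewer columns than the expected number, so the token is read as two malformed tokens instead of one.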
Example
The following XML-TXM, where each 'ref' property value contains one newline:
<p> <w id="w_rd0010_1"><txm:form>–</txm:form><txm:ana resp="none" type="#n">1</txm:ana><txm:ana resp="none" type="#ref"> filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1</txm:ana><txm:ana resp="#txm" type="#frpos">PUN</txm:ana><txm:ana resp="#txm" type="#frlemma">–</txm:ana></w> <w id="w_rd0010_2"><txm:form>Duchesse</txm:form><txm:ana resp="none" type="#n">2</txm:ana><txm:ana resp="none" type="#ref"> filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2</txm:ana><txm:ana resp="#txm" type="#frpos">NAM</txm:ana><txm:ana resp="#txm" type="#frlemma">Duchesse</txm:ana></w> <w id="w_rd0010_3"><txm:form>!</txm:form><txm:ana resp="none" type="#n">3</txm:ana><txm:ana resp="none" type="#ref"> filedir = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3</txm:ana><txm:ana resp="#txm" type="#frpos">SENT</txm:ana><txm:ana resp="#txm" type="#frlemma">!</txm:ana></w> </p>
Produces the following CQP, which breaks proper cwb-encoding:
<p n="0"> – w_rd0010_1 filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 1 PUN – 1 Duchesse w_rd0010_2 filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 2 NAM Duchesse 2 ! w_rd0010_3 filedire = file:/home/sheiden/TXM/corpora/ELTECFRAORIG/tokenized, metadata=, 3 SENT ! 3 </p>
Solution
Replace every newline character in word property values with a space character, at the latest before the CQP representation is produced.
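A minimal sketch of that replacement, assuming it is applied to each property value just before the CQP line is written. The function name `sanitize_property_value` is hypothetical and not part of TXM. Treating a Windows-style `\r\n` pair as a single newline is an added assumption.

```python
def sanitize_property_value(value: str) -> str:
    """Replace every newline in a word property value with a space.

    A CR+LF pair is handled first so that it yields one space, not two
    (assumption: such a pair counts as a single newline).
    """
    return value.replace("\r\n", " ").replace("\n", " ").replace("\r", " ")


# Hypothetical 'ref' value with a leading newline.
raw = "\nsome ref value, 2"
clean = sanitize_property_value(raw)
print(repr(clean))
```

After sanitizing, a property value can no longer split a token across several lines of the CQP input.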
History
#1 Updated by Serge Heiden almost 7 years ago
- Subject changed from RCP: 0.7.9, newline in word properties breaks CQP indexing to RCP: 0.7.9, newline in word property values breaks CQP indexing
#2 Updated by Matthieu Decorde over 6 years ago
- Target version changed from TXM 0.8.0 to TXM 0.X.X