Bug #716

TBX: 0.7.5, Import CWB, entities

Added by Matthieu Decorde over 11 years ago. Updated over 11 years ago.

Status: New Start date: 24/03/2014
Priority: Normal Due date:
Assignee: - % Done:

0%

Category: Import Spent time: -
Target version: TXM 0.X.X

Description

To find the structures to declare in the registry file, the WTC file is parsed as XML.
If the WTC file contains "&", parsing fails with an entity error:

Message: The entity name must immediately follow the '&' in the entity reference.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:592)
    at org.txm.importer.cwb.BuildCwbEncodeArgs.process(BuildCwbEncodeArgs.java:99)
    at org.txm.importer.cwb.BuildCwbEncodeArgs$process.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
    at org.txm.importer.wtc.compiler.run(compiler.groovy:144)
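
The failure can be reproduced outside TXM. The following is a minimal sketch, not the code of BuildCwbEncodeArgs: the class name AmpersandRepro, the helper parsesAsXml and the sample content are hypothetical, and the sample assumes a word line whose token is a bare "&". It feeds the content to the JDK StAX reader, the same parser family that appears in the stack trace.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class AmpersandRepro {
    /** Returns true if a StAX reader walks the whole input without an error. */
    static boolean parsesAsXml(String content) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(content));
            while (reader.hasNext()) {
                reader.next(); // throws XMLStreamException on malformed input
            }
            reader.close();
            return true;
        } catch (XMLStreamException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample content: structure tags plus one word line each.
        String plain = "<text id=\"t1\">\nhello\tUH\n</text>\n";
        String withAmpersand = "<text id=\"t1\">\n&\tCC\n</text>\n";
        System.out.println("plain: " + parsesAsXml(plain));
        System.out.println("with ampersand: " + parsesAsXml(withAmpersand));
    }
}
```

The second sample is rejected because a raw "&" is not well-formed XML, although it is a legitimate token on a word line.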

History

#1 Updated by Serge Heiden over 11 years ago

The CWB file, as the ticket title names it (what you call the "WTC" file), should not be parsed as XML
because the CWB format is not based on the XML standard at all:
  • the CWB file format only poorly mimics XML (space between attribute-value pairs is significant, etc.)
  • the CWB format allows any XML special character to be used anywhere on a word line (without any special XML convention)
  • no XML processing instructions are allowed
  • no xi:include element either
  • etc.

The code must only mimic simple XML element parsing, and only for XML-like lines.
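
One possible shape for such a line-based scan is sketched below. It is only an illustration under stated assumptions, not the TXM implementation: the class CwbStructureScanner, the method scan and both regular expressions are hypothetical, and the sketch assumes that a structure line consists of a single opening tag whose attribute-value pairs are each preceded by exactly one space. Every other line (word lines, closing tags) is ignored, so "&" or "<" on a word line never reaches an XML parser.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CwbStructureScanner {
    // Assumed shape of an opening structure line: <name attr="value" ...>
    private static final Pattern OPEN_TAG = Pattern.compile(
            "^<([A-Za-z_][A-Za-z0-9_-]*)((?: [A-Za-z_][A-Za-z0-9_-]*=\"[^\"]*\")*)>$");
    // One attribute-value pair inside the attribute part of an opening tag.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            " ([A-Za-z_][A-Za-z0-9_-]*)=\"[^\"]*\"");

    /** Maps each structure name to the attribute names seen on its opening tags. */
    static Map<String, Set<String>> scan(List<String> lines) {
        Map<String, Set<String>> structures = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher open = OPEN_TAG.matcher(line);
            if (!open.matches()) {
                continue; // word line, closing tag, or anything else: ignored
            }
            Set<String> attributes = structures.computeIfAbsent(
                    open.group(1), name -> new LinkedHashSet<>());
            Matcher attribute = ATTRIBUTE.matcher(open.group(2));
            while (attribute.find()) {
                attributes.add(attribute.group(1));
            }
        }
        return structures;
    }

    public static void main(String[] args) {
        // Hypothetical sample: word lines contain raw XML special characters.
        List<String> lines = List.of(
                "<text id=\"t1\" lang=\"en\">",
                "<s>",
                "AT&T\tNP",
                "&\tCC",
                "<\tSYM",
                "</s>",
                "</text>");
        System.out.println(scan(lines)); // prints {text=[id, lang], s=[]}
    }
}
```

The resulting map holds the structure names and attribute names that would have to be declared; how they are turned into registry declarations is left out here.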
