Bug #716

TBX: 0.7.5, Import CWB, entities

Added by Matthieu Decorde about 5 years ago. Updated about 5 years ago.

Status: New
Start date: 03/24/2014
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

To find the structures to declare in the registry file, the WTC file is parsed as XML.
If the WTC file contains "&", the parsing fails with an entity error:

Message: The entity name must immediately follow the '&' in the entity reference.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:592)
    at org.txm.importer.cwb.BuildCwbEncodeArgs.process(BuildCwbEncodeArgs.java:99)
    at org.txm.importer.cwb.BuildCwbEncodeArgs$process.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
    at org.txm.importer.wtc.compiler.run(compiler.groovy:144)
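
The failure can probably be reproduced outside TXM with a few lines of Java. The sketch below is an assumption about how the file is read (a plain StAX reader from the JDK, which is what the XMLStreamReaderImpl frame suggests); the class name AmpersandRepro and the sample content are hypothetical and not taken from the TXM sources.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical reproduction, not TXM code.
class AmpersandRepro {

    /** Returns the parser's error message, or null if the content parsed cleanly. */
    static String parseError(String content) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(content));
            // Walk through every event, as a structure-discovery pass would.
            while (reader.hasNext()) {
                reader.next();
            }
            reader.close();
            return null;
        } catch (XMLStreamException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed sample: one token per line, word and tag separated by a tab.
        // The bare "&" token is legal in a CWB input file but not in XML.
        String wtc = "<text id=\"t1\">\n<s>\n&\tCC\n</s>\n</text>\n";
        System.out.println(parseError(wtc));
    }
}
```

With the JDK's built-in parser, the bare "&" token should surface an entity error like the one reported above, while the same content without that token parses cleanly.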

History

#1 Updated by Serge Heiden about 5 years ago

The CWB file, as the ticket title names it (what you call the "WTC" file), should not be parsed as XML
because the CWB format is not based on the XML standard at all:
  • the CWB file format only loosely mimics XML (for example, whitespace between attribute-value pairs is significant)
  • the CWB format allows any XML special character anywhere on a word line, without any XML escaping convention
  • no XML processing instructions are allowed
  • no xi:include element either
  • etc.

The code must only mimic simple XML element parsing, and only for XML-like lines.
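
One possible shape for such a line-based scan is sketched below. It is not the TXM implementation: the class name CwbStructureScanner, the tag and attribute patterns, and the rule that a structure line consists of exactly one opening tag are assumptions made for illustration. In particular, the patterns accept any whitespace between attribute-value pairs, which is looser than the real format.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not TXM code.
class CwbStructureScanner {

    // Assumption: a structure line is entirely one opening tag, e.g. <text id="t1">.
    private static final Pattern OPEN_TAG = Pattern.compile(
            "^<([A-Za-z_][A-Za-z0-9_-]*)((?:\\s+[A-Za-z_][A-Za-z0-9_-]*=\"[^\"]*\")*)\\s*>$");

    // Assumption: attributes are written name="value" with double quotes.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([A-Za-z_][A-Za-z0-9_-]*)=\"[^\"]*\"");

    /** Maps each structure name to the attribute names seen on its opening tags. */
    static Map<String, Set<String>> scan(Iterable<String> lines) {
        Map<String, Set<String>> structures = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher tag = OPEN_TAG.matcher(line);
            if (!tag.matches()) {
                continue; // word line, closing tag, or anything else: never parsed as XML
            }
            Set<String> attrs =
                    structures.computeIfAbsent(tag.group(1), k -> new LinkedHashSet<>());
            Matcher attr = ATTRIBUTE.matcher(tag.group(2));
            while (attr.find()) {
                attrs.add(attr.group(1));
            }
        }
        return structures;
    }
}
```

Because word lines are skipped without being interpreted, tokens such as "&" or "<" no longer matter: only lines that match the opening-tag pattern contribute to the list of structures to declare in the registry file.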
