Bug #716
TBX: 0.7.5, Import CWB, entities
| Status: | New | Start date: | 03/24/2014 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Import | Spent time: | - |
| Target version: | TXM X.X | | |
Description
To find the structures to declare in the registry file, the WTC file is parsed as XML.
If the WTC file contains "&", the parsing fails with an entity error:
Message: The entity name must immediately follow the '&' in the entity reference.
  at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:592)
  at org.txm.importer.cwb.BuildCwbEncodeArgs.process(BuildCwbEncodeArgs.java:99)
  at org.txm.importer.cwb.BuildCwbEncodeArgs$process.call(Unknown Source)
  at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
  at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
  at org.txm.importer.wtc.compiler.run(compiler.groovy:144)
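A minimal reproduction sketch of the failure, assuming the file is read with a standard StAX reader as the stack trace suggests. The class name `AmpersandRepro` and the sample lines are hypothetical and are not taken from the TXM sources.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical reproduction: feeds CWB-style content to a StAX parser. */
final class AmpersandRepro {

    /** Returns true if the StAX parser rejects the given content. */
    static boolean failsAsXml(String content) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(content));
            while (reader.hasNext()) {
                reader.next(); // throws on a bare '&' in character data
            }
            return false;
        } catch (XMLStreamException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A word line made of a bare '&' is legal CWB input but not legal XML.
        System.out.println(failsAsXml("<s>\nR\tNN\n&\tCC\nD\tNN\n</s>"));
        // The same structure without '&' happens to be well-formed XML.
        System.out.println(failsAsXml("<s>\nR\tNN\nand\tCC\nD\tNN\n</s>"));
    }
}
```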
History
#1 Updated by Serge Heiden about 9 years ago
The CWB file, as the ticket title names it (what you call the "WTC" file), should not be parsed as XML
because the CWB format is not based on the XML standard at all:
- the CWB file format only loosely mimics XML (whitespace between attribute-value pairs is significant, etc.)
- the CWB format allows any XML special character to appear anywhere on a word line (without any XML escaping convention)
- no XML processing instructions are allowed
- no xi:include element either
- etc.
The code must only mimic simple XML element parsing, and only for XML-like lines.
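One way this could look is sketched below. It reads the file line by line and treats a line as a structure declaration only if the whole line looks like an opening tag. The class `CwbStructureScanner`, its `scan` method and the regular expressions are assumptions made for illustration, not the actual `BuildCwbEncodeArgs` code, and the accepted tag syntax is deliberately narrow.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects structure and attribute names from CWB input lines. */
final class CwbStructureScanner {

    // A whole line that looks like an opening tag: <name attr="value" ...>
    private static final Pattern OPEN_TAG = Pattern.compile(
            "^<([A-Za-z_][A-Za-z0-9_.-]*)"
            + "((?:\\s+[A-Za-z_][A-Za-z0-9_.:-]*=\"[^\"]*\")*)\\s*/?>$");

    // One attr="value" pair inside the attribute part of a tag line.
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "([A-Za-z_][A-Za-z0-9_.:-]*)=\"[^\"]*\"");

    /** Maps each structure name to the attribute names seen on its opening tags. */
    static Map<String, Set<String>> scan(Iterable<String> lines) {
        Map<String, Set<String>> structures = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher tag = OPEN_TAG.matcher(line);
            if (!tag.matches()) {
                continue; // word line, closing tag, or anything else: skipped as-is
            }
            Set<String> attrs =
                    structures.computeIfAbsent(tag.group(1), k -> new LinkedHashSet<>());
            Matcher attr = ATTRIBUTE.matcher(tag.group(2));
            while (attr.find()) {
                attrs.add(attr.group(1));
            }
        }
        return structures;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "<text id=\"t1\" lang=\"en\">",
                "<s>",
                "R&D\tNN",   // '&' inside a word line is never interpreted
                "&\tCC",
                "<\tSYM",    // a '<' token is not a tag line
                "</s>",
                "</text>");
        System.out.println(scan(lines)); // prints {text=[id, lang], s=[]}
    }
}
```

Because word lines are never handed to an XML parser, characters such as "&" or "<" on them cannot trigger an entity error.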