Bug #716
TBX: 0.7.5, Import CWB, entities
Status: | New | Start date: | 24/03/2014 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % done: | 0% |
Category: | Import | Time spent: | - |
Target version: | TXM 0.X.X | | |
Description
To find the structures to declare in the registry file, the WTC file is parsed as XML.
If the WTC file contains a bare "&", parsing fails with an entity error:
Message: The entity name must immediately follow the '&' in the entity reference.
    at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:592)
    at org.txm.importer.cwb.BuildCwbEncodeArgs.process(BuildCwbEncodeArgs.java:99)
    at org.txm.importer.cwb.BuildCwbEncodeArgs$process.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:45)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:108)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:120)
    at org.txm.importer.wtc.compiler.run(compiler.groovy:144)
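The failure is not specific to Xerces: any conforming XML parser rejects a bare "&" in character data. A minimal sketch in Python (standard-library parser, not the TXM code; `parses_as_xml` is a hypothetical helper, and the exact error message differs from the one above):

```python
import xml.etree.ElementTree as ET


def parses_as_xml(text):
    """Return True if `text` is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # A bare '&' that does not start an entity reference ends up here.
        return False


escaped = "<s>\nAT&amp;T\n</s>"  # what an XML parser expects
raw = "<s>\nAT&T\n</s>"          # what a CWB input file may legitimately contain
print(parses_as_xml(escaped), parses_as_xml(raw))
```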
History
#1 Updated by Serge Heiden more than 11 years ago
The CWB file, as the ticket title names it (what you call the "WTC" file), should not be parsed as XML
because the CWB format is not based on the XML standard at all:
- the CWB file format only loosely mimics XML (whitespace between attribute-value pairs is significant, etc.)
- the CWB format allows any XML special character to appear anywhere on a word line (without any XML escaping convention)
- no XML processing instructions are allowed
- no xi:include elements either
- etc.
The code must only mimic simple XML element parsing, and only for XML-like lines.
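One way to read this suggestion is a plain line-by-line scan that recognises tag-like lines with a regular expression and never interprets word lines. The sketch below is in Python rather than the importer's Groovy and is not the TXM implementation; `collect_structures` and both patterns are hypothetical. It also assumes that a structure tag occupies a whole line and that attribute values are double-quoted.

```python
import re

# Assumption: a tag-like line is exactly "<name ...>" or "</name>", nothing else.
TAG_LINE = re.compile(r'^<(/?)([A-Za-z_][\w.-]*)((?:\s+[^<>]*)?)>\s*$')
# Assumption: attributes are written name="value" with double quotes.
ATTR = re.compile(r'([A-Za-z_][\w.-]*)="([^"]*)"')


def collect_structures(lines):
    """Map each structure name to the set of attribute names seen on it."""
    structures = {}
    for line in lines:
        match = TAG_LINE.match(line.rstrip("\n"))
        if match is None:
            # Word line: never interpreted, so '&' or '<' cannot cause an error.
            continue
        closing, name, attr_text = match.groups()
        attrs = structures.setdefault(name, set())
        if not closing:
            attrs.update(attr_name for attr_name, _ in ATTR.findall(attr_text))
    return structures


sample = [
    '<text id="t1" lang="en">',
    "<s>",
    "AT&T\tNP\tAT&T",   # bare '&' on a word line
    "<\tSYM\t<",        # bare '<' as a token
    "</s>",
    "</text>",
]
print(collect_structures(sample))
```

The resulting names and attributes would be enough to list the structures to declare in the registry file. A token that happens to look exactly like a tag (for example a word line containing only `<b>`) would still be taken for a structure, so a real implementation would need to decide how to treat that case.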