Bug #1285

RCP: 0.7.7, importing a wrong XML format/structure with the XML-TXM TEI importer crashes cwb-encode without any message

Added by Sebastien Jacquot almost 9 years ago. Updated almost 8 years ago.

Status: New
Start date: 03/25/2015
Priority: Normal
Due date:
Assignee: -
% Done:
Category: Import
Spent time: -
Target version: TXM X.X


Importing a document with a wrong XML format or structure with the XML-TXM TEI importer crashes cwb-encode without any useful message. The same behavior may be present in all importers.
We could improve this by checking and validating the XML document structure (using a DTD?) before starting the import(s).
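
As a minimal sketch of such a pre-import check, the following Python function tests only XML well-formedness and returns a readable diagnostic with the error position. The function name `check_well_formed` is hypothetical and not part of TXM. Validation against a DTD is not covered here, because the Python standard library parser does not validate; that step would need a validating parser.

```python
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Return None if the file parses as XML, else a diagnostic string.

    Hypothetical helper: it checks well-formedness only, not DTD validity.
    """
    try:
        ET.parse(path)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple
        line, column = err.position
        return f"{path}: not well-formed XML at line {line}, column {column}: {err}"
    except OSError as err:
        return f"{path}: cannot read file: {err}"
    return None
```

An importer could call this on every source file and stop with the returned message before cwb-encode is ever started.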


#1 Updated by Serge Heiden almost 8 years ago

Currently, there aren't any sanity checks done on data sources.

Import modules should try to build a consistent internal corpus representation. Some normalizations are already done to achieve that, but no diagnostic messages are given for them.

We could use at least two different strategies to make things better:

  • a) develop a new public/private 'Lint' command that checks the format and semantics of data sources and can be used independently of import modules (the TEI consortium may develop such a strategy, even if it is difficult to say which semantics should be checked in such a general context)
  • b) integrate sanity checks at all levels of all import modules, with optional verbosity and an optional abort when a criticality threshold is reached
Sanity checks could be progressive:
  • TXT: diagnose character encoding management (the encoding declaration could be wrong)
  • TT: diagnose language model usage (language declaration)
  • XML: diagnose XML syntax conformity
  • XML: diagnose XML schema validity
  • XML-TEI-TXM: diagnose the semantics of some XML-TEI-TXM elements
    • related to milestones: page numbering, verse numbering...
    • related to lexical units: annotations integrity (pos, lemma...)
    • related to structural units: annotations integrity (numbering...)
    • related to textual units: metadata types and values integrity (dates...)
    • related to HTML edition rendering: pagination integrity, image links integrity, etc.
    • related to planes integration: out-of-text excerpt, language planes top vocabulary, section titles plane listing
    • etc.
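
The progressive checks and the abort threshold of strategy b) could be sketched as below, in Python for brevity. Everything in it is an assumption made for illustration: the names `Severity`, `Diagnostic` and `run_checks`, the three sample checks, and the simplified markup in which words are un-namespaced `w` elements carrying `pos` and `lemma` attributes. It is not the XML-TEI-TXM format nor an existing TXM API.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    ERROR = 3


@dataclass
class Diagnostic:
    severity: Severity
    check: str
    message: str


def check_encoding(data, declared):
    """TXT level: the declared encoding must actually decode the bytes."""
    try:
        data.decode(declared)
    except (UnicodeDecodeError, LookupError) as err:
        return [Diagnostic(Severity.ERROR, "encoding",
                           f"cannot decode as {declared}: {err}")]
    return []


def check_xml_syntax(data):
    """XML level: the document must be well-formed."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return [Diagnostic(Severity.ERROR, "xml-syntax", str(err))]
    return []


def check_word_annotations(data, required=("pos", "lemma")):
    """Lexical level (assumed markup): each <w> carries the required attributes."""
    diagnostics = []
    for index, word in enumerate(ET.fromstring(data).iter("w"), start=1):
        missing = [name for name in required if not word.get(name)]
        if missing:
            diagnostics.append(Diagnostic(
                Severity.WARNING, "annotations",
                f"word #{index} is missing {', '.join(missing)}"))
    return diagnostics


def run_checks(data, declared_encoding="utf-8", abort_at=Severity.ERROR):
    """Run the checks in order; stop after a check that reaches abort_at."""
    checks = [
        lambda: check_encoding(data, declared_encoding),
        lambda: check_xml_syntax(data),
        lambda: check_word_annotations(data),
    ]
    diagnostics = []
    for check in checks:
        found = check()
        diagnostics.extend(found)
        if any(d.severity >= abort_at for d in found):
            break  # later checks would run on unreliable input
    return diagnostics


if __name__ == "__main__":
    sample = b"<text><w pos='N' lemma='cat'>cats</w><w pos='V'>run</w></text>"
    for diagnostic in run_checks(sample):
        print(diagnostic.severity.name, diagnostic.check, diagnostic.message)
```

Ordering the checks from encoding to syntax to semantics means each level only runs on input that passed the previous one, and the same functions could back either a standalone lint command (strategy a) or calls made from inside an import module (strategy b).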
