Bug #1285
RCP: 0.7.7, importing a wrong XML format/structure with the XML-TXM TEI importer crashes cwb-encode without any message
| Status: | New | Start date: | 03/25/2015 |
|---|---|---|---|
| Priority: | Normal | Due date: | |
| Assignee: | - | % Done: | 0% |
| Category: | Import | Spent time: | - |
| Target version: | TXM X.X | | |
Description
Importing a document with a wrong XML format or structure using the XML-TXM TEI importer crashes cwb-encode without any useful message. The same behavior may be present in all importers.
We could improve this by checking and validating the XML document structure (using a DTD?) before starting the import.
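
As an illustration of such a pre-import check, here is a minimal sketch in Python that covers only XML well-formedness, using the standard library. It is not TXM code: the function name `check_well_formed` is hypothetical, and validating against a DTD would additionally require a validating parser, which is not shown here.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed XML, else a readable message.

    Hypothetical helper: it only checks syntax, not validity against a DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple for the first syntax error.
        line, column = err.position
        return f"not well-formed XML at line {line}, column {column}: {err}"
    return None


good = "<text><p>ok</p></text>"
bad = "<text><p>unclosed</text>"
print(check_well_formed(good))
print(check_well_formed(bad))
```

Running a check like this before cwb-encode is invoked would let the importer stop with a message that points at the offending line instead of failing silently later.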
History
#1 Updated by Serge Heiden almost 8 years ago
Currently, there aren't any sanity checks done on data sources.
Import modules should try to build a consistent internal corpus representation, but even when normalizations are applied to achieve this, no diagnostic messages are reported for them.
We could use at least two different strategies to make things better:
- a) develop a new public/private 'Lint' command that checks the format and semantics of data sources and can be used independently of import modules (the TEI consortium may develop such a strategy, even if it is difficult to say which semantics should be checked in such a general context)
- b) integrate sanity checks at all levels of all import modules, with optional verbosity and an optional abort when a criticality threshold is reached
  - TXT: diagnose character encoding management (the encoding declaration could be wrong)
  - TT: diagnose language model usage (language declaration)
  - XML: diagnose XML syntax conformity
  - XML: diagnose XML schema validity
  - XML-TEI-TXM: diagnose the semantics of some XML-TEI-TXM elements
    - related to milestones: page numbering, verse numbering...
    - related to lexical units: annotation integrity (pos, lemma...)
    - related to structural units: annotation integrity (numbering...)
    - related to textual units: integrity of metadata types and values (dates...)
    - related to HTML edition rendering: pagination integrity, image link integrity, etc.
    - related to planes integration: out-of-text excerpt, language planes top vocabulary, section titles plane listing
    - etc.
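
To make strategy b) more concrete, here is a rough Python sketch of diagnostics with a severity level, optional verbosity and an optional abort threshold. Everything in it is an assumption made for illustration, not existing TXM code: the severity scale, the `Diagnostic` and `ImportAborted` names, and the two example checks (a TXT encoding check and a page-numbering check on TEI `<pb n="...">` milestones).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical severity scale for this sketch.
INFO, WARNING, ERROR = 0, 1, 2


@dataclass
class Diagnostic:
    severity: int
    message: str


class ImportAborted(Exception):
    """Raised when a diagnostic reaches the configured abort threshold."""


def check_encoding(raw, declared):
    """TXT check: do the bytes actually decode with the declared encoding?"""
    try:
        raw.decode(declared)
    except UnicodeDecodeError as err:
        return [Diagnostic(ERROR, f"bytes are not valid {declared} at offset {err.start}")]
    return []


def check_page_numbers(root):
    """Milestone check: <pb n="..."> values should be increasing integers."""
    found = []
    previous = None
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "pb":  # ignore any namespace part
            continue
        n = elem.get("n")
        if n is None or not n.isdecimal():
            found.append(Diagnostic(WARNING, f"page break with non-numeric n={n!r}"))
            continue
        if previous is not None and int(n) <= previous:
            found.append(Diagnostic(WARNING, f"page {n} follows page {previous}"))
        previous = int(n)
    return found


def run_checks(diagnostics, abort_at=ERROR, verbose=True):
    """Print diagnostics if verbose; abort when one reaches abort_at (None = never)."""
    for d in diagnostics:
        if verbose:
            print(f"[{d.severity}] {d.message}")
        if abort_at is not None and d.severity >= abort_at:
            raise ImportAborted(d.message)
    return diagnostics


# Page numbering problems are only warnings here, so the import would continue.
root = ET.fromstring('<text><pb n="1"/><pb n="3"/><pb n="2"/></text>')
run_checks(check_page_numbers(root))
```

With this design, each import module would only need to produce `Diagnostic` objects; whether they are printed and whether they stop the import is decided in one place through `verbose` and `abort_at`.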