Feature #1548: RCP: X.X, add XTZ import module - Plateforme TXM - Forge du Centre Blaise Pascal

Feature #1548

Updated by Matthieu Decorde almost 10 years ago

* copy XML/w+CSV import to XTZ+CSV
** entry menu
** scripts in scripts/import
* add new source directory sub-directories management
** 'dtd' sub-directory contains the dtd files to use with XSLs.
*** See http://docs.oracle.com/javase/7/docs/api/javax/xml/stream/XMLInputFactory.html
** 'css' sub-directory contains the css files to use with HTML pages in editions
*** MD: the pager must declare the css files in each HTML page with a path "css/cssfilename.css"
*** MD: the css directory must be copied next to the HTML pages for *each edition* (Groovy or XSL)
** 'xsl' sub-directory contains different types of XSL sub-directories (if a directory is absent or empty it is not used)
*** '1-split-merge' sub-sub-directory containing an XSL stylesheet used to split or merge source files to adapt them to the TXM corpus model (1 text = 1 file)
**** this XSL receives a "binary-src-dir-path" parameter with a path to write result files
**** the standard XSL output file of this stylesheet *is not used*
**** examples: split-texts.xsl or merge-files.xsl
*** '2-front' sub-sub-directory containing the XSL stylesheets to process the sources at the beginning of the import process (replaces the 'front XSL' section mechanism). The XSL are applied in the lexicographical order of their file names.
**** examples: txm-filter-teip5-xmlw-preserve.xsl
*** '3-posttok' sub-sub-directory containing the XSL stylesheets to process the xml-txm representation of the sources after the tokenization phase (all words are encoded). The XSL are applied in the lexicographical order of their file names.
**** examples: reduce-caesura.xsl, build-word-ref.xsl
*** '4-edition' sub-sub-directory containing the XSL stylesheets to build the HTML edition from the xml-txm representation using the pagination done by the pager. The XSL are applied in the lexicographical order of their file names.
**** example: in order, 1-default-html.xsl and 2-default-pager.xsl build the 'default' edition, then 3-facs-html.xsl and 4-facs-html.xsl build the facsimile edition that holds the images
**** all XSL receive the following parameters: "number-words-per-page", "pagination-element", "import-xml-path".
***** Note: these XSL parameters are not mandatory (MD: tested)
**** The XSL file writes the first word ID in each HTML file produced: <pre><meta name="description" content="{ID of the first word}"/></pre>. If there is no word in the page, then the "content" value is "w_0"
**** Their file name is used to name the edition produced
** all sub-directories are copied to the binary corpus
* modify the import form:
** add section "Plans textuels"
*** list of tags encoding out-of-text content (neither indexed nor edited) (transform to Regexp)
**** MD: added an import parameter "element.ignored.always" (previously set in properties files)
*** list of tags encoding out-of-text content to edit (displayed in the edition) (transform to Regexp)
**** MD: added an import parameter "element.edited.only" (previously set in properties files)
** remove "front XSL" section
*** note: "add parameter" is broken
** move "font" section after "editions" and before "commands"
** modify Éditions section
*** "Editions" -> "Éditions"
*** add 'images' URI declaration (see below)
<pre>
[x] Construire l'édition

Nombre de mots par page [500] Élément de pagination [pb]
Répertoire local d'images de facsimilés [...]
</pre>
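
The "transform to Regexp" step for the tag lists above could look roughly like the following Python sketch. It is only an illustration: the comma separator, the function name @tags_to_regex@ and the sample tag names are assumptions, not the actual format of "element.ignored.always" or "element.edited.only".

```python
import re

def tags_to_regex(tag_list: str) -> "re.Pattern":
    """Turn a comma-separated list of element names into one anchored regex.

    The comma separator is an assumption made for this sketch.
    """
    names = [name.strip() for name in tag_list.split(",") if name.strip()]
    if not names:
        # An empty list must match no element at all.
        return re.compile(r"(?!)")
    # Escape each name so characters such as '.' or '-' are matched literally.
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile("^(?:" + alternatives + ")$")

# Hypothetical value for an "element.ignored.always"-style parameter.
ignored = tags_to_regex("teiHeader, note , facsimile")
print(bool(ignored.match("note")))   # the whole element name is listed
print(bool(ignored.match("notes")))  # a longer name is not matched
```

Anchoring the pattern with @^@ and @$@ keeps an element such as @notes@ from being caught by the entry @note@.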

* transfer the edition macros into the XTZ import module
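
As an illustration of the 'xsl' sub-directory rules listed above (a stage directory that is absent or empty is not used; stylesheets are applied in the lexicographical order of their file names), here is a minimal Python sketch. It is not the TXM implementation: the function name @stylesheets_by_stage@ is hypothetical, and only the four directory names come from this ticket.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Stage directory names as listed in this ticket, in processing order.
STAGES = ["1-split-merge", "2-front", "3-posttok", "4-edition"]

def stylesheets_by_stage(xsl_dir: Path) -> Dict[str, List[Path]]:
    """Return the .xsl files of each stage, sorted by file name.

    Stages whose directory is absent or contains no stylesheet are left out.
    """
    plan: Dict[str, List[Path]] = {}
    for stage in STAGES:
        stage_dir = xsl_dir / stage
        if not stage_dir.is_dir():
            continue  # absent directory: stage not used
        sheets = sorted(
            (p for p in stage_dir.iterdir() if p.is_file() and p.suffix == ".xsl"),
            key=lambda p: p.name,
        )
        if sheets:  # empty directory: stage not used
            plan[stage] = sheets
    return plan

# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    xsl = Path(tmp) / "xsl"
    (xsl / "2-front").mkdir(parents=True)  # left empty on purpose
    (xsl / "4-edition").mkdir()
    for name in ["2-default-pager.xsl", "1-default-html.xsl"]:
        (xsl / "4-edition" / name).write_text("<!-- placeholder -->")
    for stage, sheets in stylesheets_by_stage(xsl).items():
        print(stage, [p.name for p in sheets])
```

Because the order depends only on file names, numeric prefixes such as @1-@ and @2-@ are a simple way to control the sequence.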

*Later*

Integrate the XMLText2MetadataCSV macro content to pull metadata from teiHeaders directly.
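
A rough idea of what pulling metadata from a teiHeader into CSV rows might involve, as a Python sketch. This does not reproduce the XMLText2MetadataCSV macro, whose behaviour is not described here: the chosen fields (title, author), the column names and the helper names are assumptions made for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def header_metadata(text_id: str, tei_xml: str) -> dict:
    """Read a few teiHeader fields from one TEI document (fields are assumed)."""
    root = ET.fromstring(tei_xml)
    base = "tei:teiHeader/tei:fileDesc/tei:titleStmt/"
    title = root.findtext(base + "tei:title", default="", namespaces=TEI_NS)
    author = root.findtext(base + "tei:author", default="", namespaces=TEI_NS)
    return {"id": text_id, "title": title.strip(), "author": author.strip()}

def to_csv(rows: list) -> str:
    """Serialize the metadata rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["id", "title", "author"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>Sample text</title><author>Anonymous</author>
  </titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello</p></body></text>
</TEI>"""

print(to_csv([header_metadata("sample", SAMPLE)]))
```

Missing header fields simply yield empty cells, so a text without an author still gets a row.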
