Feature #1535

TBX: X.X, import ODT/DOC/RTF

Added by Matthieu Decorde almost 4 years ago. Updated over 3 years ago.

Status: New
Start date: 09/29/2015
Priority: Normal
Due date:
Assignee: -
% Done: 80%
Category: Import
Spent time: -
Target version: TXM 0.7.8

Description

see: https://groupes.renater.fr/wiki/txm-info/public/import_odt

Abstract:
  • Very useful import for beginners
  • Import the most common document file formats: ODT, DOC and RTF
  • Allow the user to use a metadata.csv file
  • Import name: ODT/DOC/RTF + CSV
  • Menu entry: after the TXT + CSV import

Solution 1

  • Plug the current ODT/DOC/RTF prototype import into TXM even if there are still some minor bugs
  • Add metadata.csv support
  • Add a new section to the import modules section of the manual, describing what is transferred from the ODT document to the TXM corpus+edition
  • Test with the .odt version of the VOEUX sample corpus from the TXM import workshop support archive

Tuning console log messages

Note: This report applies to nearly all import modules console logs.

The comments below are interleaved with the console log of a sample ODT corpus import:

  • Le dossier des sources ne contient pas de fichier 'import.xml'. Un nouveau a été créé.
    
    This message is not useful: it concerns only the developer, because the user is not supposed to use that file directly. The user is concerned only if the 'import.xml' file cannot be created. -> remove?
    MD: replaced by a Log.info() message
  • Chargement des paramètres d'import depuis le fichier : /home/sheiden/Corpus/src/odtsample/import.xml
    
    This message should not be displayed if the previous one is.

I suggest removing it completely (it is not useful, because the user can see whether the parameters input form is correctly pre-filled).
MD: replaced by a Log.info() message
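
The "replaced by a Log.info() message" resolution amounts to demoting developer-oriented messages below the level shown in the user console. A minimal sketch of that idea follows, using Python's standard logging module as a stand-in for TXM's own Log class. The logger name, the WARNING console threshold and the wording of the failure message are assumptions made for illustration, not TXM's actual configuration.

```python
import io
import logging

def make_console_logger(stream, console_level=logging.WARNING):
    """Return a logger whose console handler hides messages below console_level."""
    logger = logging.getLogger("txm.import.sketch")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.DEBUG)  # accept every record; the handler filters
    handler = logging.StreamHandler(stream)
    handler.setLevel(console_level)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger

console = io.StringIO()  # stands in for the user-facing console
log = make_console_logger(console)

# Developer-oriented message from the log above: demoted to INFO, so hidden.
log.info("Chargement des paramètres d'import depuis le fichier : "
         "/home/sheiden/Corpus/src/odtsample/import.xml")
# Hypothetical user-relevant failure (wording invented for this sketch): shown.
log.warning("Could not create the 'import.xml' file.")

visible_lines = console.getvalue().splitlines()
```

With this arrangement the messages are not lost: lowering the console threshold to INFO brings them back for debugging.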

  • Sauvegarde des paramètres d'importation...
     Tokenizer parametrized with whitespaces=[\p{Z}\p{C}]+
     Tokenizer parametrized with regPunct=[\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}]
     Tokenizer parametrized with punct_strong=[.!?]+|\.\.|\.\.\.|…|\|
     Tokenizer parametrized with regElision=['‘’]
    
    These messages are not useful: they concern only the tokenizer developer, and other parameter values could just as well be displayed here -> remove
    MD: replaced by a Log.info() message
  • Execution du script : /home/sheiden/TXM/scripts/import/docLoader.groovy
    
    Unless the user is supposed to call that script directly at some point, this message is not useful -> remove
    MD: replaced by a Log.info() message
  • Converting DOC files to TEI
    .........
    
    I suggest introducing a new section header equivalent to "-- IMPORTER - Reading source files" below.

For example: "-- CONVERTER - Converting source files"
MD: ok for --CONVERTER

"Converting DOC files to TEI" should be replaced by "Converting DOC files to XML" (the sources are later imported by the XML/w+CSV module, not by a TEI import module, even if that XML is TEI-based).

  • Retrieve data folders and style files
    
    -> "Retrieving data folders and style files"
    MD: ok
  • [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/files-graf.xml...]
    
    -> "[TXM Leaflet FR.xml, atelier-txm.xml, tuto R TXM.xml, files-graf.xml...]"

Note: Currently, source file lists are sorted in different ways from one import module console message to the next -> all file name lists should be sorted alphabetically before being displayed, to get consistent logs.

(The real path to the corpora directory is not useful unless the user is supposed to use it. If needed, it could be displayed only once to keep the console messages readable. I suggest removing it from the log.)
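
The two suggestions above (base names only, sorted alphabetically, truncated with "...") can be sketched as a small helper. This is an illustration in Python, not TXM code: the function name `format_file_list` and the `max_shown` parameter are hypothetical.

```python
import os

def format_file_list(paths, max_shown=4):
    """Return base names sorted alphabetically, truncated after max_shown names."""
    names = sorted(os.path.basename(p) for p in paths)
    text = ", ".join(names[:max_shown])
    if len(names) > max_shown:
        text += "..."  # signal that the list was cut
    return text

# Paths taken from the sample log, deliberately given out of order.
paths = [
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/benzidane_150102a.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml",
]

print("Files processed: " + format_file_list(paths))
```

The plain `sorted()` call orders uppercase names before lowercase ones, which is the order used in the examples of this ticket. If a case-insensitive order is preferred, `sorted(..., key=str.casefold)` would be one way to get it.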

  • zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-graf.xml
    StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-graf.xml/style.css
    PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml
    PRINT FIRST NAMESPACE
    zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-tuto R TXM.xml
    StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-tuto R TXM.xml/style.css
    PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml
    PRINT FIRST NAMESPACE
    zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml
    StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml/style.css
    PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml
    PRINT FIRST NAMESPACE
    ...
    
    These console messages concern the developer only -> remove
    MD: ok
  • Setting new root element
    .........
    
    -> "Setting new XML root element"
    MD: ok
  • Removing tei:terms
    Filtering XML files with xpaths: [//tei:term]
    
    Replace the two lines by: "Filtering some XML elements by XPaths: //tei:term."
    MD: ok
  • Trying to read metadata from: /home/sheiden/Corpus/src/odtsample/metadata.csv
    
    Already reported in another ticket: this message is not useful -> remove
    MD: ok
  • no metadata file: /home/sheiden/Corpus/src/odtsample/metadata.csv
    
    -> "No /home/sheiden/Corpus/src/odtsample/metadata.csv metadata file found."
    MD: ok
  • -- IMPORTER - Reading source files
    Sources clean & validation
    .........
    
    "Sources clean & validation" -> "Sources cleaning & validation"
    MD: ok
  • Files processed: [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml...]
    
    -> "Files processed: CMSMcQ-report-1996-05-23.xml, Rapport.xml, TXM Leaflet FR.xml, atelier-txm.xml..."
  • Tokenizing 9 files
    .........
    Building XML-TXM (9 files)
    .........
    
    "Building XML-TXM (9 files)" -> "Building XML-TXM pivot representation (9 files)"
  • -- INJECTING METADATA - from csv file: /home/sheiden/Corpus/src/odtsample/metadata.csv
    
    This message should not be displayed if there is no metadata.csv file.
    MD: ok
  • Applying fr.par TreeTagger model on dir: /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger (9 files)
    .........
    
    -> "Applying fr.par TreeTagger model on /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger directory files (9 files)"
  • Building stdoff files (9) from dir:/home/sheiden/TXM/corpora/ODTSAMPLE/treetagger to /home/sheiden/TXM/corpora/ODTSAMPLE/annotations
    .........
    
    -> "Building standoff representation for /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger directory files (9) in /home/sheiden/TXM/corpora/ODTSAMPLE/annotations directory"
  • Injecting stdoff files (9) data from /home/sheiden/TXM/corpora/ODTSAMPLE/annotations to xml-txm files of /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE
    .........
    
    -> "Injecting standoff data from /home/sheiden/TXM/corpora/ODTSAMPLE/annotations directory files (9) in /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE directory XML-TXM files"
  • Compiling 9 [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/benzidane_150102a.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/participants 23-05-2014.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/wyeiwyg.xml] 
    .........
    
    -> "Compiling 9 files: CMSMcQ-report-1996-05-23.xml, Rapport.xml, TXM Leaflet FR.xml, atelier-txm.xml..."
  • P-attributes: [id, frpos, frlemma, n, type]
    
    -> "Word properties: id, frpos, frlemma, n, type."
  • S-attributes: [anchor:0+id+n, body:0+n, div:1+rend+type+n, emph:0+n, figure:0+n, graphic:0+url+n, head:1+n, hi:0+rend+n, item:8+n, lb:0+n, list:9+type+n, p:0+rend+n, pb:0+n, ptr:0+target+n, ref:0+target+n, text:0+id+base+project, txmcorpus:0+lang]
    
    -> "Structure properties: anchor@id+n, body@n, div@rend+type+n, emph@n, figure@n, graphic@url+n..."
  • -- EDITION - Building edition
    Paginating texts: 
    .........Copying internal images...
    ./home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml/Pictures
    ./home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures
    
    ->
    -- EDITION - Building edition
    Paginating texts: 
    .........
    Copying internal images...
    CMSMcQ-report-1996-05-23.xml/Pictures, Rapport.xml/Pictures...
    
  • Fail to copy /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures to /home/sheiden/TXM/corpora/ODTSAMPLE/HTML/ODTSAMPLE/default/Pictures
    
    -> "Failed to copy /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures to /home/sheiden/TXM/corpora/ODTSAMPLE/HTML/ODTSAMPLE/default/Pictures"
  • Importation terminée : 20 sec (20162 ms)
    
  • Moteur de recherche lancé.
    
    -> remove?
  • Moteur statistique lancé.connecté.
    
    -> remove?
  • Chargement des sous-corpus et des partitions...Terminé.
    
  • TXM est prêt.
    
    -> "Le corpus ODTSAMPLE est prêt."
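
The rewrites proposed above for the "P-attributes" and "S-attributes" lines can be sketched as follows. This is a hedged illustration in Python rather than TXM code: the function names are hypothetical, and the number after the colon in each S-attribute declaration is assumed to be a nesting level that the user-facing message can drop.

```python
def format_word_properties(names):
    """Render word property names as the proposed user-facing line."""
    return "Word properties: " + ", ".join(names) + "."

def format_structure(spec):
    """Turn a declaration such as 'div:1+rend+type+n' into 'div@rend+type+n'."""
    name, _, rest = spec.partition(":")
    props = rest.split("+")[1:]  # drop the first field (assumed nesting level)
    if not props:
        return name
    return name + "@" + "+".join(props)

def format_structures(specs, max_shown=6):
    """Render the first max_shown structures, with '...' if the list was cut."""
    text = ", ".join(format_structure(s) for s in specs[:max_shown])
    if len(specs) > max_shown:
        text += "..."
    return text

# Declarations copied from the sample log (first eight entries).
s_attributes = [
    "anchor:0+id+n", "body:0+n", "div:1+rend+type+n", "emph:0+n",
    "figure:0+n", "graphic:0+url+n", "head:1+n", "hi:0+rend+n",
]

print(format_word_properties(["id", "frpos", "frlemma", "n", "type"]))
print(format_structures(s_attributes))
```

Keeping the conversion in one place would make it easy to apply the same wording to every import module that prints these two lines.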

Related issues

related to Support #692: QB: Transcription document to Transcriber macro, Mac OS X... New 03/20/2014

History

#1 Updated by Serge Heiden almost 4 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde almost 4 years ago

  • % Done changed from 0 to 80

#3 Updated by Serge Heiden almost 4 years ago

  • Description updated (diff)
  • % Done changed from 80 to 70

#4 Updated by Matthieu Decorde almost 4 years ago

  • Description updated (diff)

#5 Updated by Matthieu Decorde over 3 years ago

  • % Done changed from 70 to 80
