Feature #1535

Updated by Matthieu Decorde almost 4 years ago

see: https://groupes.renater.fr/wiki/txm-info/public/import_odt

Abstract:
* Very useful import for beginners
* Import the most common document file formats: ODT, DOC and RTF
* Allow the user to use a metadata.csv file
* Import name : ODT/DOC/RTF + CSV
* entry menu : after the TXT + CSV import

h3. Solution 1

* Plug the current ODT/DOC/RTF prototype import into TXM even if there are still some minor bugs
* add metadata.csv support
* add a new section to the import modules section in the manual, describing what is transferred from the ODT document to the TXM corpus+edition
* test the VOEUX in .odt sample corpus from the TXM import workshop support archive

h3. Tuning console log messages

*Note*: This report applies to nearly all import modules console logs.

The remarks below are interleaved with the console log of a sample ODT corpus import:

* <pre>
Le dossier des sources ne contient pas de fichier 'import.xml'. Un nouveau a été créé.
</pre> This message is not useful: it concerns the developer only, because the user is not supposed to use that file directly. The user is concerned only if the 'import.xml' file cannot be created -> remove?
*MD*: replaced by a Log.info() message


* <pre>
Chargement des paramètres d'import depuis le fichier : /home/sheiden/Corpus/src/odtsample/import.xml
</pre> This message should not be displayed if the previous one is.

I suggest removing it completely (it is not useful, because the user can see whether the parameters input form is correctly pre-set or not).
*MD*: replaced by a Log.info() message


* <pre>
Sauvegarde des paramètres d'importation...
Tokenizer parametrized with whitespaces=[\p{Z}\p{C}]+
Tokenizer parametrized with regPunct=[\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}]
Tokenizer parametrized with punct_strong=[.!?]+|\.\.|\.\.\.|…|\|
Tokenizer parametrized with regElision=['‘’]
</pre> These messages are not useful (they concern the developer only, and they cover only the tokenizer although other parameter values could equally be displayed here) -> remove
*MD*: replaced by a Log.info() message


* <pre>
Execution du script : /home/sheiden/TXM/scripts/import/docLoader.groovy
</pre> Unless the user is supposed to call that script directly at some point, this message is not useful -> remove
*MD*: replaced by a Log.info() message
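
The four replacements above all move a developer-oriented message out of the user console and into the log. As a rough illustration of that split, here is a minimal Python sketch using the standard @logging@ module. It is not TXM code: the logger name, the @build_logger@ helper and the choice of WARNING as the console threshold are assumptions made for the example.

```python
import io
import logging

def build_logger(console_stream, log_stream):
    """Return a logger whose console shows WARNING and above, while the log keeps INFO and above."""
    logger = logging.getLogger("import_sketch")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False

    console = logging.StreamHandler(console_stream)
    console.setLevel(logging.WARNING)  # assumed threshold for the user console
    detailed = logging.StreamHandler(log_stream)
    detailed.setLevel(logging.INFO)    # developer-facing log keeps the details
    for handler in (console, detailed):
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger

console_out, log_out = io.StringIO(), io.StringIO()
logger = build_logger(console_out, log_out)
logger.info("Execution du script : docLoader.groovy")   # developer detail, log only
logger.warning("No metadata.csv metadata file found.")  # reaches the console too
```

With this setup the script path ends up only in @log_out@, while the second message appears in both streams.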


* <pre>
Converting DOC files to TEI
.........
</pre> I suggest introducing a new section, equivalent to "-- IMPORTER - Reading source files" below.

For example: "-- CONVERTER - Converting source files"
*MD*: ok for --CONVERTER


"Converting DOC files to TEI" should be replaced by "Converting DOC files to XML" (sources are later imported by XML/w+CSV, not by a TEI import module - even if that XML is TEI based).

* <pre>
Retrieve data folders and style files
</pre> -> "Retrieving data folders and style files"
*MD*: ok


* <pre>
[/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/files-graf.xml...]
</pre> -> "[TXM Leaflet FR.xml, atelier-txm.xml, tuto R TXM.xml, files-graf.xml...]"

*Note*: Currently, source file lists are sorted in different ways from one import module console message to the next -> all file name lists should be sorted alphabetically before being displayed, to get consistent logs.

(The real path to the corpora directory is not useful unless the user is supposed to use it. If needed, it could be displayed only once to keep the console messages readable. I suggest removing it from the log.)
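
The two suggestions above (base names only, sorted consistently) can be sketched as a small helper. This is only an illustration in Python, not the TXM implementation: the function name @format_file_list@ and its @limit@ parameter are hypothetical. The sketch uses Python's default string ordering, which places uppercase names before lowercase ones, as in the "Files processed" log excerpt; a case-insensitive order would need @key=str.casefold@.

```python
from pathlib import PurePosixPath

def format_file_list(paths, limit=4):
    """Return '[a.xml, b.xml...]' from full paths: base names only, sorted, truncated."""
    # Keep only the file name, dropping the corpora directory path.
    names = sorted(PurePosixPath(p).name for p in paths)
    shown = ", ".join(names[:limit])
    # Mark the truncation only when some names are actually omitted.
    suffix = "..." if len(names) > limit else ""
    return f"[{shown}{suffix}]"

paths = [
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/files-graf.xml",
]
print(format_file_list(paths, limit=3))
```

Calling the same helper from every import step would give the same ordering in each message.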

* <pre>
zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-graf.xml
StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-graf.xml/style.css
PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml
PRINT FIRST NAMESPACE
zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-tuto R TXM.xml
StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-tuto R TXM.xml/style.css
PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml
PRINT FIRST NAMESPACE
zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml
StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml/style.css
PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml
PRINT FIRST NAMESPACE
...
</pre> These console messages concern the developer only -> remove
*MD*: ok


* <pre>
Setting new root element
.........
</pre> -> "Setting new XML root element"
*MD*: ok


* <pre>
Removing tei:terms
Filtering XML files with xpaths: [//tei:term]
</pre> Replace the two lines by: "Filtering some XML elements by XPaths: //tei:term."
*MD*: ok


* <pre>
Trying to read metadata from: /home/sheiden/Corpus/src/odtsample/metadata.csv
</pre> Already reported in another ticket: this message is not useful -> remove
*MD*: ok


* <pre>
no metadata file: /home/sheiden/Corpus/src/odtsample/metadata.csv
</pre> -> "No /home/sheiden/Corpus/src/odtsample/metadata.csv metadata file found."
*MD*: ok


* <pre>
-- IMPORTER - Reading source files
Sources clean & validation
.........
</pre> "Sources clean & validation" -> "Sources cleaning & validation"
*MD*: ok


* <pre>
Files processed: [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml...]
</pre> -> "Files processed: CMSMcQ-report-1996-05-23.xml, Rapport.xml, TXM Leaflet FR.xml, atelier-txm.xml..."

* <pre>
Tokenizing 9 files
.........
Building XML-TXM (9 files)
.........
</pre> "Building XML-TXM (9 files)" -> "Building XML-TXM pivot representation (9 files)"

* <pre>
-- INJECTING METADATA - from csv file: /home/sheiden/Corpus/src/odtsample/metadata.csv
</pre> This message should not be displayed if there is no metadata.csv file.
*MD*: ok
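
The metadata remarks above amount to a simple rule: announce the injection step only when the file exists, otherwise print the single "not found" message. A minimal Python sketch of that rule follows, assuming a hypothetical @metadata_messages@ helper. The message texts are the ones proposed in this ticket; nothing here is taken from the TXM sources.

```python
import tempfile
from pathlib import Path

def metadata_messages(source_dir):
    """Return the console messages to display for the metadata step."""
    metadata = Path(source_dir) / "metadata.csv"
    if metadata.is_file():
        # The injection header is shown only when there is something to inject.
        return [f"-- INJECTING METADATA - from csv file: {metadata}"]
    return [f"No {metadata} metadata file found."]

with tempfile.TemporaryDirectory() as src:
    without_file = metadata_messages(src)
    (Path(src) / "metadata.csv").write_text("id,title\n", encoding="utf-8")
    with_file = metadata_messages(src)
```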


* <pre>
Applying fr.par TreeTagger model on dir: /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger (9 files)
.........
</pre> -> "Applying fr.par TreeTagger model on /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger directory files (9 files)"

* <pre>
Building stdoff files (9) from dir:/home/sheiden/TXM/corpora/ODTSAMPLE/treetagger to /home/sheiden/TXM/corpora/ODTSAMPLE/annotations
.........
</pre> -> "Building standoff representation for /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger directory files (9) in /home/sheiden/TXM/corpora/ODTSAMPLE/annotations directory"

* <pre>
Injecting stdoff files (9) data from /home/sheiden/TXM/corpora/ODTSAMPLE/annotations to xml-txm files of /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE
.........
</pre> -> "Injecting standoff data from /home/sheiden/TXM/corpora/ODTSAMPLE/annotations directory files (9) in /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE directory XML-TXM files"

* <pre>
Compiling 9 [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/benzidane_150102a.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/participants 23-05-2014.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/wyeiwyg.xml]
.........
</pre> -> "Compiling 9 files: CMSMcQ-report-1996-05-23.xml, Rapport.xml, TXM Leaflet FR.xml, atelier-txm.xml..."

* <pre>
P-attributes: [id, frpos, frlemma, n, type]
</pre> -> "Word properties: id, frpos, frlemma, n, type."

* <pre>
S-attributes: [anchor:0+id+n, body:0+n, div:1+rend+type+n, emph:0+n, figure:0+n, graphic:0+url+n, head:1+n, hi:0+rend+n, item:8+n, lb:0+n, list:9+type+n, p:0+rend+n, pb:0+n, ptr:0+target+n, ref:0+target+n, text:0+id+base+project, txmcorpus:0+lang]
</pre> -> "Structure properties: anchor@id+n, body@n, div@rend+type+n, emph@n, figure@n, graphic@url+n..."
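
The two rewrites above (P-attributes and S-attributes) follow a mechanical pattern, sketched here in Python. This is an illustration under stated assumptions, not TXM code: the function names are hypothetical, and the sketch simply drops the numeric field that follows the colon, as the proposed message does, without interpreting it.

```python
def format_structure(spec):
    """Turn 'div:1+rend+type+n' into 'div@rend+type+n'."""
    name, _, rest = spec.partition(":")
    # The first '+'-separated field is the number that the proposed message drops.
    properties = rest.split("+")[1:]
    return name + "@" + "+".join(properties) if properties else name

def format_structures(specs, limit=6):
    """Join the first `limit` structures, adding '...' when the list is truncated."""
    shown = ", ".join(format_structure(s) for s in specs[:limit])
    return shown + ("..." if len(specs) > limit else "")

def format_word_properties(names):
    """Render the P-attributes list as the proposed 'Word properties' message."""
    return "Word properties: " + ", ".join(names) + "."

specs = ["anchor:0+id+n", "body:0+n", "div:1+rend+type+n"]
print(format_structures(specs, limit=2))
print(format_word_properties(["id", "frpos", "frlemma", "n", "type"]))
```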

* <pre>
-- EDITION - Building edition
Paginating texts:
.........Copying internal images...
./home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml/Pictures
./home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures
</pre> ->
<pre>
-- EDITION - Building edition
Paginating texts:
.........
Copying internal images...
CMSMcQ-report-1996-05-23.xml/Pictures, Rapport.xml/Pictures...
</pre>

* <pre>
Fail to copy /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures to /home/sheiden/TXM/corpora/ODTSAMPLE/HTML/ODTSAMPLE/default/Pictures
</pre> -> "Failed to copy /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures to /home/sheiden/TXM/corpora/ODTSAMPLE/HTML/ODTSAMPLE/default/Pictures"

* <pre>
Importation terminée : 20 sec (20162 ms)
</pre>

* <pre>
Moteur de recherche lancé.
</pre> -> remove?

* <pre>
Moteur statistique lancé.connecté.
</pre> -> remove?

* <pre>
Chargement des sous-corpus et des partitions...Terminé.
</pre>

* <pre>
TXM est prêt.
</pre> -> "Le corpus ODTSAMPLE est prêt."
