Feature #1535
TBX: X.X, import ODT/DOC/RTF
Status: | New | Start date: | 09/29/2015 |
---|---|---|---|
Priority: | Normal | Due date: | |
Assignee: | - | % Done: | 80% |
Category: | Import | Spent time: | - |
Target version: | TXM 0.7.8 | | |
Description
see: https://groupes.renater.fr/wiki/txm-info/public/import_odt
Abstract:
- Very useful import for beginners
- Import the most common document file format: ODT, DOC and RTF
- Allow the user to supply a metadata.csv file
- Import name: ODT/DOC/RTF + CSV
- Menu entry: placed after the TXT + CSV import
Solution 1
- Plug the current ODT/DOC/RTF prototype import into TXM, even if some minor bugs remain
- Add metadata.csv support
- Add a new section to the import modules section of the manual, describing what is transferred from the ODT document to the TXM corpus and edition
- Test with the VOEUX sample corpus in .odt format from the TXM import workshop support archive
Tuning console log messages
Note: This report applies to nearly all import module console logs.
The comments below are interleaved with the console log of a sample ODT corpus import:
Le dossier des sources ne contient pas de fichier 'import.xml'. Un nouveau a été créé.
This message is not useful: it concerns the developer only, because the user is not supposed to use that file directly. The user is concerned only if the 'import.xml' file cannot be created. -> remove?
MD: replaced by a Log.info() message
Chargement des paramètres d'import depuis le fichier : /home/sheiden/Corpus/src/odtsample/import.xml
This message should not be displayed if the previous one is.
I suggest removing it completely (it is not useful, because the user can see whether the parameters input form is correctly pre-set).
MD: replaced by a Log.info() message
Sauvegarde des paramètres d'importation...
Tokenizer parametrized with whitespaces=[\p{Z}\p{C}]+
Tokenizer parametrized with regPunct=[\p{Ps}\p{Pe}\p{Pi}\p{Pf}\p{Po}\p{S}]
Tokenizer parametrized with punct_strong=[.!?]+|\.\.|\.\.\.|…|\|
Tokenizer parametrized with regElision=['‘’]
These messages are not useful (they concern the developer only, and they cover only the tokenizer although other parameter values could equally be displayed here) -> remove
MD: replaced by a Log.info() message
Execution du script : /home/sheiden/TXM/scripts/import/docLoader.groovy
Unless the user is supposed to call that script directly at some point, this message is not useful -> remove
MD: replaced by a Log.info() message
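The idea behind moving these messages to an info-level log can be sketched with Python's standard `logging` module. This is only an illustration of the principle, not TXM's `Log` class: the logger name, the `verbose` switch and the warning text are hypothetical, and it assumes that info-level messages are hidden from the user console by default.

```python
import logging
import sys


def make_console_logger(verbose=False):
    """Build a logger whose INFO messages reach the console only when verbose."""
    logger = logging.getLogger("import-console-sketch")  # hypothetical name
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return logger


log = make_console_logger()
# Developer-oriented detail: hidden at the default level.
log.info("Execution du script : docLoader.groovy")
# User-facing problem (hypothetical wording): always shown.
log.warning("The 'import.xml' file could not be created.")
```

With `make_console_logger(verbose=True)` the developer messages become visible again, which keeps them available for debugging without cluttering the default console.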
Converting DOC files to TEI .........
I suggest introducing a new section equivalent to "-- IMPORTER - Reading source files" below.
For example: "-- CONVERTER - Converting source files"
MD: ok for --CONVERTER
"Converting DOC files to TEI" should be replaced by "Converting DOC files to XML" (sources are later imported by XML/w+CSV, not by a TEI import module - even if that XML is TEI based).
Retrieve data folders and style files
-> "Retrieving data folders and style files"
MD: ok
[/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/files-graf.xml...]
-> "[TXM Leaflet FR.xml, atelier-txm.xml, tuto R TXM.xml, files-graf.xml...]"
Note: Currently, source file lists are sorted in different ways from one import module console message to the next -> all file name lists should be sorted alphabetically before being displayed, to get consistent logs.
(The real path to the corpora directory is not useful unless the user is supposed to use it. If needed, it could be displayed only once to keep the console messages readable. I suggest removing it from the log.)
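The file list display suggested above (base names only, sorted, truncated) can be sketched as follows. This is a minimal Python illustration rather than TXM code; `format_file_list` and its `limit` parameter are hypothetical names. It uses plain code point ordering, which puts uppercase names first as in the "Files processed" line of the log; passing `key=str.casefold` to `sorted` would give a case-insensitive order instead.

```python
from pathlib import PurePosixPath


def format_file_list(paths, limit=4):
    """Return a short, sorted display of file base names."""
    # Keep only the base name: the corpora directory path is dropped.
    names = sorted(PurePosixPath(p).name for p in paths)
    shown = ", ".join(names[:limit])
    # Mark truncation the same way the console log does.
    return shown + "..." if len(names) > limit else shown


paths = [
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml",
    "/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml",
]
print("Files processed: " + format_file_list(paths, limit=2))
```

Using one helper for every message that lists files would give the same order and the same truncation everywhere in the log.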
zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-graf.xml
StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-graf.xml/style.css
PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml
PRINT FIRST NAMESPACE
zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-tuto R TXM.xml
StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-tuto R TXM.xml/style.css
PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml
PRINT FIRST NAMESPACE
zipdir /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml
StylesToCSS: /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml/style.css
PATCH : /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml
PRINT FIRST NAMESPACE
...
These console messages concern the developer only -> remove
MD: ok
Setting new root element .........
-> "Setting new XML root element"
MD: ok
Removing tei:terms
Filtering XML files with xpaths: [//tei:term]
Replace the two lines with: "Filtering some XML elements by XPaths: //tei:term."
MD: ok
Trying to read metadata from: /home/sheiden/Corpus/src/odtsample/metadata.csv
Already reported in another ticket: this message is not useful -> remove
MD: ok
no metadata file: /home/sheiden/Corpus/src/odtsample/metadata.csv
-> "No /home/sheiden/Corpus/src/odtsample/metadata.csv metadata file found."
MD: ok
-- IMPORTER - Reading source files
Sources clean & validation .........
"Sources clean & validation" -> "Sources cleaning & validation"
MD: ok
Files processed: [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml...]
-> "Files processed: CMSMcQ-report-1996-05-23.xml, Rapport.xml, TXM Leaflet FR.xml, atelier-txm.xml..."
Tokenizing 9 files .........
Building XML-TXM (9 files) .........
"Building XML-TXM (9 files)" -> "Building XML-TXM pivot representation (9 files)"
-- INJECTING METADATA - from csv file: /home/sheiden/Corpus/src/odtsample/metadata.csv
This message should not be displayed if there is no metadata.csv file.
MD: ok
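The metadata messages discussed above can be tied to a single check, sketched here in Python. This is a sketch under stated assumptions, not TXM's implementation: `load_metadata` is a hypothetical helper, and the file layout (comma-separated, UTF-8, with an `id` column identifying each text) is assumed for illustration.

```python
import csv
from pathlib import Path


def load_metadata(source_dir):
    """Return {text id: row}, or None when there is no metadata.csv file."""
    path = Path(source_dir) / "metadata.csv"
    if not path.is_file():
        # Single user-facing message; no "Trying to read..." line before it.
        print(f"No {path} metadata file found.")
        return None
    # The injection header is printed only when the file really exists.
    print(f"-- INJECTING METADATA - from csv file: {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    # Assumption: an "id" column holds the text identifier.
    return {row["id"]: row for row in rows}
```

Calling the later injection step only when the function returns a dictionary keeps the "-- INJECTING METADATA" header out of the log for sources without a metadata.csv file.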
Applying fr.par TreeTagger model on dir: /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger (9 files) .........
-> "Applying fr.par TreeTagger model on /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger directory files (9 files)"
Building stdoff files (9) from dir:/home/sheiden/TXM/corpora/ODTSAMPLE/treetagger to /home/sheiden/TXM/corpora/ODTSAMPLE/annotations .........
-> "Building standoff representation for /home/sheiden/TXM/corpora/ODTSAMPLE/treetagger directory files (9) in /home/sheiden/TXM/corpora/ODTSAMPLE/annotations directory"
Injecting stdoff files (9) data from /home/sheiden/TXM/corpora/ODTSAMPLE/annotations to xml-txm files of /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE .........
-> "Injecting standoff data from /home/sheiden/TXM/corpora/ODTSAMPLE/annotations directory files (9) in /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE directory XML-TXM files"
Compiling 9 [/home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/CMSMcQ-report-1996-05-23.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/Rapport.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/TXM Leaflet FR.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/atelier-txm.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/benzidane_150102a.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/graf.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/participants 23-05-2014.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/tuto R TXM.xml, /home/sheiden/TXM/corpora/ODTSAMPLE/txm/ODTSAMPLE/wyeiwyg.xml] .........
-> "Compiling 9 files: CMSMcQ-report-1996-05-23.xml, Rapport.xml, TXM Leaflet FR.xml, atelier-txm.xml..."
P-attributes: [id, frpos, frlemma, n, type]
-> "Word properties: id, frpos, frlemma, n, type."
S-attributes: [anchor:0+id+n, body:0+n, div:1+rend+type+n, emph:0+n, figure:0+n, graphic:0+url+n, head:1+n, hi:0+rend+n, item:8+n, lb:0+n, list:9+type+n, p:0+rend+n, pb:0+n, ptr:0+target+n, ref:0+target+n, text:0+id+base+project, txmcorpus:0+lang]
-> "Structures properties: anchor@id+n, body@n, div@rend+type+n, emph@n, figure@n, graphic@url+n..."
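The rewriting of structure declarations proposed above (dropping the numeric part after the colon and joining the name to its properties with `@`) can be sketched in Python. The function name and the `limit` parameter are hypothetical; the input format `name:number+prop+prop` is read off the "S-attributes" line of the log.

```python
def format_structures(declarations, limit=6):
    """Turn 'div:1+rend+type+n' style declarations into 'div@rend+type+n'."""
    shown = []
    for decl in declarations[:limit]:
        # Split 'name:number' from the '+'-separated property names.
        name_part, _, props = decl.partition("+")
        name = name_part.split(":", 1)[0]
        shown.append(f"{name}@{props}" if props else name)
    # Mark truncation the same way the console log does.
    suffix = "..." if len(declarations) > limit else ""
    return ", ".join(shown) + suffix


s_attributes = ["anchor:0+id+n", "body:0+n", "div:1+rend+type+n"]
print(format_structures(s_attributes, limit=2))
```

The same truncation limit could be shared with the file list messages so that all long lists in the log are shortened consistently.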
-- EDITION - Building edition Paginating texts: .........Copying internal images... ./home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-CMSMcQ-report-1996-05-23.xml/Pictures ./home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures
->-- EDITION - Building edition Paginating texts: ......... Copying internal images... CMSMcQ-report-1996-05-23.xml/Pictures, Rapport.xml/Pictures...
Fail to copy /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures to /home/sheiden/TXM/corpora/ODTSAMPLE/HTML/ODTSAMPLE/default/Pictures
-> "Failed to copy /home/sheiden/TXM/corpora/ODTSAMPLE/docfiles/files-Rapport.xml/Pictures to /home/sheiden/TXM/corpora/ODTSAMPLE/HTML/ODTSAMPLE/default/Pictures"
Importation terminée : 20 sec (20162 ms)
Moteur de recherche lancé.
-> remove?
Moteur statistique lancé.connecté.
-> remove?
Chargement des sous-corpus et des partitions...Terminé.
TXM est prêt.
-> "Le corpus ODTSAMPLE est prêt."
History
#1 Updated by Serge Heiden almost 8 years ago
- Description updated
#2 Updated by Matthieu Decorde almost 8 years ago
- % Done changed from 0 to 80
#3 Updated by Serge Heiden almost 8 years ago
- Description updated
- % Done changed from 80 to 70
#4 Updated by Matthieu Decorde almost 8 years ago
- Description updated
#5 Updated by Matthieu Decorde over 7 years ago
- % Done changed from 70 to 80