Task #961

Updated by Matthieu Decorde about 5 years ago

See https://groupes.renater.fr/wiki/txm-users/public/projets_matrice_avec_txm#mise_en_oeuvre
MD:


*Importing a big corpus*
* This import prototype works with Limsi transcription files
** I use a particular kind of tokenization
** This first prototype takes only the minimum data from the Limsi files: words and text structures
* There are few steps, to limit the number of files
* We don't produce XML files; we directly use the CQP corpus source format (tabulated). This is useful because TreeTagger uses it too.
* We don't produce Editions
* The corpus is tagged with TreeTagger
** The next optimization is to run a single instance of TreeTagger instead of one per text.
** Tagging is the longest step of the import
* To avoid redoing steps, file timestamps are compared before each operation
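
A minimal sketch of the timestamp check described in the last item, written in Python for illustration only (the prototype itself is not shown here). The function names @needs_rebuild@ and @run_step@ and the one-source-to-one-target layout are assumptions, not the actual import code:

```python
import os


def needs_rebuild(source_path, target_path):
    """Return True when the target is missing or older than its source."""
    if not os.path.exists(target_path):
        return True
    return os.path.getmtime(source_path) > os.path.getmtime(target_path)


def run_step(source_path, target_path, operation):
    """Run operation(source, target) only if the target is stale.

    Returns True if the operation was executed, False if it was skipped.
    """
    if needs_rebuild(source_path, target_path):
        operation(source_path, target_path)
        return True
    return False
```

With this kind of check, relaunching the import on an unchanged source file would skip every step whose output file is already newer than its input.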

*counting terms per document*
MD: I optimised the QueryIndex prototype to work with "text" structures and added a new export method.
The main optimisation is to count, CQL query by CQL query, the number of matches per text. This is fast because both the matches and the "text" structures are in corpus order.
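
The idea of counting over two corpus-ordered lists can be sketched as a single linear pass. This is an illustrative Python sketch, not the QueryIndex code: the function name @count_matches_per_text@ is hypothetical, and it assumes that match start positions are sorted and that "text" structures are given as sorted, non-overlapping (start, end) position ranges with inclusive bounds.

```python
def count_matches_per_text(match_starts, text_ranges):
    """Count matches falling inside each text, in one pass.

    match_starts: sorted corpus positions where a query matched.
    text_ranges: sorted, non-overlapping (start, end) pairs, inclusive.
    Returns a list of counts, one per text, in the same order.
    """
    counts = [0] * len(text_ranges)
    t = 0  # index of the current text; it only ever moves forward
    for pos in match_starts:
        # Skip texts that end before this match.
        while t < len(text_ranges) and text_ranges[t][1] < pos:
            t += 1
        if t == len(text_ranges):
            break  # no text left, so the remaining matches are outside all texts
        start, end = text_ranges[t]
        if start <= pos <= end:
            counts[t] += 1
    return counts


# Three texts of ten positions each, and six matches of one query.
texts = [(0, 9), (10, 19), (20, 29)]
matches = [1, 3, 12, 25, 27, 28]
print(count_matches_per_text(matches, texts))  # [2, 1, 3]
```

Because neither list is ever rescanned, the cost for one query is proportional to the number of matches plus the number of texts; repeating the pass for each CQL query gives the per-query, per-text table.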
