Bug #763

RCP: 0.7.5: Fix concordance export memory usage

Added by Matthieu Decorde over 5 years ago. Updated almost 5 years ago.

Status: Closed
Start date: 04/24/2014
Priority: Normal
Due date: -
Assignee: -
% Done: 100%
Category: Commands
Spent time: -
Target version: TXM 0.7.6

Description

The current Concordance export loads all the concordance lines into memory before writing them to the CSV file, which exhausts memory for very large concordance results.

Solution

A solution is to write the concordance lines in packets, so that earlier packets can be garbage collected and memory consumption stays level.

Two development steps:
A- Create a macro (to provide a quick answer)
B- Fix Concordance.toTxt(...) in the sources and release an update
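
The packet-based approach described above could look roughly like the following sketch. It is not the actual TXM code: `ChunkedExportSketch`, `LineSource`, `getLines(from, to)` and `export(...)` are hypothetical names introduced for illustration, and the real `Concordance` API may differ. The only point it illustrates is that a single packet of lines is referenced at any time while writing.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ChunkedExportSketch {

    /** Hypothetical stand-in for a concordance that can return the lines [from, to) on demand. */
    interface LineSource {
        int size();
        List<String> getLines(int from, int to);
    }

    /** Writes the header, then the lines packet by packet. Returns the number of lines written, header included. */
    static long export(LineSource source, Path out, String header, int packetSize) throws IOException {
        if (packetSize <= 0) {
            throw new IllegalArgumentException("packetSize must be positive");
        }
        long written = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            writer.write(header);
            writer.newLine();
            written++;
            for (int from = 0; from < source.size(); from += packetSize) {
                int to = Math.min(from + packetSize, source.size());
                // Only the current packet is referenced here; the previous one
                // is unreachable after this assignment and can be garbage collected.
                List<String> packet = source.getLines(from, to);
                for (String line : packet) {
                    writer.write(line);
                    writer.newLine();
                    written++;
                }
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Toy data held in memory for the demo only; a real source would compute each packet lazily.
        List<String> all = new ArrayList<>();
        for (int i = 0; i < 12000; i++) {
            all.add("left\tkeyword" + i + "\tright");
        }
        LineSource source = new LineSource() {
            public int size() { return all.size(); }
            public List<String> getLines(int from, int to) { return new ArrayList<>(all.subList(from, to)); }
        };
        Path out = Files.createTempFile("concordance", ".tsv");
        long written = export(source, out, "left\tkeyword\tright", 5000);
        System.out.println(written + " lines written to " + out);
    }
}
```

The packet size is a parameter here so that different values can be timed against each other; memory use is bounded by one packet plus the writer's buffer rather than by the whole result.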

Macro acceptance test

- Download the macro archive (attachment)
- Copy the "export" folder contained in the archive into the TXM macros folder ($TXMHOME/scripts/macro, not "macros"!)
- Run a concordance of "[]" on BROWN
- Run the macro on the concordance
- Check that the number of lines (wc -l) equals the number of results + 1 (the header)
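
Where `wc -l` is not available, the last check of the test above can be reproduced with a few lines of Java. This is only a sketch: `ExportLineCheck`, `countLines` and `matchesConcordance` are hypothetical names, and the file is assumed to be UTF-8 encoded. Unlike `wc -l`, which counts newline characters, `Files.lines` also counts a final line that lacks a trailing newline.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class ExportLineCheck {

    /** Counts the lines of the exported file without loading it entirely into memory. */
    static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.count();
        }
    }

    /** True when the file holds one header line plus one line per concordance result. */
    static boolean matchesConcordance(Path file, long resultCount) throws IOException {
        return countLines(file) == resultCount + 1;
    }
}
```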

Update acceptance test

- ...
- Run a concordance of "[]" on BROWN
- Run the macro on the concordance
- Check that the number of lines (wc -l) equals the number of results + 1 (the header)

History

#1 Updated by Matthieu Decorde over 5 years ago

Packet size will be set to 5000.

R code to plot the measured export time (in milliseconds) against the packet size (in lines), for a concordance of 900k lines on a standard Linux workstation:

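# Packet sizes tested (lines per packet)
size <- c(10,100,1000,5000,10000,50000,100000)
# Measured export times (milliseconds), in the same order as the sizes
time <- c(57152,50090,49592,49525,50394,53320,58106)
plot(size, time, type="p", xlab="packet size (lines)", ylab="export time (ms)")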

#2 Updated by Matthieu Decorde over 5 years ago

  • Description updated (diff)

#3 Updated by Serge Heiden over 5 years ago

  • Description updated (diff)

#4 Updated by Matthieu Decorde over 5 years ago

  • % Done changed from 0 to 50

A macro fixing the bug is currently under test.

#5 Updated by Alexey Lavrentev over 5 years ago

  • Description updated (diff)

#6 Updated by Sebastien Jacquot over 5 years ago

It worked very well with the "Brown" corpus.
Number of lines in concordance: 1 161 028
Number of lines in .tsv file: 1 161 029
Elapsed time: 106621 ms

#7 Updated by Matthieu Decorde over 5 years ago

  • % Done changed from 50 to 80

#8 Updated by Sebastien Jacquot about 5 years ago

  • % Done changed from 80 to 90

#9 Updated by Matthieu Decorde almost 5 years ago

  • % Done changed from 90 to 100

#10 Updated by Matthieu Decorde almost 5 years ago

  • Status changed from New to Closed
