Task #1630

TBX: improve performances of import process

Added by Sebastien Jacquot almost 3 years ago. Updated 2 months ago.

Status: New    Start date: 10/02/2016
Priority: Normal    Due date:
Assignee: -    % Done: 80%
Category: Development    Spent time: -
Target version: TXM 0.8.0

Description

The Groovy code of some import sections can probably be optimized.
Here is a list of proposals.

Buffer management

  • test whether StringBuilder (not thread-safe) is more efficient than the StringBuffer we currently use.
  • other tests:
      • do not instantiate a new buffer (new StringBuffer) on each parser event; reuse the same buffer and clear it (e.g. with setLength(0) or delete(0, length())).
      • all the write methods of the XML writer (writeCharacters(), etc.) seem to write directly to the output file. There may be a way to buffer the content instead of accessing the file on each StAX event. Buffering the whole output would also be safer, because the file would be created only if the whole process succeeded, rather than leaving a partial, invalid file. For example, wrapping the file stream in a BufferedOutputStream when creating the files may reduce disk accesses and be faster (see: http://tutorials.jenkov.com/java-io/bufferedoutputstream.html).

These tests showed no significant difference, but they were run only on a small corpus and only on the tokenization section.
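The buffer proposals above can be sketched in Java as follows. This is a minimal illustration, not the TXM code: the class name BufferedImportSketch, the process() method, the 64 KB buffer size and the upper-casing step (a stand-in for the real text processing) are all assumptions made for the example. It reuses one StringBuilder across StAX events, writes through a BufferedOutputStream, and moves the result to its final location only once the whole document has been processed.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Locale;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

public class BufferedImportSketch {

    /** Copies elements and text from xml to target, upper-casing the text (hypothetical processing step). */
    public static void process(String xml, Path target) throws IOException, XMLStreamException {
        // Write to a temporary file next to the target; the target only appears on success.
        Path tmp = Files.createTempFile(target.toAbsolutePath().getParent(), "import-", ".tmp");
        StringBuilder text = new StringBuilder(); // single buffer, reused for every text run
        try {
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp), 64 * 1024)) {
                XMLStreamReader reader = XMLInputFactory.newInstance()
                        .createXMLStreamReader(new StringReader(xml));
                XMLStreamWriter writer = XMLOutputFactory.newInstance()
                        .createXMLStreamWriter(out, "UTF-8");
                while (reader.hasNext()) {
                    switch (reader.next()) {
                        case XMLStreamConstants.START_ELEMENT:
                            flushText(text, writer);
                            writer.writeStartElement(reader.getLocalName());
                            break;
                        case XMLStreamConstants.CHARACTERS:
                            text.append(reader.getText()); // text may arrive in several events
                            break;
                        case XMLStreamConstants.END_ELEMENT:
                            flushText(text, writer);
                            writer.writeEndElement();
                            break;
                        default:
                            break;
                    }
                }
                writer.flush();
                writer.close(); // does not close the underlying stream
                reader.close();
            }
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }

    /** Writes the buffered text, then clears the buffer instead of allocating a new one. */
    private static void flushText(StringBuilder text, XMLStreamWriter writer) throws XMLStreamException {
        if (text.length() > 0) {
            writer.writeCharacters(text.toString().toUpperCase(Locale.ROOT));
            text.setLength(0);
        }
    }
}
```

If parsing fails midway, the exception propagates, the temporary file is deleted and no partial target file is left behind. Whether this is measurably faster depends on the disk and the corpus, as the test results in this ticket suggest.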

Compiling REGEX patterns

  • compile all the REGEX patterns used throughout the import process in the Groovy scripts, using the "~" pattern operator (which yields a compiled java.util.regex.Pattern)
  • e.g. replace: reg3pts = /\A(.*)(\.\.\.)(.*)\Z/ with: reg3pts = ~/\A(.*)(\.\.\.)(.*)\Z/
  • also compile the patterns passed to replaceAll(), split(), etc., then call these methods on the compiled Pattern (split()) or its Matcher (replaceAll()) rather than on the String class
  • see Pattern.compile() for the Java code sections

These changes made a very significant difference (see note #3 below), although they were tested only on a small corpus and only on the tokenization section.
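For the Java code sections, the same idea can be sketched as below. The class and method names (CompiledPatterns, splitEllipsis, etc.) and the whitespace pattern are hypothetical and only serve the illustration; the three-dots pattern is the one quoted above. String.split() and String.replaceAll() generally compile their regex argument on every call, whereas the constants here are compiled once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompiledPatterns {

    // Compiled once when the class is loaded, then reused for every token.
    private static final Pattern THREE_DOTS = Pattern.compile("\\A(.*)(\\.\\.\\.)(.*)\\Z");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Equivalent of s.split("\\s+"), without recompiling the pattern. */
    public static String[] splitOnWhitespace(String s) {
        return WHITESPACE.split(s);
    }

    /** Equivalent of s.replaceAll("\\s+", " "), without recompiling the pattern. */
    public static String normalizeSpaces(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }

    /** Splits a token around "..." if present; otherwise returns the token unchanged. */
    public static String[] splitEllipsis(String token) {
        Matcher m = THREE_DOTS.matcher(token);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2), m.group(3) };
        }
        return new String[] { token };
    }
}
```

A Pattern is immutable and safe to share, while a Matcher is not; creating one Matcher per input, as done here, keeps the helpers safe to call from several threads.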

Groovy usage

  • the CPU time spent in the Groovy runtime seems to be noticeable

More tests are needed. (Personally [SJ], I don't think all the import sections should be scripted in Groovy, only those we want to expose to users.)

Reorganizing some process

  • the import steps are executed sequentially; for example, the files seem to be parsed twice when metadata from a CSV file is injected (post-injection into the XML TXM-TEI file)
  • This makes the code more robust and readable but less efficient. One way to improve it would be to register a list of SAX/StAX/DOM element handlers with a common parser: when the parser encounters a given element, it calls the event method of every registered handler, so that several treatments are applied in a single pass. However, the performance gain is probably small compared with the added code complexity (and the need to resolve conflicts between handlers). To be discussed.
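The common-parser idea could look like the following Java sketch, written with StAX. Everything here is hypothetical: the ElementHandler interface, the two example handlers and the parseOnce() method are invented for illustration and do not exist in TXM. The document is read once, and each event is dispatched to every registered handler.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class SinglePassImport {

    /** Hypothetical callback interface: one implementation per import step. */
    public interface ElementHandler {
        void startElement(String name);
        void characters(String text);
    }

    /** Example step: counts the <w> elements. */
    public static class WordCounter implements ElementHandler {
        public int count = 0;
        @Override public void startElement(String name) { if ("w".equals(name)) count++; }
        @Override public void characters(String text) { }
    }

    /** Example step: collects all the text content. */
    public static class TextCollector implements ElementHandler {
        public final StringBuilder text = new StringBuilder();
        @Override public void startElement(String name) { }
        @Override public void characters(String t) { text.append(t); }
    }

    /** Parses the document once and forwards each event to all handlers, in list order. */
    public static void parseOnce(String xml, List<ElementHandler> handlers) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    for (ElementHandler h : handlers) h.startElement(name);
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    String text = reader.getText();
                    for (ElementHandler h : handlers) h.characters(text);
                }
            }
        } finally {
            reader.close();
        }
    }
}
```

The handlers in this sketch only read events. Steps that rewrite the document would also need an agreed order and a way to resolve conflicting modifications, which is the complexity the proposal warns about.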

SimpleTokenizerXml.groovy (17.67 KB) Sebastien Jacquot, 17/12/2015 14:30


Subtasks

Task #1666: TBX: improve performances of tokenizing process (New)

History

#1 Updated by Sebastien Jacquot almost 3 years ago

  • Description updated (diff)

#2 Updated by Sebastien Jacquot almost 3 years ago

  • Description updated (diff)

#3 Updated by Sebastien Jacquot almost 3 years ago

Some tokenization tests have been run with BufferedOutputStream and various buffer sizes, but it does not seem to speed up the process significantly.
With: new BufferedOutputStream(new FileOutputStream(outfile), size)
on a corpus of 1215 files: with = 2 min 55, without = 3 min.
The tests were run from a TXM launched from Eclipse with all logs activated, which explains these "long" durations.
(For information, with compiled REGEX patterns the tokenization process takes about 1 min 45 s in the same environment and on the same corpus.)

#4 Updated by Sebastien Jacquot almost 3 years ago

Monitoring the import process shows that the cost of Groovy runtime reflection is significant. I guess this is due to the dynamic resolution of class/method names and types. More tests are needed; I suggest using @CompileStatic on the classes and methods of the import sections wherever possible, to avoid dynamic reflection.
With this change, on the previous corpus and environment, the tokenization process takes about 1 min 20.
See:

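import groovy.transform.CompileStatic
import static groovy.transform.TypeCheckingMode.SKIP
@CompileStatic        // on a class or method: compile it statically
@CompileStatic(SKIP)  // on a method of such a class: keep it dynamically compiled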

-----------
Tests summary on the previous 1215-file corpus: duration of the tokenization process (/org.txm.toolbox/src/groovy/filters/Tokeniser/SimpleTokenizerXml.groovy)

Configuration | Duration
Default | 3 min
Pattern compilation for matches | 1 min 48
Pattern compilation also for split() and replaceAll() | 1 min 43
Groovy static compilation, except for the standardChecks() method (which needs to be modified to be statically compiled) | 1 min 30
Groovy static compilation, with standardChecks() modified to use Matcher so that its types can be statically resolved | 1 min 20
-----------

#5 Updated by Sebastien Jacquot almost 3 years ago

Here is the Groovy script used for the tests.

#6 Updated by Sebastien Jacquot almost 3 years ago

The BufferedOutputStream tests should be repeated on a machine without an SSD (mine has one).

#7 Updated by Sebastien Jacquot 2 months ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0
