Task #1630

TBX: improve performance of the import process

Added by Sebastien Jacquot almost 4 years ago. Updated 9 months ago.

Status: New
Start date: 02/10/2016
Priority: Normal
Due date:
Assignee: -
% Done: 80%
Category: Development
Spent time: -
Target version: TXM X.X

Description

The Groovy code of some import sections could probably be optimized.
Here is a list of proposals.

Buffer management

  • test whether StringBuilder (not thread-safe) is more efficient than the StringBuffer we currently use.
  • other tests:
      • instead of instantiating a new buffer (new StringBuffer) on each parser event, reuse the same buffer and clear it (e.g. with setLength(0) or delete()).
      • all the write methods of the XML writer (writeCharacters(), etc.) seem to write directly to the output file. There may be a way to buffer the content so that the file is not accessed on each StAX event. It would also be safer, because the file would be created only if the whole process succeeded, rather than leaving a partial, invalid file. For example, wrapping the file stream in a BufferedOutputStream may reduce disk accesses and be faster (see: http://tutorials.jenkov.com/java-io/bufferedoutputstream.html).

These tests showed no significant difference, but they were run on a small corpus and only on the tokenization section.
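
The buffer-reuse proposal above can be sketched in Java as follows. This is only an illustration, not the TXM code: the class name BufferReuse, the method collect() and the "events" array are hypothetical stand-ins for the parser events handled by the import scripts.

```java
import java.util.ArrayList;
import java.util.List;

class BufferReuse {

    // Builds one string per "event" with a single StringBuilder that is
    // cleared between events instead of being re-created each time.
    // The String[][] input is a hypothetical stand-in for parser events.
    static List<String> collect(String[][] events) {
        List<String> out = new ArrayList<>();
        StringBuilder buffer = new StringBuilder(256); // allocated once
        for (String[] chunks : events) {
            buffer.setLength(0); // clears the content, keeps the internal array
            for (String chunk : chunks) {
                buffer.append(chunk);
            }
            out.add(buffer.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] events = { { "Hello", ", " }, { "world" }, {} };
        System.out.println(collect(events));
    }
}
```

StringBuilder has the same API as StringBuffer without the synchronization, so the swap is mechanical as long as the buffer is never shared between threads.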

Compiling REGEX patterns

  • compile all regex patterns used throughout the import process in the Groovy scripts (use the "~" pattern operator, i.e. "= ~/.../")
  • e.g. replace reg3pts = /\A(.*)(\.\.\.)(.*)\Z/ with reg3pts = ~/\A(.*)(\.\.\.)(.*)\Z/
  • also compile the patterns used in replaceAll(), split(), etc., then call these methods on the compiled Pattern or its Matcher rather than on the String class
  • see Pattern.compile() for the Java code sections

These tests showed a very significant difference (only tested on a small corpus and on the tokenization section).
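
For the Java code sections, the same idea could look like the sketch below. The regex is the reg3pts example from the list above; the class name, the WHITESPACE pattern and the three helper methods are hypothetical and only illustrate calling matches, replaceAll() and split() through a precompiled Pattern instead of through String.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CompiledPatterns {

    // Compiled once and reused for every token (reg3pts comes from the ticket).
    static final Pattern REG_3PTS = Pattern.compile("\\A(.*)(\\.\\.\\.)(.*)\\Z");
    // Hypothetical second pattern, used to illustrate replaceAll() and split().
    static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Splits a token around an ellipsis, or returns it unchanged.
    static String[] splitEllipsis(String token) {
        Matcher m = REG_3PTS.matcher(token);
        if (m.matches()) {
            return new String[] { m.group(1), m.group(2), m.group(3) };
        }
        return new String[] { token };
    }

    // Equivalent of s.replaceAll("\\s+", " ") without recompiling the regex.
    static String normalizeSpaces(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }

    // Equivalent of s.split("\\s+") without recompiling the regex.
    static String[] words(String s) {
        return WHITESPACE.split(s);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitEllipsis("wait...what")));
        System.out.println(normalizeSpaces("a  b\tc"));
        System.out.println(Arrays.toString(words("a  b\tc")));
    }
}
```

String.replaceAll() and String.matches() compile their regex argument on every call, which is why hoisting the patterns into constants matters inside a per-token loop.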

Groovy usage

  • the CPU time spent in Groovy runtime processing seems noticeable

More tests are needed. (SJ: personally, I don't think all the import sections should be scripted in Groovy, only those we want to expose to users.)

Reorganizing some process

  • the import steps are executed sequentially; for example, the files seem to be parsed twice when injecting metadata from a CSV file (post-injection into the XML TXM-TEI file)
  • This makes the code more robust and readable but less efficient. One way to improve it would be to register a list of SAX/StAX/DOM element handlers with a common parser: when the parser encounters a given element, it calls the event methods of every registered handler, so that several treatments are applied in a single pass. However, the performance gain is probably small compared to the added code complexity (and the need to resolve potential conflicts between handlers). To be discussed.
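
A minimal sketch of the "common parser with a list of handlers" idea, using StAX. Everything here is hypothetical and not taken from the TXM sources: the ElementHandler interface, the two example handlers and the parseOnce() method only illustrate one parse feeding several treatments.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class SinglePassDispatch {

    // Hypothetical contract that each import step would implement.
    interface ElementHandler {
        void startElement(String localName);
    }

    // Example step 1: counts <w> elements.
    static class WordCounter implements ElementHandler {
        int count = 0;
        public void startElement(String localName) {
            if ("w".equals(localName)) count++;
        }
    }

    // Example step 2: records the names of all elements, in document order.
    static class ElementLogger implements ElementHandler {
        final List<String> names = new ArrayList<>();
        public void startElement(String localName) {
            names.add(localName);
        }
    }

    // Parses the document once and forwards each start-element event
    // to every registered handler.
    static void parseOnce(String xml, List<ElementHandler> handlers) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        for (ElementHandler h : handlers) {
                            h.startElement(name);
                        }
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
    }

    public static void main(String[] args) {
        WordCounter counter = new WordCounter();
        ElementLogger logger = new ElementLogger();
        List<ElementHandler> handlers = new ArrayList<>();
        handlers.add(counter);
        handlers.add(logger);
        parseOnce("<text><s><w>a</w><w>b</w></s></text>", handlers);
        System.out.println(counter.count + " words, elements: " + logger.names);
    }
}
```

The sketch only covers read-only handlers. Handlers that rewrite the stream (such as metadata injection) would need an agreed ordering, which is the conflict-resolution cost mentioned above.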

SimpleTokenizerXml.groovy (17.7 kB) Sebastien Jacquot, 12/17/2015 02:30 pm


Subtasks

Task #1666: TBX: improve performances of tokenizing process (New)

History

#1 Updated by Sebastien Jacquot almost 4 years ago

  • Description updated (diff)

#2 Updated by Sebastien Jacquot almost 4 years ago

  • Description updated (diff)

#3 Updated by Sebastien Jacquot almost 4 years ago

Some tokenization tests were run with BufferedOutputStream and various buffer sizes, but it does not seem to speed up the process significantly.
With: new BufferedOutputStream(new FileOutputStream(outfile), size)
on a corpus of 1215 files: with = 2 min 55 s / without = 3 min.
The tests were run from a TXM launched from Eclipse with all logs enabled, which explains these "long" durations.
(For information, with compiled regex patterns the tokenization takes about 1 min 45 s in the same environment and on the same corpus.)
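
For reference, a sketch of the kind of setup tested here: a StAX writer on top of a BufferedOutputStream. Writing to a temporary file and renaming it only on success is an addition of this sketch (it follows the "no partial file" remark in the description), not what was tested. The class name, the writeWords() method, the <text>/<w> element names and the ".tmp" suffix are all hypothetical.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class BufferedXmlOutput {

    // Writes the words as <text><w>..</w>..</text> through a buffered stream.
    // The output goes to a sibling ".tmp" file that is renamed to outfile only
    // if everything succeeded, so a failure leaves no partial outfile behind.
    static void writeWords(Path outfile, List<String> words, int bufferSize) {
        Path tmp = outfile.resolveSibling(outfile.getFileName() + ".tmp");
        try {
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp), bufferSize)) {
                XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
                w.writeStartDocument("UTF-8", "1.0");
                w.writeStartElement("text");
                for (String word : words) {
                    w.writeStartElement("w");
                    w.writeCharacters(word);
                    w.writeEndElement();
                }
                w.writeEndElement();
                w.writeEndDocument();
                w.flush(); // push the writer's content into the buffered stream
                w.close(); // does not close the underlying stream
            }
            Files.move(tmp, outfile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException | XMLStreamException e) {
            try {
                Files.deleteIfExists(tmp);
            } catch (IOException ignored) {
                // best-effort cleanup of the temporary file
            }
            throw new IllegalStateException("XML output failed", e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("tokens", ".xml");
        writeWords(out, Arrays.asList("a", "b"), 64 * 1024);
        System.out.println(new String(Files.readAllBytes(out), "UTF-8"));
    }
}
```

The bufferSize parameter corresponds to the "size" argument varied in the tests above.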

#4 Updated by Sebastien Jacquot almost 4 years ago

Monitoring the import process shows that the cost of Groovy runtime reflection is significant. I guess this is due to the dynamic resolution of class/method names and types. More tests are needed, but I suggest using @CompileStatic on the import section classes and methods wherever possible, to avoid dynamic reflection.
With the same corpus and environment as before, this brings the tokenization down to about 1 min 20 s.
See:

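import groovy.transform.CompileStatic
import static groovy.transform.TypeCheckingMode.SKIP

@CompileStatic        // on a class or method: compile it statically
@CompileStatic(SKIP)  // on a method: keep it dynamic inside a statically compiled class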

-----------
Summary of the tokenization durations measured on the same 1215-file corpus (/org.txm.toolbox/src/groovy/filters/Tokeniser/SimpleTokenizerXml.groovy):

Default: 3 min
Pattern compilation for matches: 1 min 48
Pattern compilation also for split() and replaceAll(): 1 min 43
Groovy static compilation, except for the standardChecks() method (which needs to be modified to compile statically): 1 min 30
Groovy static compilation, with standardChecks() modified to use Matcher so that its types can be statically resolved: 1 min 20
-----------

#5 Updated by Sebastien Jacquot almost 4 years ago

Here is the Groovy script used for the tests.

#6 Updated by Sebastien Jacquot almost 4 years ago

The BufferedOutputStream tests should be rerun on a machine without an SSD (mine has one).

#7 Updated by Sebastien Jacquot over 1 year ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#8 Updated by Sebastien Jacquot 9 months ago

  • Target version changed from TXM 0.8.0 to TXM X.X
