Task #1630: TBX: improve performances of import process - Plateforme TXM - Forge du Centre Blaise Pascal

Updated by Sebastien Jacquot almost 10 years ago

Some import sections of the Groovy code could probably be improved.
Here is a list of proposals.

h3. Buffer management

* test whether StringBuilder (not thread-safe) is more efficient than the StringBuffer we currently use.
* other tests:
** do not instantiate a new buffer (new StringBuffer) at each parser event; reuse the same buffer and clear it (e.g. with setLength(0) or delete()).
** all the write methods of the XML writer (writeCharacters(), etc.) seem to write directly to the output file. There may be a way to buffer/cache the content instead of accessing the file at each StAX event. It would also be safer, because the file would be created only if the whole process succeeded, rather than leaving a partial, invalid file. For example, wrapping the output in a BufferedOutputStream when creating the files may reduce disk accesses and potentially be faster.

These tests showed no significant difference, but they were run only on a small corpus and only in the tokenization section.
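The buffer proposals above can be sketched in Java as follows. This is only an illustration, not the TXM import code: the class @BufferedImportSketch@ and its methods are hypothetical, the "transformation" is a plain copy of elements and text (attributes and namespaces are ignored), and writing to a temporary file that is moved into place at the end is one assumed way to avoid leaving a partial file.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch: names are illustrative, not taken from the TXM sources.
class BufferedImportSketch {

    // Writes to a temporary file first; the target only appears if parsing succeeded.
    static void copy(String xml, Path target) throws IOException, XMLStreamException {
        Path tmp = Files.createTempFile(target.toAbsolutePath().getParent(), "import", ".tmp");
        try {
            // BufferedOutputStream groups the many small StAX writes into larger disk writes.
            try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(tmp))) {
                transform(xml, out);
            }
            Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            Files.deleteIfExists(tmp); // no-op after a successful move
        }
    }

    static void transform(String xml, OutputStream out) throws XMLStreamException {
        XMLStreamReader in = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        StringBuilder text = new StringBuilder(); // one instance for the whole parse
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    flushText(text, w);
                    w.writeStartElement(in.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(in.getText()); // text may arrive in several chunks
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    flushText(text, w);
                    w.writeEndElement();
                    break;
                default:
                    break; // other events are ignored in this sketch
            }
        }
        w.flush();
        w.close(); // does not close the underlying stream
        in.close();
    }

    // Emits the accumulated text, then clears the buffer instead of allocating a new one.
    static void flushText(StringBuilder text, XMLStreamWriter w) throws XMLStreamException {
        if (text.length() > 0) {
            w.writeCharacters(text.toString());
            text.setLength(0);
        }
    }
}
```

Whether any of this is measurably faster would still have to be checked on a large corpus, since the tests reported above showed no significant difference.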

h3. Compiling REGEX patterns

* compile all the REGEX patterns used throughout the import process in the Groovy scripts (use "= ~")
** e.g. replace: reg3pts = /\A(.*)(\.\.\.)(.*)\Z/ with: reg3pts = ~/\A(.*)(\.\.\.)(.*)\Z/
* also compile the patterns used in replaceAll(), split(), etc., then call these methods on the compiled Pattern or its Matcher (Pattern.split(), Matcher.replaceAll()) rather than on the String class
* see Pattern.compile() for the Java code sections

These tests showed a very significant difference (only tested on a small corpus and in the tokenization section).
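For the Java code sections, the same idea might look like the sketch below. Only the @reg3pts@ expression comes from the proposal above; the class @TokenizerPatterns@, the whitespace pattern and the three helper methods are hypothetical and only show where the compiled Pattern replaces String.replaceAll() and String.split().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: only the reg3pts expression is taken from the proposal.
class TokenizerPatterns {

    // Compiled once at class load time instead of at every call.
    static final Pattern REG_3PTS = Pattern.compile("\\A(.*)(\\.\\.\\.)(.*)\\Z");
    static final Pattern WHITESPACE = Pattern.compile("\\s+"); // illustrative pattern

    // Splits a token around an ellipsis, or returns it unchanged if there is none.
    static String[] splitEllipsis(String token) {
        Matcher m = REG_3PTS.matcher(token);
        if (!m.matches()) {
            return new String[] { token };
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    // Matcher.replaceAll() on the compiled pattern, instead of String.replaceAll().
    static String normalizeSpaces(String s) {
        return WHITESPACE.matcher(s).replaceAll(" ");
    }

    // Pattern.split() on the compiled pattern, instead of String.split().
    static String[] words(String s) {
        return WHITESPACE.split(s.trim());
    }
}
```

String.replaceAll() and String.split() generally recompile their regular expression on each call, which is the cost this avoids when the method runs once per token.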

h3. Groovy usage

* the CPU time consumed by the Groovy runtime itself seems to be noticeable

More tests are needed. (Personally [SJ], I don't think all the import sections should be scripted in Groovy, only those we want to expose to users.)

h3. Reorganizing some processes

* the import steps are executed sequentially; for example, the files seem to be parsed twice when injecting metadata from a CSV file (post-injection into the XML TXM-TEI file)
* This makes the code more robust and readable but less efficient. One way to improve it could be to keep a list of SAX/StAX/DOM element handlers and pass it to a common parser: when the parser encounters a given element, it calls the event methods of every handler in the list, so that more than one treatment is applied in a single pass. However, the performance gain is probably not significant compared to the complexity of the resulting code (and the resolution of potential conflicts). To be discussed.
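A minimal sketch of the common-parser idea, using StAX and assuming each import step can be expressed as a callback on start elements. The class @SinglePassSketch@ and the @ElementHandler@ interface are hypothetical; a real version would also need end-element and text callbacks, plus a policy for handlers that want to modify the same element.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: names are illustrative, not taken from the TXM sources.
class SinglePassSketch {

    // One implementation per import step (e.g. tokenization, metadata injection).
    interface ElementHandler {
        void startElement(XMLStreamReader reader);
    }

    // Parses the document once and dispatches each start element to every handler.
    static void parse(String xml, List<ElementHandler> handlers) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                for (ElementHandler handler : handlers) {
                    handler.startElement(reader); // handlers run in list order
                }
            }
        }
        reader.close();
    }
}
```

Two steps that currently each parse the file could then be registered as two handlers and run in one pass. Handlers must only read from the shared reader and never advance it, which is part of the added complexity mentioned above.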
