Bug #1179

Updated by Matthieu Decorde over 4 years ago

In XML sources, when a word is pre-encoded with a <w>...</w> tag
and its content contains a line break, the resulting word form is incorrect:
the line break is simply removed from the graphical form, so the two parts are joined.

For example, <w>parce
que</w> yields the word form 'parceque' instead of 'parce que'.

h3. Solution 1

Replace every new-line and tab character with a space character at the tokenization level.

MD: 80% -> 70%; must check whether the Unicode character classes are used.
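
A minimal sketch of this replacement, not the actual tokenizer code: the class and method names are hypothetical, and treating a Windows-style "\r\n" pair as a single new-line is an assumption that the ticket does not state.

```java
public class NewlineTabNormalizer {

    // Hypothetical helper: replaces each new-line or tab in a word form with one space.
    // Assumption: a "\r\n" pair counts as ONE new-line, so it becomes a single space.
    public static String normalize(String form) {
        return form.replaceAll("\\r\\n|[\\r\\n\\t]", " ");
    }

    public static void main(String[] args) {
        // The example from this ticket: content of <w>parce\nque</w>
        System.out.println(normalize("parce\nque")); // prints "parce que"
    }
}
```

Only new-line and tab characters are handled here; any other whitespace character in the word form is left unchanged.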

h3. Solution 2

Replace any whitespace character, as defined by Java, with a space character.

Java whitespace characters are defined by the "isWhitespace method":http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-
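
A minimal sketch of this variant, assuming the replacement is applied to the raw content of the <w> element; the class and method names are hypothetical and are not taken from the actual code base.

```java
public class WhitespaceNormalizer {

    // Hypothetical helper: replaces every code point that Java considers
    // whitespace (Character.isWhitespace(int)) with a plain space.
    public static String normalize(String form) {
        StringBuilder sb = new StringBuilder(form.length());
        form.codePoints().forEach(cp -> {
            if (Character.isWhitespace(cp)) {
                sb.append(' ');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The example from this ticket: content of <w>parce\nque</w>
        System.out.println(normalize("parce\nque")); // prints "parce que"
    }
}
```

Note that @Character.isWhitespace@ returns false for the non-breaking spaces U+00A0, U+2007 and U+202F, so those characters are left untouched by this sketch.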