Bug #1179
Updated by Matthieu Decorde over 4 years ago
In XML sources, when a word is pre-encoded with a <w>...</w> tag
and its content contains a line break, the resulting word form is incorrect
because the line break is simply removed from the graphical form instead of being replaced by a space.
For example: <w>parce
que</w> yields the word form 'parceque' instead of 'parce que'.
h3. Solution 1
Replace every newline and tab character with a space character at the tokenization level.
MD: 80% -> 70%; must check whether the Unicode character classes are used.
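
A minimal sketch of Solution 1, assuming the word form is available as a plain Java string before it is written out. The class and method names (@NewlineTabNormalizer@, @normalize@) are hypothetical and do not come from the project's tokenizer. Carriage returns are deliberately left untouched here, since the solution only names newline and tab.

```java
final class NewlineTabNormalizer {

    // Hypothetical helper, not part of the actual tokenizer API.
    // Replaces each line feed and each tab with a single space.
    // Consecutive replacements are not collapsed.
    static String normalize(String wordForm) {
        return wordForm.replace('\n', ' ').replace('\t', ' ');
    }

    public static void main(String[] args) {
        // The example from this ticket: a line break inside the word content.
        System.out.println(normalize("parce\nque")); // prints "parce que"
    }
}
```

With this approach, a Windows line ending (carriage return followed by line feed) would leave the carriage return inside the word form, which is one reason to consider Solution 2.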
h3. Solution 2
Replace any whitespace character, as defined by Java, with a space character.
Java whitespace characters are defined by the "isWhitespace method":(http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isWhitespace-int-)
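
Solution 2 could be sketched as follows, again assuming the word form is handled as a Java string. The class and method names (@WhitespaceNormalizer@, @normalize@) are hypothetical. The sketch iterates over code points so that it calls the @isWhitespace(int)@ variant referenced above.

```java
final class WhitespaceNormalizer {

    // Hypothetical helper, not part of the actual tokenizer API.
    // Replaces every code point that Character.isWhitespace(int) accepts
    // with a plain space (U+0020). Runs of whitespace are not collapsed.
    static String normalize(String wordForm) {
        StringBuilder out = new StringBuilder(wordForm.length());
        wordForm.codePoints().forEach(cp -> {
            if (Character.isWhitespace(cp)) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("parce\nque")); // prints "parce que"
    }
}
```

Note that @Character.isWhitespace@ returns false for the non-breaking spaces (U+00A0, U+2007, U+202F), so those characters are kept unchanged by this sketch. Whether that is the desired behavior for word forms would need to be decided separately.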