Bug #2160
Updated by Matthieu Decorde more than 4 years ago
For some texts, words are not highlighted in editions.
The IDs of the words that are not highlighted contain characters that break the CSS ID syntax rules (e.g. " ", "(" and others).
h3. Discussion
Word IDs are either built as <text identifier + number> or taken from the sources.
If the import modules forge the word IDs, text names must be normalized/reduced to a text identifier, at the level of the corpus.
Three strategies:
* a) normalize/reduce characters or morphemes
* b) escape characters
* c) maintain a <text name>:<automatic text identifier> hash
b) requires escaping according to the syntax in which the identifier is read, for example CSS syntax. Different escape algorithms may therefore be needed depending on the context. See the XXX Java library, which escapes for many different syntaxes.
c) requires using the hash in various contexts, e.g. concordance references.
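To illustrate what strategy b) would involve for the CSS context alone, here is a minimal sketch that escapes an identifier for use in a CSS selector. It follows the serialization rules of the CSSOM @CSS.escape()@ algorithm. The class name @CssIdEscaper@ and the sample word ID are hypothetical; this is neither TXM code nor the XXX library mentioned above.

```java
public class CssIdEscaper {

    /** Escapes an identifier for a CSS selector (hypothetical helper). */
    public static String escape(String id) {
        StringBuilder out = new StringBuilder();
        int length = id.length();
        for (int i = 0; i < length; ) {
            int cp = id.codePointAt(i);
            boolean first = (i == 0);
            boolean secondAfterDash = (i == 1 && id.charAt(0) == '-');
            if (cp == 0) {
                // NULL is replaced by the replacement character
                out.append('\uFFFD');
            } else if (cp <= 0x1F || cp == 0x7F
                    || ((first || secondAfterDash) && cp >= '0' && cp <= '9')) {
                // Control characters and leading digits: hex escape plus a space
                out.append('\\').append(Integer.toHexString(cp)).append(' ');
            } else if (first && cp == '-' && length == 1) {
                // A lone "-" is not a valid identifier
                out.append("\\-");
            } else if (cp >= 0x80 || cp == '-' || cp == '_'
                    || (cp >= '0' && cp <= '9')
                    || (cp >= 'A' && cp <= 'Z')
                    || (cp >= 'a' && cp <= 'z')) {
                // Characters allowed as-is in a CSS identifier
                out.appendCodePoint(cp);
            } else {
                // Any other character (space, parenthesis...): backslash escape
                out.append('\\').appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical word ID containing a space and parentheses
        System.out.println("#" + escape("w_my text(1)_3"));
    }
}
```

This only covers CSS. Each other context that reads the identifier (CQL, for instance) would need its own escaping routine, which is the cost of strategy b).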
h3. Solution
Define the simplest common syntax that is compatible with both the CSS ID syntax and the CQL syntax.
Apply a): fix the XMLw to XML-TXM step of the import modules, in the XML2Ana class:
* normalize/reduce the text ID to that base syntax
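As a rough sketch of what such a normalization could look like, the class below reduces a text name to an identifier made only of ASCII letters, digits, "_" and "-", starting with a letter. This target syntax is an assumption chosen to be conservative: the ticket does not specify it. The class name @TextIdNormalizer@ and the "t" prefix are hypothetical, and this is not the actual XML2Ana code.

```java
import java.text.Normalizer;

public class TextIdNormalizer {

    /** Reduces a text name to an assumed safe ID syntax (hypothetical helper). */
    public static String normalize(String textName) {
        // Decompose accented characters, then drop the combining marks: "é" -> "e"
        String s = Normalizer.normalize(textName, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Replace each run of disallowed characters with a single underscore
        s = s.replaceAll("[^A-Za-z0-9_-]+", "_");
        // Trim leading and trailing underscores
        s = s.replaceAll("^_+|_+$", "");
        // Make sure the ID starts with a letter (assumed "t" prefix)
        if (s.isEmpty() || !Character.isLetter(s.charAt(0))) {
            s = "t" + s;
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize("Roman de la Rose (1)")); // Roman_de_la_Rose_1
        System.out.println(normalize("été 2012"));             // ete_2012
        System.out.println(normalize("1984"));                 // t1984
    }
}
```

Note that a reduction of this kind is lossy: two different text names can map to the same identifier, so uniqueness would still have to be checked at the level of the corpus.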
h3. Solution 2 (not done, see #2364)
* add a new import option "force word id generation" for corpora that already have word IDs.
* add a new load option "force word id generation" for corpora that already have word IDs.