## Feature #2365

### TXM: X.X, word level tags management in XML-TEI imports and TXM commands

Status: New
Start date: 04/11/2018
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

From Ciarán Ó Duibhín's mail to the textometrie list:

I'll try to compare the KWIC concordances produced by Xaira and TXM, but I am no expert on either program and I may make errors. I recognize also that there is much more to TXM than KWIC concordances, and that it has an HTML context display which may be customized to avoid some of my concerns with its KWIC display. Xaira seems confined to the Windows platform, and ceased development in 2010.

My two main concerns are with "non-lexical characters" (P5 guidelines page 575) and "non-original spaces".  These may seem inconsequential matters in many areas of study, but they are central when the focus is on language.

Non-lexical characters are those which should be dropped in indexing but retained in displayed contexts. TEI recommends using the <c> element, but doesn't (as far as I can see) distinguish them from other uses of <c>, so I'll suggest <c type="nonlex">.  My standard example is
<w>b<c type="nonlex">h</c>ean</w>
which would be indexed under "bean" and displayed as "bhean".  Neither Xaira nor TXM supports non-lex chars explicitly.  As a work-around (easy enough to do), both require the input to contain the form without the non-lex chars as an attribute of the form with the non-lex chars:
<w demut="bean">bhean</w>
I would have hoped that TXM, with another eight years of development, would have got around to handling such aspects of TEI linguistic markup more directly.
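To make the requested behaviour concrete, here is a minimal Python sketch that derives both forms from a single `<w>` element and also generates the attribute work-around. It is not part of Xaira or TXM. The function names `word_forms` and `with_demut` are hypothetical, `type="nonlex"` is the value suggested above rather than one defined by TEI, and the snippet assumes un-namespaced elements as in the examples.

```python
import xml.etree.ElementTree as ET


def word_forms(w_xml: str) -> tuple[str, str]:
    """Return (index_form, display_form) for a single <w> element."""
    w = ET.fromstring(w_xml)
    # Display form: every character of the word, non-lexical ones included.
    display = "".join(w.itertext())
    # Index form: same text, minus the content of <c type="nonlex">.
    index_parts = [w.text or ""]
    for child in w:
        is_nonlex = child.tag == "c" and child.get("type") == "nonlex"
        if not is_nonlex:
            index_parts.append("".join(child.itertext()))
        # Text following the child always belongs to the word.
        index_parts.append(child.tail or "")
    return "".join(index_parts), display


def with_demut(w_xml: str) -> str:
    """Rewrite a <w> into the work-around form <w demut="...">...</w>."""
    index, display = word_forms(w_xml)
    out = ET.Element("w", {"demut": index})
    out.text = display
    return ET.tostring(out, encoding="unicode")


print(word_forms('<w>b<c type="nonlex">h</c>ean</w>'))  # ('bean', 'bhean')
print(with_demut('<w>b<c type="nonlex">h</c>ean</w>'))  # <w demut="bean">bhean</w>
```

A pre-processing step of this kind would let the source keep the `<c>` markup while feeding either tool the attribute form it currently needs.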

Non-original spaces are inter-token spaces inserted into the KWIC context in places where there was no such space in the original text. Xaira does rather well at avoiding these, but it *does* turn line-ends into spaces, so you may break a line in your input only where a space is acceptable. This requires some unwelcome manipulation of tokens which are broken at end-of-line, e.g. (KWIC context should show "droch-bhláth")
<w>droch</w><w>-</w><!-- no space --><lb/>
<w>bhláth</w> <w>ar</w>
must be input as
<w>droch</w><w>-</w><!-- no space --><lb/><w>bhláth</w>
<w>ar</w>
and (KWIC context should show "geimhreadh")
<w>geimh-<!-- no space --><pb n="8"/><lb break="no"/>
readh</w> <w>comh</w>
must be input as
<w>geimh<!-- no space --><pb n="8"/><lb break="no"/>readh</w>
<w>comh</w>
In the latter case the end-of-line hyphen is not wanted in the KWIC context, and to ensure this behaviour I have had to drop it from the input text rather than marking it in some way, perhaps as <c type="discard">-</c>. The hyphen should still be kept, however, whenever the text is displayed in its original pages and lines. The break="no" attribute on <lb> had no effect.
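The context text asked for in these examples can be sketched in Python as follows. This is only an illustration of the desired output, not how Xaira or TXM work internally. The function name `kwic_text` is hypothetical, and the rules it applies are assumptions: `<lb break="no"/>` joins the two halves of a word and drops a preceding end-of-line hyphen, a break directly after a hyphen keeps the hyphen without adding a space, and `<c type="discard">` (the marker floated above) is omitted from the context.

```python
import re
import xml.etree.ElementTree as ET

MILESTONES = {"lb", "pb"}


def kwic_text(fragment: str) -> str:
    """Flatten a TEI fragment to context text without non-original spaces."""
    # Wrap the fragment so several top-level <w> elements parse as one tree.
    # ElementTree drops comments such as <!-- no space --> by default.
    root = ET.fromstring(f"<ab>{fragment}</ab>")
    buf = ""
    glue = False  # True while whitespace after a joining break must be dropped

    def add(text: str) -> None:
        nonlocal buf, glue
        if glue:
            text = text.lstrip()
            if not text:
                return
            glue = False
        buf += text

    def visit(el: ET.Element) -> None:
        nonlocal buf, glue
        if el.tag in MILESTONES:
            if el.get("break") == "no":
                # Word continues across the break: drop the end-of-line hyphen.
                buf = buf.rstrip()
                if buf.endswith("-"):
                    buf = buf[:-1]
                glue = True
            elif buf.endswith("-"):
                # Assumption: a break right after a hyphen keeps the hyphen.
                glue = True
            else:
                add(" ")  # an ordinary line or page break counts as a space
        elif el.tag == "c" and el.get("type") == "discard":
            return  # hypothetical marker: content omitted from the context
        else:
            add(el.text or "")
            for child in el:
                visit(child)
                add(child.tail or "")

    visit(root)
    return re.sub(r"\s+", " ", buf).strip()


print(kwic_text('<w>droch</w><w>-</w><!-- no space --><lb/>\n<w>bhláth</w> <w>ar</w>'))
# droch-bhláth ar
print(kwic_text('<w>geimh-<!-- no space --><pb n="8"/><lb break="no"/>\nreadh</w> <w>comh</w>'))
# geimhreadh comh
```

With rules like these the input would not need to be rearranged around line ends, and the hyphen could stay in the source for page-and-line display.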

With TXM, production of the KWIC concordance is delegated to CWB/CQP, which inserts a space into KWIC contexts in many inter-token positions where the original text had no space.  As far as I know, TXM does not allow the user to edit the KWIC lines coming from CQP before displaying them.  The main differences I have noticed from Xaira are:
Unwanted spaces appear in TXM but not in Xaira:
• beside punctuation “ and ” (though not beside most other punctuation)
• between <w> elements, e.g.
<w>a</w><!-- no space --><w>tá</w>
or
<w>droch</w><!-- no space --><w>-</w><!-- no space --><lb/><w>bhláth</w>
Unwanted spaces appear in Xaira but not in TXM:
• in a word broken at end-of-line:
<w>geimh-<pb n="8"/><lb break="no"/>
readh</w>
There is no need in TXM to move the second part of the broken word back to the previous line.
Similar to Xaira:
• to get rid of the hyphen from the context, the last example must be input as
<w>geimh<pb n="8"/><lb break="no"/>
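The difference between the two kinds of KWIC line can be illustrated with a small Python sketch. It assumes that each token carries a per-token flag recording whether a space followed it in the original text. The `Token` class and its `space_after` field are hypothetical names for such a word-level property; nothing here is taken from the TXM or CQP code.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    space_after: bool = True  # hypothetical flag: was a space present in the source?


def join_with_spaces(tokens: list[Token]) -> str:
    """Naive context line: one space between every pair of tokens."""
    return " ".join(tok.form for tok in tokens)


def join_original(tokens: list[Token]) -> str:
    """Context line that only restores spaces present in the original text."""
    parts = []
    for i, tok in enumerate(tokens):
        parts.append(tok.form)
        if tok.space_after and i < len(tokens) - 1:
            parts.append(" ")
    return "".join(parts)


tokens = [Token("droch", False), Token("-", False), Token("bhláth"), Token("ar")]
print(join_with_spaces(tokens))  # droch - bhláth ar
print(join_original(tokens))     # droch-bhláth ar
```

Recording such a flag at import time would be one way to rebuild context lines without non-original spaces, whatever the tokenization.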