Feature #1633

TBX: 0.7.7, exclude punctuation from Lexicon

Added by Serge Heiden over 3 years ago. Updated 4 months ago.

Status: New
Start date: 12/28/2015
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Commands
Spent time: -
Target version: TXM 0.8.1

Description

Currently, a Lexicon is built from the surface forms of all corpus occurrences (aka tokens), that is, with the [] query projected on the word property.

1. This includes punctuation marks, which are not very intuitive or useful by default and make it hard to compare TXM results with those of other software.

2. If punctuation marks are to be distinguished from regular words, a new option could also distinguish grammatical words (or stop words) from plain words, based on CQP queries tuned for each TreeTagger tagset (see https://groupes.renater.fr/wiki/txm-users/public/faq#comment_faire_un_index_sans_les_mots-outils_ou_grammaticaux for examples). The ['punctuation' < 'grammatical word' < 'plain word'] order relation would add a new progressive 'form' -> 'content' axis to the Lexicon command, related to discourse analysis.
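To make the order relation concrete, here is a minimal Python sketch of the three-way classification. It is not TXM code: the names TokenClass, GRAMMATICAL_TAGS, classify and lexicon_tokens are hypothetical, and the tag set is only an illustrative assumption loosely modelled on French TreeTagger tags; a real implementation would rely on CQP queries tuned per tagset, as described above.

```python
import unicodedata
from enum import IntEnum


class TokenClass(IntEnum):
    """Ordered classes: PUNCTUATION < GRAMMATICAL < PLAIN ('form' -> 'content')."""
    PUNCTUATION = 0
    GRAMMATICAL = 1
    PLAIN = 2


# Assumption: an illustrative stop-word tag set, to be tuned for each tagset.
GRAMMATICAL_TAGS = {"DET:ART", "DET:POS", "PRP", "PRP:det", "PRO:PER", "KON"}


def is_punctuation(word):
    """True if every character belongs to a Unicode punctuation category (P*)."""
    return len(word) > 0 and all(
        unicodedata.category(ch).startswith("P") for ch in word
    )


def classify(word, pos):
    """Place a (word, pos) token on the punctuation/grammatical/plain axis."""
    if is_punctuation(word):
        return TokenClass.PUNCTUATION
    if pos in GRAMMATICAL_TAGS:
        return TokenClass.GRAMMATICAL
    return TokenClass.PLAIN


def lexicon_tokens(tokens, min_class):
    """Keep the word forms whose class is at least min_class."""
    return [word for word, pos in tokens if classify(word, pos) >= min_class]
```

With such an ordering, a single threshold selects how far along the 'form' -> 'content' axis the Lexicon goes: TokenClass.PUNCTUATION keeps everything, TokenClass.GRAMMATICAL drops punctuation only, and TokenClass.PLAIN keeps plain words only.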

Solution 1

Build the Lexicon with the [word!="\p{P}"] query by default.
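As an illustration of what this default would do, here is a minimal Python sketch of a frequency list that skips punctuation tokens. It is not TXM code: build_lexicon and is_single_punctuation are hypothetical names, and the sketch assumes that CQP matches the regular expression against the whole token, so that only tokens made of exactly one punctuation character are excluded.

```python
import unicodedata
from collections import Counter


def is_single_punctuation(word):
    """True if the token is exactly one Unicode punctuation character (P*)."""
    return len(word) == 1 and unicodedata.category(word).startswith("P")


def build_lexicon(tokens, exclude_punctuation=True):
    """Count word forms, optionally skipping single-character punctuation."""
    if exclude_punctuation:
        tokens = (t for t in tokens if not is_single_punctuation(t))
    return Counter(tokens)
```

Under that assumption, a multi-character token such as "..." would still be counted; excluding it would need a pattern like "\p{P}+" instead.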

Solution 2

Add a new "excludeStopWords" option to the Lexicon command ("true" by default), which uses:

Discussion

The Lexicon command may exist as an optimisation: compared to the Index command, it answers the default need for a word frequency list quickly and efficiently, especially for big corpora.

If the Lexicon has to be filtered with various options, as the Index is by nature, we may have to add a new option to the Lexicon command related to optimisation. If this option is set, some filtering options may be inactive by default, to give priority to efficiency. This option may be set to 'true' by default based on corpus properties (e.g. size).
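One possible reading of this interaction, sketched in Python. Everything here is hypothetical: LexiconOptions, resolve_options and the LARGE_CORPUS_TOKENS threshold are assumptions made for illustration only, not existing TXM settings.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: arbitrary size threshold above which optimisation is the default.
LARGE_CORPUS_TOKENS = 10_000_000


@dataclass
class LexiconOptions:
    optimise: bool
    exclude_punctuation: bool


def resolve_options(corpus_size: int,
                    optimise: Optional[bool] = None,
                    exclude_punctuation: bool = True) -> LexiconOptions:
    """Derive effective options; the optimise default depends on corpus size."""
    if optimise is None:
        # No explicit choice: decide from the corpus size.
        optimise = corpus_size >= LARGE_CORPUS_TOKENS
    # When optimising, filtering is switched off to give priority to efficiency.
    return LexiconOptions(
        optimise=optimise,
        exclude_punctuation=exclude_punctuation and not optimise,
    )
```

In this sketch an explicit optimise=False always restores filtering, so the user can still get a filtered Lexicon on a big corpus at the cost of speed.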

Discussion 2

Index and Lexicon are so close in terms of functionality that we could simplify the interface by merging the two commands into one (named Lexicon or Index) with default query=[word!="\p{P}"].

History

#1 Updated by Sebastien Jacquot 12 months ago

  • Target version changed from TXM 0.8.0a (split/restructuration) to TXM 0.8.0

#2 Updated by Matthieu Decorde 4 months ago

  • Description updated (diff)
  • Target version changed from TXM 0.8.0 to TXM 0.8.1
