Feature #3373: Macro, Corpus, sample texts to n first words - Plateforme TXM - Forge du Centre Blaise Pascal

Feature #3373

Macro, Corpus, sample texts to n first words

Added by Matthieu Decorde over 2 years ago. Updated almost 2 years ago.

Status: Closed
Start date: 14/03/2023
Priority: Normal
Due date:
Assignee: -
% Done: 100%
Category: Corpus
Spent time: -
Target version:

Description

Help sample a corpus in one of the following ways:

a) import
- cut texts at the first n words after tokenization
  - add 'Sampling/Échantillonnage' section in import parameters form
  - add 'Sample texts to [ ] first words' parameter
  - add 'Cut at sentence boundaries (inclusive)' option parameter

or

b) update
- add new corpus command 'Sample texts at n first words' (on XML-TXM pivot)
  - add 'Number of words' parameter
  - add 'Cut at sentence boundaries (inclusive)' option parameter
  - update corpus

or

c) update
- add new corpus command 'Sample texts from sub-corpus' (on XML-TXM pivot from sub-corpus matches)
  - for example with sub-corpus built with query ~~<text> []{1,10000} and MatchingStrategy set at 'longest'~~
  - update corpus
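
Options a) and b) share the same core operation: keep the first n words of a text, optionally extending the cut to the end of the current sentence. The following Python sketch illustrates that logic on a plain list of tokens. It is not the TXM implementation: the function name `sample_first_words`, the `SENTENCE_END` punctuation set, and the reading of "inclusive" as "keep the whole sentence containing the n-th word" are all assumptions made for illustration.

```python
from typing import List, Sequence

# Assumption: a sentence ends at one of these punctuation tokens.
SENTENCE_END = {".", "!", "?", "…"}


def sample_first_words(
    tokens: Sequence[str],
    n: int,
    cut_at_sentence_boundary: bool = False,
) -> List[str]:
    """Keep the first n tokens of a text.

    With cut_at_sentence_boundary=True, the cut is moved forward until the
    last kept token closes a sentence (or the text ends), so the sentence
    containing the n-th token is kept whole.
    """
    if n <= 0:
        return []
    end = min(n, len(tokens))
    if cut_at_sentence_boundary:
        # Extend the cut until the last kept token is sentence-final.
        while end < len(tokens) and tokens[end - 1] not in SENTENCE_END:
            end += 1
    return list(tokens[:end])


if __name__ == "__main__":
    text = "The cat sat . The dog ran . Bye".split()
    print(sample_first_words(text, 5))
    print(sample_first_words(text, 5, cut_at_sentence_boundary=True))
```

Note that this sketch counts every token, punctuation included, as a word; whether the real parameter counts punctuation is not stated in this ticket.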

Solution

Create the corpus/TruncateTextsAtFirstWords macro to sample the XML-TXM files of a TXM corpus. It takes one parameter: the number of words to keep per text.
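
As a rough illustration of what truncating an XML file to its first n words involves, here is a minimal Python sketch. It is not the macro's code (TXM macros are Groovy scripts, and the macro source is not reproduced in this ticket). It assumes that each file holds a single text and that word tokens are encoded as `w` elements, matched here by local name so that any namespace is ignored; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def truncate_at_first_words(xml_text: str, n: int) -> str:
    """Return the XML with every <w> element after the first n removed.

    Assumption: one text per document, word tokens encoded as <w> elements.
    """
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # root.iter() walks the tree in document order.
    words = [el for el in root.iter() if local_name(el.tag) == "w"]
    for w in words[max(n, 0):]:
        parent_of[w].remove(w)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<text><p><w>a</w><w>b</w></p><p><w>c</w><w>d</w></p></text>"
    print(truncate_at_first_words(sample, 3))
```

This sketch leaves structural elements in place even when they end up empty; whether the actual macro prunes them is not specified here.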

History

#1 Updated by Serge Heiden over 2 years ago

Description updated

#2 Updated by Sebastien Jacquot almost 2 years ago

% Done changed from 80 to 100

#3 Updated by Sebastien Jacquot almost 2 years ago

Status changed from New to Closed
