Feature #3373
Updated by Serge Heiden more than 2 years ago
Help to sample a corpus at:
a)
* import
** cut texts at n first words after tokenization
*** add 'Sampling/Échantillonnage' section in import parameters form
*** add 'Sample texts to [ ] first words' parameter
*** add 'Cut at sentence boundaries (inclusive)' option parameter
or
b)
* update
** add new corpus command 'Sample texts at n first words' (on XML-TXM pivot)
*** add 'Number of words' parameter
*** add 'Cut at sentence boundaries (inclusive)' option parameter
*** update corpus
or
c)
* update
** add new corpus command 'Sample texts from sub-corpus' (on XML-TXM pivot from sub-corpus matches)
*** for example with a sub-corpus built with query @<text> []{1,10000}@ and MatchingStrategy set to 'longest'
*** update corpus
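The cut described in options a) and b) can be sketched as follows. This is a minimal illustration, not TXM code: the function name @sample_first_words@, the representation of a text as a list of sentences (each a list of word strings), and the reading of "inclusive" as "keep the whole sentence that contains the n-th word" are all assumptions.

```python
from typing import List


def sample_first_words(sentences: List[List[str]], n: int,
                       cut_at_sentence_boundaries: bool = False) -> List[List[str]]:
    """Keep the first n words of one text (hypothetical helper).

    sentences: the tokenized text, one list of words per sentence.
    cut_at_sentence_boundaries: if True, the sentence containing the n-th
    word is kept whole (assumed meaning of "inclusive"), so the result
    may contain more than n words.
    """
    kept: List[List[str]] = []
    count = 0
    for sentence in sentences:
        if count >= n:
            break  # quota reached, drop every following sentence
        remaining = n - count
        if cut_at_sentence_boundaries or len(sentence) <= remaining:
            # the sentence fits, or we are asked not to split sentences
            kept.append(list(sentence))
            count += len(sentence)
        else:
            # hard cut in the middle of the sentence
            kept.append(sentence[:remaining])
            count += remaining
    return kept
```

With the sentence option off, the result holds at most n words; with it on, the cut is pushed to the end of the sentence that contains the n-th word.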
h3. Solution
Create the corpus/TruncateTextsAtFirstWords annotation/SampleWords macro to sample the XML-TXM files of a TXM corpus, with one parameter: the number of words to keep per text
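As a rough illustration of what such a truncation does to one file, here is a Python sketch (the macro itself is not shown in this ticket, and this is not its code). It assumes that each XML-TXM file holds one text and that words are encoded as @w@ elements in the TEI namespace; the function name @truncate_words@ is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: XML-TXM words are <w> elements in the TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"
W_TAG = f"{{{TEI_NS}}}w"


def truncate_words(xml_text: str, n: int) -> str:
    """Return the document with only its first n <w> elements kept."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    words = list(root.iter(W_TAG))  # document order
    for w in words[n:]:
        # Removing an element also drops its tail text (usually whitespace).
        parent_of[w].remove(w)
    return ET.tostring(root, encoding="unicode")
```

This sketch leaves emptied containers (paragraphs, sentences) in place and does not handle the sentence-boundary option; a real implementation would also need to write the result back and update the corpus.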