/****************************************************************************/
/* How to use the TreeTagger                                                */
/*                                                                          */
/* Author: Helmut Schmid, CIS, Ludwig-Maximilians-Universität, Germany      */
/****************************************************************************/

    
The TreeTagger consists of two programs: the training program creates
a parameter file from a fullform lexicon and a hand-tagged corpus. The
tagger program reads the parameter file and annotates the text with
part of speech and lemma information. Both programs print information
about their usage when they are called without arguments.

    
Tagging
-------

    
Tagging is done with the *tree-tagger* program.

The first argument is the name of a parameter file which was generated
with the train-tree-tagger program. Parameter files generated on
different platforms or with older versions of train-tree-tagger will
not work.

    
The second argument is the input file. It must be in one-word-per-line
format, i.e. each line contains one token (word, punctuation character
or parenthesis) and should not exceed 1000 characters. Tokens may contain
blanks. It is possible to override the lexical information contained
in the parameter file of the tagger by specifying a list of possible
tags after the token. This list has to be preceded by a tab character
and the elements are separated by tab characters. Pretagging can be
used, e.g., to ensure that certain text-specific expressions are tagged
correctly. Clitics (like "'s", "'re", and "'d" in English or "-la" and
"-t-elle" in French) have to be separated if they were separated in
the training data. (The French and English parameter files available
by ftp expect separation of clitics.)

    
Sample input file:
He
moved
to
New York City	NP
.

    
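The input format described above can also be produced by a small
script. The following Python sketch is only an illustration and is not
part of the TreeTagger distribution. The function name
format_tagger_input and the (token, tags) pair convention are
assumptions made for this example, and the 1000 character
recommendation is treated here as a hard limit.

```python
def format_tagger_input(tokens):
    """Render tokens in one-word-per-line format.

    `tokens` is a list of (token, tags) pairs; `tags` is a possibly
    empty list of pretagging tags for that token.
    """
    lines = []
    for token, tags in tokens:
        if "\t" in token or "\n" in token:
            raise ValueError("tokens must not contain tabs or newlines")
        # The token comes first; every pretagging tag is preceded by a tab.
        line = "\t".join([token, *tags])
        # ASSUMPTION: the recommended maximum is enforced as an error.
        if len(line) > 1000:
            raise ValueError("line exceeds 1000 characters")
        lines.append(line)
    return "\n".join(lines) + "\n"


# The sample input file from above; blanks inside a token are allowed.
sample = [("He", []), ("moved", []), ("to", []),
          ("New York City", ["NP"]), (".", [])]
print(format_tagger_input(sample), end="")
```

The returned string can be written to a file and passed to the tagger
as its second argument.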

    
The third argument is the name of the output file. The output is also
in one-word-per-line format. Depending on the specified options, it
will contain columns with tokens, tags and lemmas. If the third
argument is missing, the output will be printed to standard output. If
the second argument is missing, too, input is read from standard
input.

    
Options:

-token: Prints the token as well.
-lemma: Prints the lemma as well.
-sgml:  Don't tag SGML annotations, i.e. lines starting with '<' and ending
        with '>'.
-threshold <p>: Print all tags with a probability higher than <p> times the
        probability of the best tag.
-prob:  Print tag probabilities (requires option -threshold).
-no-unknown: Print the token rather than <unknown> for unknown lemmas.
-quiet: Don't print status messages.
-pt-with-lemma: If this option is specified, then each pretagging tag
        (see above) has to be followed by whitespace and a lemma.
-pt-with-prob: If this option is specified, then each pretagging tag
        (see above) has to be followed by whitespace and a tag probability
        value. If -pt-with-prob and -pt-with-lemma have been specified,
        then each pretagging tag is followed by a probability and a lemma
        in that order.
-files f: Read the names of input and output files pairwise from the
        file f. The format of f is the lexicon file format described below.
-lex f: Read auxiliary lexicon entries from the file f.
-eos-tag <tag>: The SGML tag <tag> signals the end of a sentence.
        This option implies the option -sgml.

    
Some more exotic options:
-proto: Print lexical information for each word.
  The lexicon type is signalled by one of the characters
  f: The word was found in the full form lexicon.
  c: The word in lowercase was found in the lexicon.
  h: The word contains a hyphen and the word following the hyphen was found
     in the full form lexicon; e.g. instead of "table-wine" only "wine" has
     been found.
  s: The word has been looked up in the suffix lexicon.
  p: Tags have been assigned by pretagging.
-gramotron: Same as -proto but with a different format.
-proto-with-prob: Same as -proto but with lexical tag probabilities.
-print-prob-tree: Print the transition probability tree and exit.
-eps <epsilon>: Value which is used to replace zero lexical frequencies.
  Zero frequencies occur when a word/tag pair is contained in the lexicon
  but not in the training corpus. The default is 0.1.
-base:  Use only lexical probabilities for tagging. This option is only
  useful to obtain a baseline result to which the actual tagger output is
  compared.

    
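The argument order described in this section (parameter file, input
file, output file) can be assembled in a script. The Python sketch
below is an illustration under stated assumptions, not a documented
interface: it assumes that options are given before the positional
arguments, that the program is reachable under the name "tree-tagger",
and that output columns are separated by tab characters. The file names
are hypothetical.

```python
def build_tagger_command(param_file, input_file=None, output_file=None,
                         options=("-token", "-lemma")):
    """Return an argument list for the tree-tagger program.

    ASSUMPTION: options precede the positional arguments and the
    executable is called "tree-tagger"; adjust both to your setup.
    """
    if output_file is not None and input_file is None:
        # The output file is the third argument, so it needs a second one.
        raise ValueError("an output file requires an input file")
    command = ["tree-tagger", *options, param_file]
    if input_file is not None:
        command.append(input_file)
    if output_file is not None:
        command.append(output_file)
    return command


def parse_output_line(line):
    """Split one output line into its columns.

    ASSUMPTION: columns are tab-separated; which columns are present
    depends on the options that were passed to the tagger.
    """
    return line.rstrip("\n").split("\t")


# Hypothetical file names, for illustration only.
print(build_tagger_command("english.par", "input.txt", "output.txt"))
```

Without an input file the resulting command reads from standard input,
and without an output file it prints to standard output, as described
above.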

    
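The selection rule of the -threshold option can be illustrated with a
few lines of Python. This is a sketch of the rule as it is worded
above, not the tagger's actual code; the function name and the
probability values are made up for the example.

```python
def tags_above_threshold(tag_probs, p):
    """Keep the tags whose probability is higher than p times the best one."""
    best = max(tag_probs.values())
    kept = {tag: prob for tag, prob in tag_probs.items() if prob > p * best}
    # Sort by descending probability for readability.
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tag probabilities for a single token.
example = {"VBD": 0.6, "VBN": 0.3, "JJ": 0.1}
print(tags_above_threshold(example, 0.4))
# prints [('VBD', 0.6), ('VBN', 0.3)]
```

With a threshold of 0.4 the cutoff is 0.4 times 0.6, i.e. 0.24, so the
tag with probability 0.1 is dropped.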

    
Training
--------

    
Training is done with the *train-tree-tagger* program. If the program is
called without arguments, the following output is printed:

USAGE: train-tree-tagger <lexicon> <open class file> <infile> <outfile>
       {-cl <context length>} {-dtg <min. decision tree gain>}
       {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}

    
Description of the command line arguments:
* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  itself followed by a Tab character and a sequence of tag-lemma pairs.
  The tags and lemmata are separated by whitespace.

Example:
aback	RB aback
abacuses	NNS abacus
abandon	VB abandon	VBP abandon
abandoned	JJ abandoned	VBD abandon	VBN abandon
abandoning	VBG abandon

    
  Important: Ordinal and cardinal numbers which consist of digits
  should not be included in the lexicon. Otherwise, the tagger will
  not be able to learn how to tag numbers which are not listed in the
  lexicon. Numbers with unusual tags should be added to the lexicon,
  however.

  Remark: The tagger doesn't need the lemmata for tagging. If
  you do not have the lemma information or if you do not plan to
  annotate corpora with lemmas, you can replace the lemma with a dummy
  value, e.g. "-".

    
* <open class file>: name of a file which contains a list of open class tags,
  i.e. possible tags of unknown word forms. This information is needed to
  estimate likely tags of unknown words. This file would typically contain
  adverb, adjective, noun, proper name and perhaps verb tags, but not
  prepositions, determiners, pronouns or numbers.
* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag in that order separated by a tab character.
  Punctuation marks are considered as tokens and must have been tagged as well.

Example:
Pierre	NP
Vinken	NP
,	,
61	CD
years	NNS

* <output file>: name of the file in which the resulting tagger parameters
  are stored.

    
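The lexicon and training data formats described above can be written
with a few helper functions. The Python sketch below is illustrative
only; the function names are assumptions made for this example. The
tag-lemma pairs are joined with tab characters as in the lexicon
example above, and a missing lemma is replaced by the dummy value "-"
mentioned in the remark.

```python
def lexicon_line(word_form, analyses, dummy_lemma="-"):
    """Format one lexicon entry.

    `analyses` is a list of (tag, lemma) pairs; a lemma of None is
    replaced by `dummy_lemma`.
    """
    pairs = []
    for tag, lemma in analyses:
        if lemma is None:
            lemma = dummy_lemma
        # Tag and lemma are separated by a blank.
        pairs.append(f"{tag} {lemma}")
    # Word form, a tab character, then the tag-lemma pairs.
    return "\t".join([word_form, *pairs])


def training_line(token, tag):
    """Format one line of tagged training data: token, tab, tag."""
    return f"{token}\t{tag}"


print(lexicon_line("abandon", [("VB", "abandon"), ("VBP", "abandon")]))
print(lexicon_line("aback", [("RB", None)]))
print(training_line("Pierre", "NP"))
```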

    
The following parameters are optional:

* -cl <context length>: number of preceding words forming the tagging
  context. The default is 2 which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it could be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: Threshold - If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
  The default value is 0.7.
* -ecw <eq. class weight>: weight of the equivalence class based probability
  estimates. The default is 0.15.
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an
  affix tree is below this threshold, it is deleted. The default is 1.2.
* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT". It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".

    
The accuracy of the TreeTagger usually improves a bit if different
settings of the above parameters are tested and the best combination
is chosen.
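
One way to try several settings is to generate one training command
per parameter combination. The Python sketch below only builds the
argument lists, following the USAGE line above. The candidate values,
the file names and the naming scheme for the output files are
assumptions made for this example. Running the commands and comparing
the accuracy of the resulting parameter files is not shown.

```python
from itertools import product


def training_commands(lexicon, open_class_file, infile, outfile_prefix,
                      context_lengths=(1, 2), tree_gains=(0.5, 0.7, 1.0)):
    """Yield one train-tree-tagger argument list per combination."""
    for cl, dtg in product(context_lengths, tree_gains):
        # ASSUMPTION: encode the settings in the output file name.
        outfile = f"{outfile_prefix}-cl{cl}-dtg{dtg}.par"
        yield ["train-tree-tagger", lexicon, open_class_file, infile,
               outfile, "-cl", str(cl), "-dtg", str(dtg)]


# Hypothetical file names, for illustration only.
for command in training_commands("lexicon.txt", "open-class.txt",
                                 "train.txt", "model"):
    print(" ".join(command))
```

The same pattern extends to -ecw and -atg by adding further value
lists to the product.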