/***************************************************************************/
/* How to use the TreeTagger                                               */
/* Author: Helmut Schmid, University of Stuttgart, Germany                 */
/***************************************************************************/

The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a hand-tagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.

If either of the programs is called without arguments, it will print
information about its usage.
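
For example (assuming the default ./bin location), running

  ./bin/tree-tagger

without any arguments prints a short usage message.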


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If neither an input file nor an output file
is specified, the tagger will print to stdout.

tree-tagger {-options-} <parameter file> {<input file> {<output file>}}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file has to be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token. This list has
  to be preceded by a tab character and its elements are separated by tab
  characters. This pretagging feature can be used, e.g., to ensure that
  certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data. (The French and
  English parameter files available by ftp expect separation of clitics.)
  Sample input file:
    He
    moved
    to
    New York City	NP
    .
* <output file>: Name of the file to which the tagger should write its output.
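
For example, a tokenized file can be tagged as follows (english.par and the
file names are placeholders for whatever is installed on your system):

  ./bin/tree-tagger english.par input.txt output.txt

or, reading from stdin and writing to stdout:

  ./bin/tree-tagger english.par < input.txt > output.txt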

Further optional command line arguments:

* -token: tells the tagger to also print the words.
* -lemma: tells the tagger to also print the lemmas of the words.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).
* -no-unknown: If an unknown word is encountered, emit the word form
  as lemma. This was previously the default behaviour. Now, the default
  behaviour is to print "<unknown>" as lemma.
* -threshold <p>: This option tells the tagger to print all tags of a
  word with a probability higher than <p> times the largest probability.
  (The tagger will use a different algorithm in this case, and the set of
  best tags might differ from the tags generated without this option.)
* -prob: Print tag probabilities (in combination with option -threshold).
* -pt-with-prob: If this option is specified, then each pretagging tag
  (see above) has to be followed by a whitespace and a tag probability
  value.
* -pt-with-lemma: If this option is specified, then each pretagging tag
  (see above) has to be followed by a whitespace and a lemma. Lemmas may
  contain blanks.
  If both -pt-with-prob and -pt-with-lemma are specified, then each
  pretagging tag is followed by a probability and a lemma, in that order
  (see the example below).
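
For illustration, with both options enabled an input line might look like
this, where each tag is preceded by a tab character and followed by a
probability and a lemma (the tags, probabilities, and lemmas here are made
up for the example):

  suspects	NNS 0.7 suspect	VBZ 0.3 suspect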

The options below are for advanced users. Please read the papers on the
TreeTagger to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form has been found is also indicated: 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  This is the case if a word/tag pair is contained in the lexicon but not
  in the training corpus. The choice of this parameter has only a minor
  influence on the tagging accuracy.
* -base: If this option is specified, only lexical information is used
  for tagging, but no contextual information about the preceding tags.
  This option is only useful for obtaining a baseline result against
  which the actual tagger output can be compared.
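
For example, such a baseline run (again with placeholder file names) would be:

  ./bin/tree-tagger -base english.par input.txt baseline.txt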


Training
--------

Training is done with the train-tree-tagger program. It expects at least
four command line arguments, which are described below.

train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
  and each lemma is preceded by a blank or tab character.
  Example:

aback	RB aback
abacuses	NNS abacus
abandon	VB abandon	VBP abandon
abandoned	JJ abandoned	VBD abandon	VBN abandon
abandoning	VBG abandon

  Attention: Ordinal and cardinal numbers which consist of digits
  (like 1, 13, 1278 or 2. and 75.) should not be included in the
  lexicon. Otherwise, the tagger will not be able to learn how to tag
  numbers which are not listed in the lexicon. Numbers with unusual
  tags should be added to the lexicon, however. If the training
  program reports an error because the POS tag used for numbers is
  unknown, you should add a lexicon entry for one number.
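
  With the Penn Treebank tagset, for instance, a single entry along the
  following lines should be enough (the lemma is an arbitrary choice here):

1	CD 1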

  Remark: The tagger does not actually need the lemmata for tagging. If
  you do not have the lemma information or if you do not plan to
  annotate corpora with lemmas, you can replace the lemma with a dummy
  value, e.g. "-".
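  The first two example entries above would then read:

aback	RB -
abacuses	NNS -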

* <open class file>: name of a file which contains a list of open class tags,
  i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger will use this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example (for the Penn Treebank tagset):

FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ

* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag, in that order, separated by a tab character.
  Punctuation marks are considered tokens and must be tagged as well.
  The file should contain neither empty lines nor untagged SGML markup.
  Example:

Pierre  NP
Vinken  NP
,       ,
61      CD
years   NNS

* <output file>: name of the file in which the resulting tagger parameters
  are stored.
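
For example, a complete training run could look like this (all file names
are placeholders):

  ./bin/train-tree-tagger lexicon.txt open-class-tags.txt train.txt model.par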

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT". It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2, which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it can be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: threshold; if the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
* -sw <weight>: a smoothing parameter which determines how much the
  probability distribution of a decision tree node is smoothed with the
  probability distribution of the parent node.
* -ecw <eq. class weight>: weight of the equivalence-class-based probability
  estimates.
* -atg <affix tree gain>: threshold; if the information gain at a leaf of an
  affix tree is below this threshold, the leaf is deleted. The default is 1.2.

The accuracy of the TreeTagger usually improves if different settings
of the above parameters are tested and the best combination is chosen.
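
A simple way to do this is to train several models in a loop and evaluate
each of them on a held-out test set, e.g. (the parameter values below are
arbitrary illustrations, not recommended settings):

  for cl in 1 2; do
    for dtg in 0.1 0.5 1.0; do
      ./bin/train-tree-tagger -cl $cl -dtg $dtg \
        lexicon.txt open-class-tags.txt train.txt model-cl$cl-dtg$dtg.par
    done
  done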


Caveat: Make sure that the lexicon and the training corpus contain no
extra blanks. If the word form, for instance, is followed by a blank
and a tab character, the blank will be considered part of the word.
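
One possible sanity check for such stray blanks, using standard Unix tools
(the file names are placeholders), is:

  awk -F'\t' '$1 ~ / $/ { print FILENAME ", line " FNR ": blank before tab" }' \
      lexicon.txt corpus.txt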