1 1

2
/***************************************************************************/

3
/* How to use the TreeTagger                                               */

4
/* Author: Helmut Schmid, University of Stuttgart, Germany                 */

5
/***************************************************************************/

2
/*****************************************************************************/

3
/* How to use the TreeTagger                                                 */

4
/*****************************************************************************/

6 5

7 6

8 7
The TreeTagger consists of two programs: train-tree-tagger is used to

......
24 23
input will be read from stdin. If neither an input file nor an output file

25 24
is specified, the tagger will print to stdout.

26 25

27
tree-tagger {-options-} <parameter file> {<input file> {<output file>}}

26
tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>}

27
       {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}

28 28

29 29
Description of the command line arguments:

30 30

31 31
* <parameter file>: Name of a parameter file which was created with the

32 32
  train-tree-tagger program.

33 33
* <input file>: Name of the file which is to be tagged. Each token in this

34
  file has to be on a separate line. Tokens may contain blanks. It is possible

34
  file must be on a separate line. Tokens may contain blanks. It is possible

35 35
  to override the lexical information contained in the parameter file of the

36 36
  tagger by specifying a list of possible tags after a token. This list has

37
  to be preceded by a tab character and the elements are separated by tab

38
  characters. This pretagging feature could be used e.g. to ensure that

37
  to be preceded by a tab character. The tags are optionally followed by a

38
  floating point value to specify the probability of the tag. Adding such

39
  tag information in the tagger's input is sometimes useful to ensure that

39 40
  certain text-specific expressions are tagged properly.

40 41
  Punctuation marks must be on separate lines as well. Clitics (like "'s",

41 42
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be

42 43
  separated if they were separated in the training data. (The French and

43
  English parameter files available by ftp expect separation of clitics).

44
  English parameter files available by ftp, expect separation of clitics).

44 45
  Sample input file:

45 46
    He

46 47
    moved

47 48
    to

48
    New York City	NP

49
    New York City	NP 1.0

49 50
    .

50 51
* <output file>: Name of the file to which the tagger should write its output.

51 52

......
55 56
* -lemma: tells the tagger to print the lemmas of the words also.

56 57
* -sgml: tells the tagger to ignore tokens starting with '<' and ending

57 58
  with '>' (SGML tags).

58
- -no-unknown: If an unknown word is encountered, emit the word form

59
  as lemma. This was previously the default behaviour. Now, the default

60
  behaviour is to print "<unknown>" as lemma.

61
- -threshold <p>: This option tells the tagger to print all tags of a

62
  word with a probability higher than <p> times the largest probability.

63
  (The tagger will use a different algorithm in this case and the set of

64
  best tags might be different from the tags generated without this

65
  option.)

66
- -prob: Print tag probabilities (in combination with option -threshold)

67
- -pt-with-prob: If this option is specified, then each pretagging tag

68
  (see above) has to be followed by a whitespace and a tag probability

69
  value.

70
- -pt-with-lemma: If this option is specified, then each pretagging tag

71
  (see above) has to be followed by a whitespace and a lemma. Lemmas may

72
  contain blanks.

73
  If both -pt-with-prob and -pt-with-lemma have been specified, then each

74
  pretagging tag is followed by a probability and a lemma in that order.
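
The input format described above (one token per line, optional pretagging
tags after a tab character) can be generated with a small helper. The
following is a minimal Python sketch and not part of the TreeTagger
distribution. The function name format_input_line is hypothetical. It
assumes that several pretagging tags are separated by tab characters and
that a probability, if given, follows its tag after a blank, as in the
sample line "New York City<tab>NP 1.0".

```python
def format_input_line(token, tags=()):
    """Return one line of tagger input for a single token.

    tags is a sequence of (tag, probability) pairs; probability may be
    None if only the tag itself should be written.
    """
    if "\t" in token or "\n" in token:
        raise ValueError("tokens may contain blanks, but no tabs or newlines")
    fields = [token]
    for tag, prob in tags:
        # Assumption: a probability follows its tag after a single blank.
        fields.append(tag if prob is None else f"{tag} {prob}")
    # Assumption: the tag list is preceded by a tab and tab-separated.
    return "\t".join(fields)


sentence = [
    format_input_line("He"),
    format_input_line("moved"),
    format_input_line("to"),
    format_input_line("New York City", [("NP", 1.0)]),
    format_input_line("."),
]
print("\n".join(sentence))
```

Whether the probability value is expected depends on the tagger version and
options in use (see -pt-with-prob above), so the second element of each pair
should be set to None when plain tags are required.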

75 59

76
The options below are for advanced users. Please, read the papers on the

77
TreeTagger to fully understand their meaning.

60
The options below are for advanced users. Read the papers on the TreeTagger

61
to fully understand their meaning.

78 62

79 63
* -proto: If this option is specified, the tagger creates a file named

80 64
  "lexicon-protocol.txt", which contains information about the degree of

......
85 69
  hyphen has been found in the fullform lexicon.

86 70
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.

87 71
  This is the case if a word/tag pair is contained in the lexicon but not

88
  in the training corpus. The choice of this parameter has only minor

89
  influence on the tagging accuracy.

72
  in the training corpus. The default is 0.1. The choice of this parameter

73
  has some minor influence on tagging accuracy.

74
* -beam <threshold>: If the tagger is slow, this option can be used to speed it up.

75
  Good values for <threshold> are in the range 0.001-0.00001.

90 76
* -base: If this option is specified, only lexical information is used

91 77
  for tagging but no contextual information about the preceding tags.

92 78
  This option is only useful in order to obtain a baseline result

93 79
  to which to compare the actual tagger output.

94 80

81
There is another tagger program called "tree-tagger-flush" which

82
flushes the output after reading an empty line. It expects a parameter

83
file as argument and reads from stdin and writes to stdout. No command

84
line options are supported. This program might be useful for

85
implementing wrappers.

95 86

96 87

88

97 89
Training

98 90
--------

99 91

100 92
Training is done with the *train-tree-tagger* program. It expects at least

101 93
four command line arguments which are described below.

102 94

103
train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>

95
train-tree-tagger <lexicon> <open class file> <input file> <output file>

96
            {-cl <context length>} {-dtg <min. decision tree gain>}

97
            {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}

104 98

105 99
Description of the command line arguments:

106 100

107 101
* <lexicon>: name of a file which contains the fullform lexicon. Each line

108 102
  of the lexicon corresponds to one word form and contains the word form

109
  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character

110
  and each lemma is preceded by a blank or tab character.

103
  itself followed by a Tab character and a sequence of tag-lemma pairs.

104
  The tags and lemmata are separated by whitespace.

111 105
  Example:

112 106

113 107
aback	RB aback

114 108
abacuses	NNS abacus

115
abandon	VB abandon	VBP abandon

116
abandoned	JJ abandoned	VBD abandon	VBN abandon

109
abandon	VB abandon VBP abandon

110
abandoned	JJ abandoned VBD abandon VBN abandon

117 111
abandoning	VBG abandon

118 112

119
  Attention: Ordinal and cardinal numbers which consist of digits

120
  (like 1, 13, 1278 or 2. and 75.) should not be included in the

121
  lexicon. Otherwise, the tagger will not be able to learn how to tag

122
  numbers which are not listed in the lexicon. Numbers with unusual

123
  tags should be added to the lexicon, however. If the training

124
  program reports an error because the POS tag used for numbers is

125
  unknown, you should add a lexicon entry for one number.

113
  Remark: The tagger doesn't need the lemmata actually. If you do not have

114
  the lemma information or if you do not plan to annotate corpora with

115
  lemmas, you can replace the lemma with a dummy value, e.g. "-".

126 116

127
  Remark: The tagger doesn't need the lemmata for tagging actually. If

128
  you do not have the lemma information or if you do not plan to

129
  annotate corpora with lemmas, you can replace the lemma with a dummy

130
  value, e.g. "-".

131

132
* <open class file>: name of a file which contains a list of open class tags

133
  i.e. possible tags of unknown word forms separated by whitespace.

117
* <open class file>: name of a file which contains a list of open class

118
  tags i.e. possible tags of unknown word forms separated by whitespace.

134 119
  The tagger will use this information when it encounters unknown words,

135 120
  i.e. words which are not contained in the lexicon.

136 121
  Example: (for Penn Treebank tagset)

......
140 125
* <input file>: name of a file which contains tagged training data. The data

141 126
  must be in one-word-per-line format. This means that each line contains

142 127
  one token and one tag in that order separated by a tabulator.

143
  Punctuation marks are considered as tokens and must be tagged as well.

144
  The file should neither contain empty lines nor untagged SGML markup.

128
  Punctuation marks are considered as tokens and must have been tagged as well.

145 129
  Example:

146 130

147 131
Pierre  NP

......
166 150
  this parameter to 1.

167 151
* -dtg <min. decision tree gain>: Threshold - If the information gain at a

168 152
  leaf node of the decision tree is below this threshold, the node is deleted.

169
* -sw <weight>: A smoothing parameter, which determines how much the

170
  probability distribution of some decision tree node is smoothed with the

171
  probability distribution of the parent node.

153
  The default value is 0.7.

172 154
* -ecw <eq. class weight>: weight of the equivalence class based probability

173
  estimates.

155
  estimates. The default is 0.15.

174 156
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an

175 157
  affix tree is below this threshold, it is deleted. The default is 1.2.
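
The lexicon format described above can be checked before training with a
few lines of code. The following Python sketch is only an illustration. The
function name parse_lexicon_line is hypothetical, and it assumes that lemmas
contain no blanks, so that the tag-lemma pairs can be split on any
whitespace. This covers both the tab-separated and the blank-separated
variant of the example entries.

```python
def parse_lexicon_line(line):
    """Split one lexicon line into (word form, [(tag, lemma), ...])."""
    # The word form is everything before the first tab. It is deliberately
    # not stripped: a blank before the tab would count as part of the word.
    word, sep, rest = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError("word form must be followed by a tab character")
    fields = rest.split()
    if not fields or len(fields) % 2 != 0:
        raise ValueError("expected one or more tag-lemma pairs")
    # Even positions hold tags, odd positions hold the matching lemmas.
    return word, list(zip(fields[0::2], fields[1::2]))


for entry in ["aback\tRB aback", "abandon\tVB abandon VBP abandon"]:
    print(parse_lexicon_line(entry))
```

A line that raises ValueError here is not necessarily rejected by
train-tree-tagger; the check only mirrors the format as documented above.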

176 158

177
The accuracy of the TreeTagger usually improves, if different settings

178
of the above parameters are tested and the best combination is chosen.

179

180

181
Caveat: Make sure that the lexicon and the training corpus contain no

182
extra blanks. If the word form, for instance, is followed by a blank

183
and a tab character, the blank will be considered part of the word.

184

159
The accuracy of the TreeTagger is usually slightly improved, if different

160
settings of the above parameters are tested and the best combination is

161
chosen.
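
Testing different parameter settings, as suggested above, amounts to running
train-tree-tagger once per combination. The following Python sketch only
builds the command lines and does not run them. The function name
training_commands, the file names and the parameter values are hypothetical,
and the options are placed after the four file arguments, which follows one
of the two synopsis variants shown above. Measuring the accuracy of each
resulting parameter file on held-out data is not shown.

```python
import itertools


def training_commands(lexicon, open_class, corpus, out_prefix, grid):
    """Yield one train-tree-tagger argument list per parameter combination.

    grid maps an option name such as "-cl" to the list of values to try.
    """
    flags = sorted(grid)
    for values in itertools.product(*(grid[flag] for flag in flags)):
        # Encode the setting in the output file name, e.g. model.cl2_dtg0.5.par
        suffix = "_".join(f"{flag.lstrip('-')}{value}"
                          for flag, value in zip(flags, values))
        cmd = ["train-tree-tagger", lexicon, open_class, corpus,
               f"{out_prefix}.{suffix}.par"]
        for flag, value in zip(flags, values):
            cmd += [flag, str(value)]
        yield cmd


# Illustrative values only; they are not recommendations.
grid = {"-cl": [2, 3], "-dtg": [0.5, 0.7]}
commands = list(training_commands("lexicon.txt", "open-class.txt",
                                  "train.txt", "model", grid))
for cmd in commands:
    print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run once the program is
installed and the input files exist.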

