Revision 2854 tmp/org.txm.treetagger.core.macosx/res/macosx/README

1 1

  
2
/***************************************************************************/
3
/* How to use the TreeTagger                                               */
4
/* Author: Helmut Schmid, University of Stuttgart, Germany                 */
5
/***************************************************************************/
2
/*****************************************************************************/
3
/* How to use the TreeTagger                                                 */
4
/*****************************************************************************/
6 5

  
7 6

  
8 7
The TreeTagger consists of two programs: train-tree-tagger is used to 
......
24 23
input will be read from stdin. If neither an input file nor an output file
25 24
is specified, the tagger will print to stdout.
26 25

  
27
tree-tagger {-options-} <parameter file> {<input file> {<output file>}}
26
tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>}
27
       {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}
28 28

  
29 29
Description of the command line arguments:
30 30

  
31 31
* <parameter file>: Name of a parameter file which was created with the 
32 32
  train-tree-tagger program.
33 33
* <input file>: Name of the file which is to be tagged. Each token in this 
34
  file has to be on a separate line. Tokens may contain blanks. It is possible
34
  file must be on a separate line. Tokens may contain blanks. It is possible
35 35
  to override the lexical information contained in the parameter file of the
36 36
  tagger by specifying a list of possible tags after a token. This list has
37
  to be preceded by a tab character and the elements are separated by tab 
38
  characters. This pretagging feature could be used e.g. to ensure that
37
  to be preceded by a tab character. The tags are optionally followed by a
38
  floating point value to specify the probability of the tag. Adding such
39
  tag information in the tagger's input is sometimes useful to ensure that
39 40
  certain text-specific expressions are tagged properly.
40 41
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
41 42
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
42 43
  separated if they were separated in the training data. (The French and
43
  English parameter files available by ftp expect separation of clitics).
44
  English parameter files available by ftp, expect separation of clitics).
44 45
  Sample input file:
45 46
    He
46 47
    moved
47 48
    to
48
    New York City	NP
49
    New York City	NP 1.0
49 50
    .
50 51
* <output file>: Name of the file to which the tagger should write its output.
51 52

  
......
55 56
* -lemma: tells the tagger to print the lemmas of the words also.
56 57
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
57 58
  with '>' (SGML tags).
58
- -no-unknown: If an unknown word is encountered, emit the word form
59
  as lemma. This was previously the default behaviour. Now, the default 
60
  behaviour is to print "<unknown>" as lemma.
61
- -threshold <p>: This option tells the tagger to print all tags of a
62
  word with a probability higher than <p> times the largest probability.
63
  (The tagger will use a different algorithm in this case and the set of
64
  best tags might be different from the tags generated without this
65
  option.)
66
- -prob: Print tag probabilities (in combination with option -threshold)
67
- -pt-with-prob: If this option is specified, then each pretagging tag
68
  (see above) has to be followed by a whitespace and a tag probability 
69
  value.
70
- -pt-with-lemma: If this option is specified, then each pretagging tag
71
  (see above) has to be followed by a whitespace and a lemma. Lemmas may 
72
  contain blanks.
73
  If both -pt-with-prob and -pt-with-lemma have been specified, then each
74
  pretagging tag is followed by a probability and a lemma in that order.
75 59

  
76
The options below are for advanced users. Please, read the papers on the 
77
TreeTagger to fully understand their meaning.
60
The options below are for advanced users. Read the papers on the TreeTagger
61
to fully understand their meaning.
78 62

  
79 63
* -proto: If this option is specified, the tagger creates a file named
80 64
  "lexicon-protocol.txt", which contains information about the degree of
......
85 69
  hyphen has been found in the fullform lexicon.
86 70
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
87 71
  This is the case if a word/tag pair is contained in the lexicon but not
88
  in the training corpus. The choice of this parameter has only minor
89
  influence on the tagging accuracy.
72
  in the training corpus. The default is 0.1. The choice of this parameter
73
  has some minor influence on tagging accuracy.
74
* -beam <threshold>: If the tagger is slow, this option can be used to speed it up.
75
  Good values for <threshold> are in the range 0.001-0.00001.
90 76
* -base: If this option is specified, only lexical information is used
91 77
  for tagging but no contextual information about the preceding tags.
92 78
  This option is only useful in order to obtain a baseline result
93 79
  to which to compare the actual tagger output.
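
The selection rule described for -threshold <p> can be illustrated with a
small Python sketch. It only mimics the rule "print all tags with a
probability higher than <p> times the largest probability"; it is not the
tagger's own algorithm, which (as noted for that option) differs when
-threshold is used. The function name and the probabilities are made up for
illustration.

```python
def tags_above_threshold(tag_probs, p):
    """Return (tag, prob) pairs whose probability is higher than p times the best one."""
    if not tag_probs:
        return []
    best = max(tag_probs.values())
    kept = [(tag, prob) for tag, prob in tag_probs.items() if prob > p * best]
    # Most probable tag first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical probabilities for one word form (not real tagger output).
example = {"NN": 0.6, "VB": 0.3, "JJ": 0.1}
print(tags_above_threshold(example, 0.4))
```

With p = 0.4 the cut-off is 0.4 times 0.6, so NN and VB are kept and JJ is
dropped.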
94 80

  
81
There is another tagger program called "tree-tagger-flush" which
82
flushes the output after reading an empty line. It expects a parameter
83
file as argument and reads from stdin and writes to stdout. No command
84
line options are supported. This program might be useful for
85
implementing wrappers.
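
A wrapper has to produce the one-token-per-line input described for
<input file>. The Python sketch below is one possible way to do that,
assuming the layout of the sample input file: the token, then a tab before
each pretagging tag, and optionally a blank and a probability after the tag.
The helper names are hypothetical, and rejecting tabs inside tokens is an
assumption made here because the tab separates the token from its tags.

```python
def format_token_line(token, pretags=()):
    """Format one input line: the token, then each pretag preceded by a tab.

    Each pretag is either a tag string or a (tag, probability) pair.
    """
    # Assumption: a tab inside the token would be read as a tag separator.
    if "\t" in token or "\n" in token:
        raise ValueError("token must not contain tab or newline characters")
    fields = [token]
    for pretag in pretags:
        if isinstance(pretag, str):
            fields.append(pretag)
        else:
            tag, prob = pretag
            fields.append(f"{tag} {prob}")
    return "\t".join(fields)


def format_input(tokens):
    """Join (token, pretags) pairs into one-token-per-line text."""
    return "".join(format_token_line(tok, tags) + "\n" for tok, tags in tokens)


sentence = [
    ("He", ()),
    ("moved", ()),
    ("to", ()),
    ("New York City", [("NP", 1.0)]),
    (".", ()),
]
print(format_input(sentence), end="")
```

The resulting text can be written to a file and passed as <input file>, or
sent to the tagger on stdin.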
95 86

  
96 87

  
88

  
97 89
Training
98 90
--------
99 91

  
100 92
Training is done with the *train-tree-tagger* program. It expects at least
101 93
four command line arguments which are described below.
102 94

  
103
train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>
95
train-tree-tagger <lexicon> <open class file> <input file> <output file> 
96
            {-cl <context length>} {-dtg <min. decision tree gain>}
97
            {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}
104 98

  
105 99
Description of the command line arguments:
106 100

  
107 101
* <lexicon>: name of a file which contains the fullform lexicon. Each line 
108 102
  of the lexicon corresponds to one word form and contains the word form 
109
  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
110
  and each lemma is preceded by a blank or tab character.
103
  itself followed by a Tab character and a sequence of tag-lemma pairs.
104
  The tags and lemmata are separated by whitespace.
111 105
  Example:
112 106

  
113 107
aback	RB aback
114 108
abacuses	NNS abacus
115
abandon	VB abandon	VBP abandon
116
abandoned	JJ abandoned	VBD abandon	VBN abandon
109
abandon	VB abandon VBP abandon
110
abandoned	JJ abandoned VBD abandon VBN abandon
117 111
abandoning	VBG abandon
118 112

  
119
  Attention: Ordinal and cardinal numbers which consist of digits
120
  (like 1, 13, 1278 or 2. and 75.) should not be included in the
121
  lexicon. Otherwise, the tagger will not be able to learn how to tag
122
  numbers which are not listed in the lexicon. Numbers with unusual
123
  tags should be added to the lexicon, however. If the training
124
  program reports an error because the POS tag used for numbers is
125
  unknown, you should add a lexicon entry for one number.
113
  Remark: The tagger doesn't need the lemmata actually. If you do not have
114
  the lemma information or if you do not plan to annotate corpora with
115
  lemmas, you can replace the lemma with a dummy value, e.g. "-".
126 116

  
127
  Remark: The tagger doesn't need the lemmata for tagging actually. If
128
  you do not have the lemma information or if you do not plan to
129
  annotate corpora with lemmas, you can replace the lemma with a dummy
130
  value, e.g. "-".
131

  
132
* <open class file>: name of a file which contains a list of open class tags
133
  i.e. possible tags of unknown word forms separated by whitespace.
117
* <open class file>: name of a file which contains a list of open class
118
  tags i.e. possible tags of unknown word forms separated by whitespace.
134 119
  The tagger will use this information when it encounters unknown words,
135 120
  i.e. words which are not contained in the lexicon.
136 121
  Example: (for Penn Treebank tagset)
......
140 125
* <input file>: name of a file which contains tagged training data. The data
141 126
  must be in one-word-per-line format. This means that each line contains 
142 127
  one token and one tag in that order separated by a tabulator. 
143
  Punctuation marks are considered as tokens and must be tagged as well.
144
  The file should neither contain empty lines nor untagged SGML markup.
128
  Punctuation marks are considered as tokens and must have been tagged as well.
145 129
  Example:
146 130

  
147 131
Pierre  NP
......
166 150
  this parameter to 1.
167 151
* -dtg <min. decision tree gain>: Threshold - If the information gain at a 
168 152
  leaf node of the decision tree is below this threshold, the node is deleted.
169
* -sw <weight>: A smoothing parameter, which determines how much the
170
  probability distribution of some decision tree node is smoothed with the
171
  probability distribution of the parent node.
153
  The default value is 0.7.
172 154
* -ecw <eq. class weight>: weight of the equivalence class based probability
173
  estimates.
155
  estimates. The default is 0.15.
174 156
* -atg <affix tree gain> Threshold - If the information gain at a leaf of an
175 157
  affix tree is below this threshold, it is deleted. The default is 1.2.
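
The <lexicon> format described above can be generated with a few lines of
Python. This is a minimal sketch, assuming the layout of the example entries:
the word form, then a tab before each tag-lemma pair, with a blank between tag
and lemma. The function name is hypothetical, and requiring at least one pair
per word form is an assumption of this sketch.

```python
def format_lexicon_line(word_form, readings):
    """Build one lexicon line: word form, then a tab before each "tag lemma" pair."""
    # Assumption: an entry without any tag-lemma pair is treated as an error.
    if not readings:
        raise ValueError("each word form needs at least one tag-lemma pair")
    pairs = [f"{tag} {lemma}" for tag, lemma in readings]
    return "\t".join([word_form] + pairs)


print(format_lexicon_line(
    "abandoned", [("JJ", "abandoned"), ("VBD", "abandon"), ("VBN", "abandon")]))
# Dummy lemma "-" for a lexicon without lemma information.
print(format_lexicon_line("aback", [("RB", "-")]))
```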
176 158

  
177
The accuracy of the TreeTagger usually improves, if different settings
178
of the above parameters are tested and the best combination is chosen.
179

  
180

  
181
Caveat: Make sure that the lexicon and the training corpus contain no
182
extra blanks. If the word form, for instance, is followed by a blank
183
and a tab character, the blank will be considered part of the word.
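
A simple check for the problem named in this caveat might look like the
Python sketch below. It reports lines in which a blank directly precedes a
tab, which is the case described above. Flagging a blank after a tab and
blanks at the start or end of a line is an additional assumption of this
sketch; the function name is hypothetical.

```python
def find_extra_blanks(lines):
    """Return 1-based numbers of lines with a blank next to a tab or at either end."""
    suspicious = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if (" \t" in text or "\t " in text
                or text.startswith(" ") or text.endswith(" ")):
            suspicious.append(number)
    return suspicious


# The second line has a blank between the word form and the tab.
sample = ["Pierre\tNP\n", "Vinken \tNP\n", ",\t,\n"]
print(find_extra_blanks(sample))
```

Blanks inside a token, as in "New York City", are not reported.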
184

  
159
The accuracy of the TreeTagger is usually slightly improved, if different
160
settings of the above parameters are tested and the best combination is
161
chosen.
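
Testing different settings can be organised as a small grid search. The
Python sketch below only builds the train-tree-tagger command lines for every
combination of option values; running them and measuring the accuracy of each
resulting parameter file is left out. It follows the form of the synopsis in
which the options precede the four file arguments (the other revision shown
here lists them after the files). The output file names and the option values
other than the documented defaults are assumptions.

```python
import itertools


def training_commands(lexicon, open_class_file, input_file, grid):
    """Yield (output file, argv list) for every combination of option values."""
    names = sorted(grid)
    combinations = itertools.product(*(grid[name] for name in names))
    for index, values in enumerate(combinations):
        output_file = f"model-{index}.par"  # hypothetical naming scheme
        options = []
        for name, value in zip(names, values):
            options += [name, str(value)]
        argv = ["train-tree-tagger", *options,
                lexicon, open_class_file, input_file, output_file]
        yield output_file, argv


# Example values only; 0.7, 0.15 and 1.2 are the documented defaults.
grid = {"-cl": [2, 3], "-dtg": [0.5, 0.7], "-ecw": [0.15], "-atg": [1.2]}
for output_file, argv in training_commands("lexicon.txt", "open-class.txt",
                                           "train.txt", grid):
    print(" ".join(argv))
```

Each argv list can be passed to a process launcher as is; the best
combination is then the one whose parameter file gives the highest accuracy
on held-out data.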
