/***************************************************************************/
/* How to use the TreeTagger                                               */
/* Author: Helmut Schmid, University of Stuttgart, Germany                 */
/***************************************************************************/


The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a hand-tagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.

If either of the programs is called without arguments, it will print
information about its usage.


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If no output file is specified, the tagger
will print to stdout.

tree-tagger {-options-} <parameter file> {<input file> {<output file>}}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file has to be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token. This list has
  to be preceded by a tab character and the elements are separated by tab
  characters. This pretagging feature could be used e.g. to ensure that
  certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data. (The French and
  English parameter files available by ftp expect separation of clitics.)
  Sample input file:
    He
    moved
    to
    New York City	NP
    .
* <output file>: Name of the file to which the tagger should write its output.
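
The input format described above can be produced with a few lines of
Python. This is only a sketch: the function name format_tagger_input and
the way tokens are passed in are assumptions made for illustration, not
part of the TreeTagger distribution.

```python
def format_tagger_input(tokens):
    """Return tagger input text: one token per line, optional pretags.

    Each item is either a plain token string or a (token, tags) pair,
    where tags is a list of possible tags for that token.
    """
    lines = []
    for item in tokens:
        if isinstance(item, str):
            token, tags = item, []
        else:
            token, tags = item
        # The tag list is preceded by a tab and its elements are
        # separated by tabs as well.
        lines.append("\t".join([token, *tags]))
    return "\n".join(lines) + "\n"


sample = format_tagger_input(
    ["He", "moved", "to", ("New York City", ["NP"]), "."]
)
```

Here, sample holds the same five lines as the sample input file shown
above and could be written to a file or piped to the tagger on stdin.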

Further optional command line arguments:

* -token: tells the tagger to print the words also.
* -lemma: tells the tagger to print the lemmas of the words also.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).
* -no-unknown: If an unknown word is encountered, emit the word form
  as lemma. This was previously the default behaviour. Now, the default
  behaviour is to print "<unknown>" as lemma.
* -threshold <p>: This option tells the tagger to print all tags of a
  word with a probability higher than <p> times the largest probability.
  (The tagger will use a different algorithm in this case and the set of
  best tags might be different from the tags generated without this
  option.)
* -prob: Print tag probabilities (in combination with option -threshold).
* -pt-with-prob: If this option is specified, then each pretagging tag
  (see above) has to be followed by a whitespace and a tag probability
  value.
* -pt-with-lemma: If this option is specified, then each pretagging tag
  (see above) has to be followed by a whitespace and a lemma. Lemmas may
  contain blanks.
  If both -pt-with-prob and -pt-with-lemma have been specified, then each
  pretagging tag is followed by a probability and a lemma in that order.
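
Two of the options above can be illustrated with a short Python sketch.
The first function mirrors the selection rule stated for -threshold (it
does not reproduce the tagger's internal algorithm); the second formats
a pretagged input line for -pt-with-prob and -pt-with-lemma. The function
names, the example probabilities and the use of a single blank as the
whitespace separator are assumptions.

```python
def tags_above_threshold(tag_probs, p):
    """Keep tags whose probability exceeds p times the largest one."""
    best = max(tag_probs.values())
    return {tag: prob for tag, prob in tag_probs.items() if prob > p * best}


def format_pretagged_line(token, entries, with_prob=False, with_lemma=False):
    """Format one input line with pretagging information.

    entries is a list of (tag, prob, lemma) tuples. The probability and
    the lemma are written only if the matching flag is set, in that order.
    """
    fields = [token]
    for tag, prob, lemma in entries:
        parts = [tag]
        if with_prob:
            parts.append(str(prob))
        if with_lemma:
            parts.append(lemma)
        fields.append(" ".join(parts))
    # Each pretagging entry is preceded by a tab character.
    return "\t".join(fields)


kept = tags_above_threshold({"NN": 0.6, "VB": 0.3, "JJ": 0.1}, p=0.4)
line = format_pretagged_line(
    "New York City", [("NP", 1.0, "New York City")],
    with_prob=True, with_lemma=True,
)
```

With p = 0.4 and a largest probability of 0.6, only tags above 0.24 are
kept, so JJ is dropped in this made-up example.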

The options below are for advanced users. Please read the papers on the
TreeTagger to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form has been found is also indicated. 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  This is the case if a word/tag pair is contained in the lexicon but not
  in the training corpus. The choice of this parameter has only minor
  influence on the tagging accuracy.
* -base: If this option is specified, only lexical information is used
  for tagging but no contextual information about the preceding tags.
  This option is only useful in order to obtain a baseline result
  to which to compare the actual tagger output.
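
The idea behind -eps and -base can be sketched as follows. This is an
illustration of the concepts only and makes no claim about how the
TreeTagger computes its probabilities internally: the function names, the
normalisation step and the epsilon value 0.1 are all assumptions.

```python
def lexical_probs(tag_counts, eps=0.1):
    """Turn the tag counts of one word form into probabilities.

    Zero counts (tag listed in the lexicon but never seen with this word
    in the training corpus) are replaced by eps before normalising.
    """
    smoothed = {
        tag: (count if count > 0 else eps)
        for tag, count in tag_counts.items()
    }
    total = sum(smoothed.values())
    return {tag: value / total for tag, value in smoothed.items()}


def baseline_tag(tag_counts, eps=0.1):
    """Pick a tag from lexical information only, ignoring context."""
    probs = lexical_probs(tag_counts, eps)
    return max(probs, key=probs.get)


counts = {"VBN": 3, "JJ": 1, "VBD": 0}  # made-up counts for one word form
probs = lexical_probs(counts)
choice = baseline_tag(counts)
```

In this sketch the unseen tag VBD keeps a small non-zero probability, and
the baseline simply returns the most frequent tag of the word form.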



Training
--------

Training is done with the train-tree-tagger program. It expects at least
four command line arguments which are described below.

train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
  and each lemma is preceded by a blank or tab character.
  Example:

aback	RB aback
abacuses	NNS abacus
abandon	VB abandon	VBP abandon
abandoned	JJ abandoned	VBD abandon	VBN abandon
abandoning	VBG abandon

  Attention: Ordinal and cardinal numbers which consist of digits
  (like 1, 13, 1278 or 2. and 75.) should not be included in the
  lexicon. Otherwise, the tagger will not be able to learn how to tag
  numbers which are not listed in the lexicon. Numbers with unusual
  tags should be added to the lexicon, however. If the training
  program reports an error because the POS tag used for numbers is
  unknown, you should add a lexicon entry for one number.

  Remark: The tagger does not actually need the lemmas for tagging. If
  you do not have the lemma information or if you do not plan to
  annotate corpora with lemmas, you can replace the lemma with a dummy
  value, e.g. "-".

* <open class file>: name of a file which contains a list of open class tags,
  i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger will use this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example: (for Penn Treebank tagset)

FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ

* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag in that order separated by a tab character.
  Punctuation marks are considered as tokens and must be tagged as well.
  The file should contain neither empty lines nor untagged SGML markup.
  Example:

Pierre	NP
Vinken	NP
,	,
61	CD
years	NNS

* <output file>: name of the file in which the resulting tagger parameters
  are stored.
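
The lexicon and training data formats described above can be generated
with a small Python sketch. All function names are hypothetical, and the
entries are sorted only to make the output reproducible; the text above
does not state that the lexicon has to be sorted.

```python
import re

# Cardinal or ordinal numbers written with digits only, e.g. "13" or "75.".
DIGIT_NUMBER = re.compile(r"^\d+\.?$")


def lexicon_line(word, analyses):
    """Format one lexicon line from a word form and (tag, lemma) pairs.

    A missing lemma (None) is written as the dummy value "-".
    """
    pairs = [
        f"{tag} {lemma if lemma is not None else '-'}"
        for tag, lemma in analyses
    ]
    # Each tag is preceded by a tab, each lemma by a blank.
    return "\t".join([word, *pairs])


def build_lexicon(entries):
    """Return lexicon lines, leaving out numbers that consist of digits."""
    return [
        lexicon_line(word, analyses)
        for word, analyses in sorted(entries.items())
        if not DIGIT_NUMBER.match(word)
    ]


def training_line(token, tag):
    """Format one line of training data: token, tab character, tag."""
    return f"{token}\t{tag}"


lexicon = build_lexicon({
    "abandon": [("VB", "abandon"), ("VBP", "abandon")],
    "13": [("CD", "13")],
    "aback": [("RB", None)],
})
```

In this example the entry for "13" is skipped and the entry for "aback"
gets the dummy lemma "-".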

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT". It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2 which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it could be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: Threshold - If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
* -sw <weight>: A smoothing parameter, which determines how much the
  probability distribution of some decision tree node is smoothed with the
  probability distribution of the parent node.
* -ecw <eq. class weight>: weight of the equivalence class based probability
  estimates.
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an
  affix tree is below this threshold, it is deleted. The default is 1.2.

The accuracy of the TreeTagger usually improves if different settings
of the above parameters are tested and the best combination is chosen.
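
One possible way to test different settings is to generate one training
command per combination of option values. The sketch below only builds
the argument lists; the file names, the candidate values and the function
name are assumptions, and how the resulting parameter files are evaluated
is left open.

```python
from itertools import product


def training_commands(lexicon, open_class, corpus, grid):
    """Yield (output file, argv) pairs, one per combination of settings.

    grid maps an option name such as "-cl" to a list of candidate values.
    """
    options = sorted(grid)
    for values in product(*(grid[opt] for opt in options)):
        # Derive a distinct parameter file name from the chosen settings.
        suffix = "_".join(
            f"{opt.lstrip('-')}{val}" for opt, val in zip(options, values)
        )
        output = f"model_{suffix}.par"
        argv = ["train-tree-tagger"]
        for opt, val in zip(options, values):
            argv += [opt, str(val)]
        # Options come first, followed by the four required arguments.
        argv += [lexicon, open_class, corpus, output]
        yield output, argv


runs = list(training_commands(
    "lexicon.txt", "open-class.txt", "train.txt",
    {"-cl": [1, 2], "-atg": [0.5, 1.2]},
))
```

Each argv list could then be passed to subprocess.run, and the resulting
parameter files compared on tagged data that was not used for training.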


Caveat: Make sure that the lexicon and the training corpus contain no
extra blanks. If the word form, for instance, is followed by a blank
and a tab character, the blank will be considered part of the word.
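
A simple check for such extra blanks could look like the following
Python sketch. The function name is hypothetical. Only blanks at the
start or end of a tab-separated field are reported, because tokens and
lemmas may legitimately contain blanks in the middle.

```python
def find_extra_blanks(lines):
    """Return (line number, field) pairs for fields with stray blanks.

    A field is reported if it starts or ends with a blank, e.g. a word
    form that is followed by a blank and then a tab character.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        for field in line.rstrip("\n").split("\t"):
            if field != field.strip(" "):
                problems.append((number, field))
    return problems


issues = find_extra_blanks([
    "aback\tRB aback\n",
    "abandon \tVB abandon\n",  # blank before the tab
])
```

Here, issues contains one entry for line 2, where the word form would
otherwise be read as "abandon " with a trailing blank.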