/*****************************************************************************/
/* How to use the TreeTagger                                                 */
/*****************************************************************************/


The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a hand-tagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.
    
If either of the programs is called without arguments, it will print
information about its usage.


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If neither an input file nor an output file
is specified, the tagger will print to stdout.
    
tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>}
       {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}
    
Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file must be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token. This list has
  to be preceded by a tab character. Each tag is optionally followed by a
  floating point value which specifies the probability of the tag. Adding such
  tag information to the tagger's input is sometimes useful to ensure that
  certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data. (The French and
  English parameter files available by FTP expect clitics to be separated.)
  Sample input file:
    He
    moved
    to
    New York City	NP 1.0
    .
* <output file>: Name of the file to which the tagger should write its output.
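
The input format described above can be illustrated with a short Python
sketch. It writes one token per line and appends optional tag overrides
after a tab character. The function name format_tagger_input and the layout
of its argument are made up for this example and are not part of the
TreeTagger distribution. Separating several tags with single spaces is an
assumption, since the sample input above only shows a single tag.

```python
def format_tagger_input(tokens):
    """Return tree-tagger input text with one token per line.

    Each item is either a plain token string or a (token, tags) pair,
    where tags is a list of (tag, probability) pairs. The probability
    may be None if only the tag should be written.
    """
    lines = []
    for item in tokens:
        if isinstance(item, str):
            lines.append(item)
            continue
        token, tags = item
        parts = []
        for tag, prob in tags:
            parts.append(tag if prob is None else f"{tag} {prob}")
        # The tag list has to be preceded by a tab character.
        # Assumption: several tags are separated by single spaces.
        lines.append(token + "\t" + " ".join(parts))
    return "\n".join(lines) + "\n"


sample = format_tagger_input(
    ["He", "moved", "to", ("New York City", [("NP", 1.0)]), "."]
)
print(sample, end="")
```

The call above reproduces the sample input file, including the token
"New York City" which contains blanks and carries the tag NP with
probability 1.0.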

    
Further optional command line arguments:

* -token: tells the tagger to print the words as well.
* -lemma: tells the tagger to print the lemmas of the words as well.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).
    
The options below are for advanced users. Read the papers on the TreeTagger
to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form has been found is also indicated. 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  A zero frequency occurs if a word/tag pair is contained in the lexicon but
  not in the training corpus. The default is 0.1. The choice of this parameter
  has a minor influence on tagging accuracy.
* -beam <threshold>: If the tagger is slow, this option can be used to speed
  it up. Good values for <threshold> are in the range 0.001-0.00001.
* -base: If this option is specified, only lexical information is used
  for tagging but no contextual information about the preceding tags.
  This option is only useful for obtaining a baseline result with which
  to compare the actual tagger output.
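
As a rough illustration of how the arguments above fit together, the
following Python sketch assembles a tree-tagger command line. The function
build_tag_command and the file names english.par, input.txt and output.txt
are hypothetical. The argument order simply follows the synopsis above, with
the options after the file names. Whether a particular build of the tagger
accepts this order is an assumption, so compare it with the usage message
that the program prints when it is called without arguments.

```python
def build_tag_command(param_file, input_file=None, output_file=None,
                      token=False, lemma=False, sgml=False,
                      eps=None, beam=None, binary="./bin/tree-tagger"):
    """Return the argument list for one tree-tagger call."""
    if output_file is not None and input_file is None:
        # The file arguments are positional, so an output file can only
        # be given together with an input file.
        raise ValueError("an output file requires an input file")
    cmd = [binary, param_file]
    for path in (input_file, output_file):
        if path is not None:
            cmd.append(path)
    for flag, enabled in (("-token", token), ("-lemma", lemma), ("-sgml", sgml)):
        if enabled:
            cmd.append(flag)
    # Fixed-point notation avoids exponent forms such as 1e-05.
    if eps is not None:
        cmd += ["-eps", format(eps, "f")]
    if beam is not None:
        cmd += ["-beam", format(beam, "f")]
    return cmd


# Hypothetical file names, used only for illustration.
cmd = build_tag_command("english.par", "input.txt", "output.txt",
                        token=True, lemma=True)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the
Python standard library once the binary and the parameter file exist.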

    
There is another tagger program called "tree-tagger-flush" which
flushes the output after reading an empty line. It expects a parameter
file as argument and reads from stdin and writes to stdout. No command
line options are supported. This program might be useful for
implementing wrappers.



Training
--------

Training is done with the *train-tree-tagger* program. It expects at least
four command line arguments which are described below.
    
train-tree-tagger <lexicon> <open class file> <input file> <output file>
            {-cl <context length>} {-dtg <min. decision tree gain>}
            {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}
    
Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  itself followed by a tab character and a sequence of tag-lemma pairs.
  The tags and lemmata are separated by whitespace.
  Example:

aback	RB aback
abacuses	NNS abacus
abandon	VB abandon VBP abandon
abandoned	JJ abandoned VBD abandon VBN abandon
abandoning	VBG abandon

  Remark: The tagger does not actually need the lemmata. If you do not have
  the lemma information or if you do not plan to annotate corpora with
  lemmas, you can replace the lemma with a dummy value, e.g. "-".
    
* <open class file>: name of a file which contains a list of open class
  tags, i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger will use this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example: (for Penn Treebank tagset)

FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ
    
* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag, in that order, separated by a tab character.
  Punctuation marks are considered as tokens and must have been tagged as well.
  Example:

Pierre	NP
Vinken	NP
,	,
61	CD
years	NNS
    
* <output file>: name of the file in which the resulting tagger parameters
  are stored.
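
The file formats above can be illustrated with a short Python sketch which
reads tagged training data and writes lexicon lines with the dummy lemma "-"
mentioned in the remark on the lexicon. The function names read_tagged_lines
and build_lexicon are hypothetical. Deriving the lexicon from the training
data is only one possible way to obtain one and is an assumption of this
example, not a requirement of the tagger.

```python
from collections import defaultdict


def read_tagged_lines(lines):
    """Parse one-word-per-line training data into (token, tag) pairs."""
    pairs = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines instead of failing on them
        token, sep, tag = line.partition("\t")
        if not sep or not tag.strip():
            raise ValueError(f"line {number}: expected token<TAB>tag")
        pairs.append((token, tag.strip()))
    return pairs


def build_lexicon(pairs, lemma="-"):
    """Return lexicon lines: word form, a tab, then tag-lemma pairs."""
    tags_by_form = defaultdict(list)
    for token, tag in pairs:
        if tag not in tags_by_form[token]:
            tags_by_form[token].append(tag)
    lines = []
    for form in sorted(tags_by_form):
        entries = " ".join(f"{tag} {lemma}" for tag in tags_by_form[form])
        lines.append(f"{form}\t{entries}")
    return lines


# The training example from above, with tab-separated columns.
training = ["Pierre\tNP\n", "Vinken\tNP\n", ",\t,\n", "61\tCD\n", "years\tNNS\n"]
lexicon = build_lexicon(read_tagged_lines(training))
for entry in lexicon:
    print(entry)
```

Each printed line has the same shape as the lexicon example, except that
every lemma is replaced by "-".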

    
The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT". It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2 which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it could be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: threshold. If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
  The default value is 0.7.
* -ecw <eq. class weight>: weight of the equivalence class based probability
  estimates. The default is 0.15.
* -atg <affix tree gain>: threshold. If the information gain at a leaf of an
  affix tree is below this threshold, the leaf is deleted. The default is 1.2.
    
The accuracy of the TreeTagger can usually be improved slightly by testing
different settings of the above parameters and choosing the best
combination.
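
One way to organise such a comparison is sketched below in Python. The
function training_commands enumerates every combination of the chosen
parameter values and builds one train-tree-tagger command line per
combination. The function name, the file names and the value 0.5 for -dtg
are arbitrary choices for this example. The argument order follows the
synopsis above, which is an assumption about what the program accepts.
Measuring the accuracy of each resulting parameter file on held-out data is
not shown.

```python
from itertools import product


def training_commands(lexicon, open_class, corpus, out_prefix, grid,
                      binary="./bin/train-tree-tagger"):
    """Yield (settings, command) pairs, one per combination in grid."""
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        # Encode the settings in the output file name to tell the runs apart.
        suffix = "_".join(f"{name}{value}" for name, value in settings.items())
        cmd = [binary, lexicon, open_class, corpus, f"{out_prefix}_{suffix}.par"]
        for name, value in settings.items():
            cmd += [f"-{name}", str(value)]
        yield settings, cmd


# Hypothetical file names and an arbitrary two-by-two grid.
grid = {"cl": [1, 2], "dtg": [0.5, 0.7]}
commands = list(training_commands("lexicon.txt", "openclass.txt",
                                  "train.txt", "model", grid))
for settings, cmd in commands:
    print(" ".join(cmd))
```

Each command writes its own parameter file, so the resulting taggers can be
compared afterwards and the best combination kept.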