/*****************************************************************************/
/* How to use the TreeTagger                                                 */
/*****************************************************************************/


The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a hand-tagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.

If either of the programs is called without arguments, it will print
information about its usage.


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input is read from stdin. If no output file is specified, the tagger
prints to stdout.

tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>}
       {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file must be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token. This list has
  to be preceded by a tab character. Each tag is optionally followed by a
  floating point value specifying the probability of the tag. Adding such
  tag information to the tagger's input is sometimes useful to ensure that
  certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data. (The French and
  English parameter files available by ftp expect separation of clitics.)
  Sample input file:
    He
    moved
    to
    New York City	NP 1.0
    .
* <output file>: Name of the file to which the tagger should write its output.
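
The input format described above can be produced with a few lines of
Python. This is only a sketch: the helper names are made up for this
example, and separating several override tags by single blanks is an
assumption (the sample above shows one tag only).

```python
from typing import Iterable, Optional, Sequence, Tuple, Union

# A tag override is (tag, probability); the probability may be None.
TagOverride = Tuple[str, Optional[float]]
# A token is either a plain string or (token, list of tag overrides).
Token = Union[str, Tuple[str, Sequence[TagOverride]]]


def format_input_line(token: Token) -> str:
    """Return one line of tagger input (without the trailing newline)."""
    if isinstance(token, str):
        return token
    word, overrides = token
    parts = []
    for tag, prob in overrides:
        parts.append(tag if prob is None else f"{tag} {prob}")
    # The tag list must be preceded by a tab character.
    # Assumption: several tags are separated by single blanks.
    return word + "\t" + " ".join(parts)


def format_input(tokens: Iterable[Token]) -> str:
    """Return the whole input text, one token per line."""
    return "".join(format_input_line(t) + "\n" for t in tokens)


sample = format_input(
    ["He", "moved", "to", ("New York City", [("NP", 1.0)]), "."]
)
print(sample, end="")
```

The printed text reproduces the sample input file shown above.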

Further optional command line arguments:

* -token: tells the tagger to also print the words.
* -lemma: tells the tagger to also print the lemmas of the words.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).
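
A wrapper will usually want to read the tagger output back in. The
sketch below assumes that, with -token and -lemma, each output line holds
token, tag and lemma separated by tab characters; this README does not
specify the output layout, so check it against your tagger's actual
output. Lines with fewer columns (for example an SGML tag, assuming the
tagger copies it to the output) get None for the missing fields. All
names are made up for this example.

```python
from typing import List, NamedTuple, Optional


class TaggedToken(NamedTuple):
    token: str
    tag: Optional[str] = None
    lemma: Optional[str] = None


def parse_output_line(line: str) -> TaggedToken:
    """Split one output line into token, tag and lemma."""
    # Assumption: columns are separated by tab characters.
    fields = line.rstrip("\n").split("\t")
    return TaggedToken(*fields[:3])


def parse_output(text: str) -> List[TaggedToken]:
    """Parse all non-empty lines of the tagger output."""
    return [parse_output_line(line) for line in text.splitlines() if line]


# Hypothetical output, written by hand for illustration only.
example = "<s>\nNew York City\tNP\tNew York City\n"
parsed = parse_output(example)
```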

The options below are for advanced users. Read the papers on the TreeTagger
to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form has been found is also indicated. 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  A zero frequency occurs if a word/tag pair is contained in the lexicon
  but not in the training corpus. The default is 0.1. The choice of this
  parameter has a minor influence on tagging accuracy.
* -beam <threshold>: If the tagger is slow, this option can be used to
  speed it up. Good values for <threshold> are between 0.00001 and 0.001.
* -base: If this option is specified, only lexical information is used
  for tagging but no contextual information about the preceding tags.
  This option is only useful in order to obtain a baseline result
  to which to compare the actual tagger output.
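
When the tagger is called from a script, the argument list can be
assembled as sketched below. The function name, the default executable
path and the file names are assumptions made for this example. The
arguments are emitted in the order of the usage line above, i.e. file
names first and options last. The command is only built here, not run;
it could be passed to subprocess.run().

```python
from typing import List, Optional


def build_tagger_command(
    parameter_file: str,
    input_file: Optional[str] = None,
    output_file: Optional[str] = None,
    *,
    token: bool = False,
    lemma: bool = False,
    sgml: bool = False,
    proto: bool = False,
    base: bool = False,
    eps: Optional[float] = None,
    beam: Optional[float] = None,
    executable: str = "./bin/tree-tagger",  # assumed location
) -> List[str]:
    """Build an argument list following the usage line of tree-tagger."""
    if output_file is not None and input_file is None:
        # The file names are positional, so an output file needs an input file.
        raise ValueError("an output file requires an input file")
    cmd = [executable, parameter_file]
    if input_file is not None:
        cmd.append(input_file)
    if output_file is not None:
        cmd.append(output_file)
    switches = (("-token", token), ("-lemma", lemma), ("-sgml", sgml),
                ("-proto", proto), ("-base", base))
    for flag, enabled in switches:
        if enabled:
            cmd.append(flag)
    if eps is not None:
        cmd += ["-eps", str(eps)]
    if beam is not None:
        cmd += ["-beam", str(beam)]
    return cmd


# "english.par" and "input.txt" are hypothetical file names.
cmd = build_tagger_command("english.par", "input.txt",
                           token=True, lemma=True, beam=0.001)
print(" ".join(cmd))
```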

There is another tagger program called "tree-tagger-flush" which
flushes the output after reading an empty line. It expects a parameter
file as argument, reads from stdin and writes to stdout. No command
line options are supported. This program might be useful for
implementing wrappers.


Training
--------

Training is done with the train-tree-tagger program. It expects at least
four command line arguments, which are described below.

train-tree-tagger <lexicon> <open class file> <input file> <output file>
            {-cl <context length>} {-dtg <min. decision tree gain>}
            {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  itself followed by a Tab character and a sequence of tag-lemma pairs.
  The tags and lemmata are separated by whitespace.
  Example:

aback	RB aback
abacuses	NNS abacus
abandon	VB abandon VBP abandon
abandoned	JJ abandoned VBD abandon VBN abandon
abandoning	VBG abandon

  Remark: The tagger does not actually need the lemmata. If you do not have
  the lemma information or if you do not plan to annotate corpora with
  lemmas, you can replace the lemma with a dummy value, e.g. "-".

* <open class file>: name of a file which contains a list of open class
  tags, i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger will use this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example: (for Penn Treebank tagset)

FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ

* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag, in that order, separated by a tab character.
  Punctuation marks are considered tokens and must have been tagged as well.
  Example:

Pierre	NP
Vinken	NP
,	,
61	CD
years	NNS

* <output file>: name of the file in which the resulting tagger parameters
  are stored.
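
The training data and lexicon formats described above can be generated
from a list of (token, tag) pairs as sketched below. The function names
are made up for this example. The lexicon uses the dummy lemma "-"
mentioned in the remark on the lexicon; sorting the entries is done for
readability only and is not a requirement stated in this README.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TaggedToken = Tuple[str, str]  # (token, tag)


def training_lines(tagged: Iterable[TaggedToken]) -> List[str]:
    """One token and one tag per line, separated by a tab character."""
    return [f"{token}\t{tag}" for token, tag in tagged]


def lexicon_lines(tagged: Iterable[TaggedToken], lemma: str = "-") -> List[str]:
    """Word form, a tab, then tag-lemma pairs separated by blanks."""
    tags_by_word: Dict[str, List[str]] = defaultdict(list)
    for token, tag in tagged:
        if tag not in tags_by_word[token]:
            tags_by_word[token].append(tag)
    lines = []
    for word in sorted(tags_by_word):  # sorted for readability only
        pairs = " ".join(f"{tag} {lemma}" for tag in sorted(tags_by_word[word]))
        lines.append(f"{word}\t{pairs}")
    return lines


corpus = [("Pierre", "NP"), ("Vinken", "NP"), (",", ","),
          ("61", "CD"), ("years", "NNS")]
train = training_lines(corpus)
lexicon = lexicon_lines(corpus)
```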

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT". It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2, which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it could be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: Threshold - If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
  The default value is 0.7.
* -ecw <eq. class weight>: weight of the equivalence class based probability
  estimates. The default is 0.15.
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an
  affix tree is below this threshold, it is deleted. The default is 1.2.

The accuracy of the TreeTagger can usually be improved slightly by testing
different settings of the above parameters and choosing the best
combination.
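
Trying several parameter settings can be scripted. The sketch below only
generates one train-tree-tagger command per combination; running the
commands and measuring accuracy on held-out tagged data is left out. The
function names, file names, executable path and the non-default values
in the grid are assumptions made for this example.

```python
import itertools
from typing import Dict, Iterator, List, Optional, Sequence


def build_train_command(
    lexicon: str,
    open_class_file: str,
    input_file: str,
    output_file: str,
    *,
    cl: Optional[int] = None,
    dtg: Optional[float] = None,
    ecw: Optional[float] = None,
    atg: Optional[float] = None,
    st: Optional[str] = None,
    executable: str = "./bin/train-tree-tagger",  # assumed location
) -> List[str]:
    """Build an argument list following the usage line of train-tree-tagger."""
    cmd = [executable, lexicon, open_class_file, input_file, output_file]
    options = (("-cl", cl), ("-dtg", dtg), ("-ecw", ecw),
               ("-atg", atg), ("-st", st))
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def parameter_grid(**choices: Sequence) -> Iterator[Dict[str, object]]:
    """Yield every combination of the given parameter values."""
    names = list(choices)
    for values in itertools.product(*(choices[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical grid: the defaults plus one alternative for -cl and -dtg.
grid = list(parameter_grid(cl=[1, 2], dtg=[0.5, 0.7], ecw=[0.15]))
commands = [
    build_train_command("lexicon.txt", "open-class.txt", "train.txt",
                        f"model-{i}.par", **params)
    for i, params in enumerate(grid)
]
```

Each command writes its own parameter file, so the resulting taggers can
be compared on the same test data afterwards.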