Revision 2854

tmp/org.txm.treetagger.core.macosx/res/macosx/FILES (revision 2854)
1 1

  
2 2
This package contains the TreeTagger, a probabilistic part-of-speech
3
tagger developed by Helmut Schmid. All rights are reserved by the 
4
Institute for Computational Linguistics at the University of Stuttgart.
5
The programs have been compiled for Sun Sparcstations with SunOS operating
6
system version 5.6 or higher. 
3
tagger developed by Helmut Schmid. All rights are reserved by Helmut
4
Schmid.
7 5

  
8 6
Files contained in this package:
9 7

  
......
12 10
- README                How to use the tagger
13 11
- bin/train-tree-tagger training program
14 12
- bin/tree-tagger       tagger program
13
- bin/separate-punctuation program for tokenization (used by the shell scripts)
15 14
- cmd/lookup.perl       Perl script for pretagging
16 15
- doc/nemlap94.ps       paper describing the TreeTagger
17 16
- doc/sigdat95.ps       paper describing the TreeTagger
18 17

  
19 18
This package can be downloaded at 
20
http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html
19
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
21 20

  
22 21
Also available at this URL:
23 22
- parameter files
tmp/org.txm.treetagger.core.macosx/res/macosx/doc/sigdat95.ps (revision 2854)
1826 1826
/col7 {1.000 1.000 1.000 srgb} bind def
1827 1827
/col8 {0.000 0.000 0.560 srgb} bind def
1828 1828
/col9 {0.000 0.000 0.690 srgb} bind def
1829
/col10 {0.000 0.000 0.8.020 srgb} bind def
1830
/col11 {0.530 0.8.010 1.000 srgb} bind def
1829
/col10 {0.000 0.000 0.820 srgb} bind def
1830
/col11 {0.530 0.810 1.000 srgb} bind def
1831 1831
/col12 {0.000 0.560 0.000 srgb} bind def
1832 1832
/col13 {0.000 0.690 0.000 srgb} bind def
1833
/col14 {0.000 0.8.020 0.000 srgb} bind def
1833
/col14 {0.000 0.820 0.000 srgb} bind def
1834 1834
/col15 {0.000 0.560 0.560 srgb} bind def
1835 1835
/col16 {0.000 0.690 0.690 srgb} bind def
1836
/col17 {0.000 0.8.020 0.8.020 srgb} bind def
1836
/col17 {0.000 0.820 0.820 srgb} bind def
1837 1837
/col18 {0.560 0.000 0.000 srgb} bind def
1838 1838
/col19 {0.690 0.000 0.000 srgb} bind def
1839
/col20 {0.8.020 0.000 0.000 srgb} bind def
1839
/col20 {0.820 0.000 0.000 srgb} bind def
1840 1840
/col21 {0.560 0.000 0.560 srgb} bind def
1841 1841
/col22 {0.690 0.000 0.690 srgb} bind def
1842
/col23 {0.8.020 0.000 0.8.020 srgb} bind def
1842
/col23 {0.820 0.000 0.820 srgb} bind def
1843 1843
/col24 {0.500 0.190 0.000 srgb} bind def
1844 1844
/col25 {0.630 0.250 0.000 srgb} bind def
1845 1845
/col26 {0.750 0.380 0.000 srgb} bind def
1846 1846
/col27 {1.000 0.500 0.500 srgb} bind def
1847 1847
/col28 {1.000 0.630 0.630 srgb} bind def
1848 1848
/col29 {1.000 0.750 0.750 srgb} bind def
1849
/col30 {1.000 0.8.080 0.8.080 srgb} bind def
1850
/col31 {1.000 0.8.040 0.000 srgb} bind def
1849
/col30 {1.000 0.880 0.880 srgb} bind def
1850
/col31 {1.000 0.840 0.000 srgb} bind def
1851 1851

  
1852 1852
end
1853 1853
save
......
2065 2065
/col7 {1.000 1.000 1.000 srgb} bind def
2066 2066
/col8 {0.000 0.000 0.560 srgb} bind def
2067 2067
/col9 {0.000 0.000 0.690 srgb} bind def
2068
/col10 {0.000 0.000 0.8.020 srgb} bind def
2069
/col11 {0.530 0.8.010 1.000 srgb} bind def
2068
/col10 {0.000 0.000 0.820 srgb} bind def
2069
/col11 {0.530 0.810 1.000 srgb} bind def
2070 2070
/col12 {0.000 0.560 0.000 srgb} bind def
2071 2071
/col13 {0.000 0.690 0.000 srgb} bind def
2072
/col14 {0.000 0.8.020 0.000 srgb} bind def
2072
/col14 {0.000 0.820 0.000 srgb} bind def
2073 2073
/col15 {0.000 0.560 0.560 srgb} bind def
2074 2074
/col16 {0.000 0.690 0.690 srgb} bind def
2075
/col17 {0.000 0.8.020 0.8.020 srgb} bind def
2075
/col17 {0.000 0.820 0.820 srgb} bind def
2076 2076
/col18 {0.560 0.000 0.000 srgb} bind def
2077 2077
/col19 {0.690 0.000 0.000 srgb} bind def
2078
/col20 {0.8.020 0.000 0.000 srgb} bind def
2078
/col20 {0.820 0.000 0.000 srgb} bind def
2079 2079
/col21 {0.560 0.000 0.560 srgb} bind def
2080 2080
/col22 {0.690 0.000 0.690 srgb} bind def
2081
/col23 {0.8.020 0.000 0.8.020 srgb} bind def
2081
/col23 {0.820 0.000 0.820 srgb} bind def
2082 2082
/col24 {0.500 0.190 0.000 srgb} bind def
2083 2083
/col25 {0.630 0.250 0.000 srgb} bind def
2084 2084
/col26 {0.750 0.380 0.000 srgb} bind def
2085 2085
/col27 {1.000 0.500 0.500 srgb} bind def
2086 2086
/col28 {1.000 0.630 0.630 srgb} bind def
2087 2087
/col29 {1.000 0.750 0.750 srgb} bind def
2088
/col30 {1.000 0.8.080 0.8.080 srgb} bind def
2089
/col31 {1.000 0.8.040 0.000 srgb} bind def
2088
/col30 {1.000 0.880 0.880 srgb} bind def
2089
/col31 {1.000 0.840 0.000 srgb} bind def
2090 2090

  
2091 2091
end
2092 2092
save
......
2400 2400
gs 1 -1 sc (ty    \(NN:0.45, JJ:0.35, NP:0.2\)) col-1 sh gr
2401 2401
/Times-Roman ff 600.00 scf sf
2402 2402
6870 9945 m
2403
gs 1 -1 sc (son  \(NP:0.8.0, NN:0.1, JJ:0.1\)) col-1 sh gr
2403
gs 1 -1 sc (son  \(NP:0.8, NN:0.1, JJ:0.1\)) col-1 sh gr
2404 2404
/Times-Roman ff 600.00 scf sf
2405 2405
6840 11550 m
2406
gs 1 -1 sc (man \(NP:0.8.0, NN:0.2\)) col-1 sh gr
2406
gs 1 -1 sc (man \(NP:0.8, NN:0.2\)) col-1 sh gr
2407 2407
/Times-Roman ff 600.00 scf sf
2408 2408
6870 10725 m
2409 2409
gs 1 -1 sc (ton  \(NP:0.9, NN:0.05, JJ:0.05\)) col-1 sh gr
tmp/org.txm.treetagger.core.macosx/res/macosx/COPYRIGHT (revision 2854)
1

  
2
                  ************************
3
                  *  License Conditions  *
4
                  ************************
5

  
6
concerning the use and distribution of the program system 'TreeTagger'.  
7

  
8
The license is granted by
9
Helmut Schmid, Markusstraße 8, 72760 Reutlingen, Germany
10
Email schmid@cis.lmu.de
11

  
12
1. The user can freely use TreeTagger for evaluation, research, and
13
   teaching purposes. Any commercial usage is forbidden without a
14
   separate commercial license available from the licensor.
15

  
16
2. The user is not allowed to distribute or sell the system to third
17
   parties without written permission from the licensor.
18

  
19
			   NO WARRANTY
20

  
21
3. BECAUSE THE SYSTEM IS LICENSED FREE OF CHARGE, WE PROVIDE
22
ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE
23
LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE LICENSOR PROVIDES THE
24
SYSTEM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR
25
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
26
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK
27
AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH THE USER.
28
SHOULD THE SYSTEM PROVE DEFECTIVE, THE USER ASSUMES THE COST OF ALL
29
NECESSARY SERVICING, REPAIR OR CORRECTION.
30

  
31
4. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL THE LICENSOR BE
32
LIABLE TO THE USER FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST
33
MONIES, OR OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
34
OUT OF THE USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS
35
OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD
36
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAM)
37
THE PROGRAM, EVEN IF THE USER HAS BEEN ADVISED OF THE POSSIBILITY OF
38
SUCH DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY.
39

  
40

  
41
The wording of this license agreement has been adapted from the
42
license of the ALF system by Michael Hanus, Max-Planck-Institut
43
Saarbruecken and the GnuEmacs General Public License (c) 1991 Free
44
Software Foundation. 
tmp/org.txm.treetagger.core.macosx/res/macosx/README (revision 2854)
1 1

  
2
/***************************************************************************/
3
/* How to use the TreeTagger                                               */
4
/* Author: Helmut Schmid, University of Stuttgart, Germany                 */
5
/***************************************************************************/
2
/*****************************************************************************/
3
/* How to use the TreeTagger                                                 */
4
/*****************************************************************************/
6 5

  
7 6

  
8 7
The TreeTagger consists of two programs: train-tree-tagger is used to 
......
24 23
input will be read from stdin. If neither an input file nor an output file
25 24
is specified, the tagger will print to stdout.
26 25

  
27
tree-tagger {-options-} <parameter file> {<input file> {<output file>}}
26
tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>}
27
       {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}
28 28

  
29 29
Description of the command line arguments:
30 30

  
31 31
* <parameter file>: Name of a parameter file which was created with the 
32 32
  train-tree-tagger program.
33 33
* <input file>: Name of the file which is to be tagged. Each token in this 
34
  file has to be on a separate line. Tokens may contain blanks. It is possible
34
  file must be on a separate line. Tokens may contain blanks. It is possible
35 35
  to override the lexical information contained in the parameter file of the
36 36
  tagger by specifying a list of possible tags after a token. This list has
37
  to be preceded by a tab character and the elements are separated by tab 
38
  characters. This pretagging feature could be used e.g. to ensure that
37
  to be preceded by a tab character. The tags are optionally followed by a
38
  floating point value to specify the probability of the tag. Adding such
39
  tag information in the tagger's input is sometimes useful to ensure that
39 40
  certain text-specific expressions are tagged properly.
40 41
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
41 42
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
42 43
  separated if they were separated in the training data. (The French and
43
  English parameter files available by ftp expect separation of clitics).
44
  English parameter files available by ftp, expect separation of clitics).
44 45
  Sample input file:
45 46
    He
46 47
    moved
47 48
    to
48
    New York City	NP
49
    New York City	NP 1.0
49 50
    .
50 51
* <output file>: Name of the file to which the tagger should write its output.
51 52

  
......
55 56
* -lemma: tells the tagger to print the lemmas of the words also.
56 57
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
57 58
  with '>' (SGML tags).
58
- -no-unknown: If an unknown word is encountered, emit the word form
59
  as lemma. This was previously the default behaviour. Now, the default 
60
  behaviour is to print "<unknown>" as lemma.
61
- -threshold <p>: This option tells the tagger to print all tags of a
62
  word with a probability higher than <p> times the largest probability.
63
  (The tagger will use a different algorithm in this case and the set of
64
  best tags might be different from the tags generated without this
65
  option.)
66
- -prob: Print tag probabilities (in combination with option -threshold)
67
- -pt-with-prob: If this option is specified, then each pretagging tag
68
  (see above) has to be followed by a whitespace and a tag probability 
69
  value.
70
- -pt-with-lemma: If this option is specified, then each pretagging tag
71
  (see above) has to be followed by a whitespace and a lemma. Lemmas may 
72
  contain blanks.
73
  If both -pt-with-prob and -pt-with-lemma have been specified, then each
74
  pretagging tag is followed by a probability and a lemma in that order.
75 59

  
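The input format described above (one token per line, with an optional pretag after a tab) can be sketched in Python as follows. This is only an illustration, not part of the TreeTagger distribution: the helper names `format_input_line` and `build_input` are hypothetical, and the sketch assumes a single pretag per token with the probability written after a blank, as in the sample line "New York City<TAB>NP 1.0".

```python
from typing import Iterable, Optional, Tuple


def format_input_line(token: str, tag: Optional[str] = None,
                      prob: Optional[float] = None) -> str:
    """Return one tagger input line for a token.

    Assumption: a pretag follows the token after a tab, and a probability
    (if any) follows the tag after a blank, as in the README sample.
    """
    if not token or "\t" in token or "\n" in token:
        raise ValueError("token must be non-empty and contain no tab or newline")
    if tag is None:
        if prob is not None:
            raise ValueError("a probability needs a tag")
        return token
    line = f"{token}\t{tag}"
    if prob is not None:
        line += f" {prob}"
    return line


def build_input(entries: Iterable[Tuple]) -> str:
    """Join (token[, tag[, prob]]) tuples into one-token-per-line text."""
    return "\n".join(format_input_line(*entry) for entry in entries) + "\n"


# The sample input from the README, with "New York City" pretagged as NP.
sample = build_input([
    ("He",),
    ("moved",),
    ("to",),
    ("New York City", "NP", 1.0),
    (".",),
])
```

Punctuation marks and clitics still have to be split into separate tokens before they are passed to a helper like this one; the sketch does no tokenization.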
76
The options below are for advanced users. Please, read the papers on the 
77
TreeTagger to fully understand their meaning.
60
The options below are for advanced users. Read the papers on the TreeTagger
61
to fully understand their meaning.
78 62

  
79 63
* -proto: If this option is specified, the tagger creates a file named
80 64
  "lexicon-protocol.txt", which contains information about the degree of
......
85 69
  hyphen has been found in the fullform lexicon.
86 70
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
87 71
  This is the case if a word/tag pair is contained in the lexicon but not
88
  in the training corpus. The choice of this parameter has only minor
89
  influence on the tagging accuracy.
72
  in the training corpus. The default is 0.1. The choice of this parameter
73
  has some minor influence on tagging accuracy.
74
* -beam <threshold>: If the tagger is slow, this option can be used to speed it up.
75
  Good values for <threshold> are in the range 0.001-0.00001.
90 76
* -base: If this option is specified, only lexical information is used
91 77
  for tagging but no contextual information about the preceding tags.
92 78
  This option is only useful in order to obtain a baseline result
93 79
  to which to compare the actual tagger output.
94 80

  
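For tagger builds that offer the -threshold <p> option, the selection rule in its description (keep every tag whose probability is higher than <p> times the largest probability) can be illustrated with a small Python sketch. The function name `tags_above_threshold` and the probability values are illustrative assumptions. The sketch shows only the filtering rule, not the algorithm the tagger uses to compute the probabilities.

```python
from typing import Dict


def tags_above_threshold(tag_probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the tags whose probability exceeds p times the best probability."""
    if not tag_probs:
        return {}
    best = max(tag_probs.values())
    return {tag: q for tag, q in tag_probs.items() if q > p * best}


# Illustrative distribution for a single word (not real tagger output).
probs = {"NP": 0.8, "NN": 0.1, "JJ": 0.1}

only_best = tags_above_threshold(probs, 0.2)   # cut-off is 0.2 * 0.8 = 0.16
all_three = tags_above_threshold(probs, 0.1)   # cut-off is 0.1 * 0.8 = 0.08
```

With the cut-off at 0.16 only NP survives. Lowering <p> to 0.1 also lets NN and JJ through.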
81
There is another tagger program called "tree-tagger-flush" which
82
flushes the output after reading an empty line. It expects a parameter
83
file as argument and reads from stdin and writes to stdout. No command
84
line options are supported. This program might be useful for
85
implementing wrappers.
95 86

  
96 87

  
88

  
97 89
Training
98 90
--------
99 91

  
100 92
Training is done with the *train-tree-tagger* program. It expects at least
101 93
four command line arguments which are described below.
102 94

  
103
train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>
95
train-tree-tagger <lexicon> <open class file> <input file> <output file> 
96
            {-cl <context length>} {-dtg <min. decision tree gain>}
97
            {-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>}
104 98

  
105 99
Description of the command line arguments:
106 100

  
107 101
* <lexicon>: name of a file which contains the fullform lexicon. Each line 
108 102
  of the lexicon corresponds to one word form and contains the word form 
109
  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
110
  and each lemma is preceded by a blank or tab character.
103
  itself followed by a Tab character and a sequence of tag-lemma pairs.
104
  The tags and lemmata are separated by whitespace.
111 105
  Example:
112 106

  
113 107
aback	RB aback
114 108
abacuses	NNS abacus
115
abandon	VB abandon	VBP abandon
116
abandoned	JJ abandoned	VBD abandon	VBN abandon
109
abandon	VB abandon VBP abandon
110
abandoned	JJ abandoned VBD abandon VBN abandon
117 111
abandoning	VBG abandon
118 112

  
119
  Attention: Ordinal and cardinal numbers which consist of digits
120
  (like 1, 13, 1278 or 2. and 75.) should not be included in the
121
  lexicon. Otherwise, the tagger will not be able to learn how to tag
122
  numbers which are not listed in the lexicon. Numbers with unusual
123
  tags should be added to the lexicon, however. If the training
124
  program reports an error because the POS tag used for numbers is
125
  unknown, you should add a lexicon entry for one number.
113
  Remark: The tagger doesn't need the lemmata actually. If you do not have
114
  the lemma information or if you do not plan to annotate corpora with
115
  lemmas, you can replace the lemma with a dummy value, e.g. "-".
126 116

  
127
  Remark: The tagger doesn't need the lemmata for tagging actually. If
128
  you do not have the lemma information or if you do not plan to
129
  annotate corpora with lemmas, you can replace the lemma with a dummy
130
  value, e.g. "-".
131

  
132
* <open class file>: name of a file which contains a list of open class tags
133
  i.e. possible tags of unknown word forms separated by whitespace.
117
* <open class file>: name of a file which contains a list of open class
118
  tags i.e. possible tags of unknown word forms separated by whitespace.
134 119
  The tagger will use this information when it encounters unknown words,
135 120
  i.e. words which are not contained in the lexicon.
136 121
  Example: (for Penn Treebank tagset)
......
140 125
* <input file>: name of a file which contains tagged training data. The data
141 126
  must be in one-word-per-line format. This means that each line contains 
142 127
  one token and one tag in that order separated by a tab character. 
143
  Punctuation marks are considered as tokens and must be tagged as well.
144
  The file should neither contain empty lines nor untagged SGML markup.
128
  Punctuation marks are considered as tokens and must have been tagged as well.
145 129
  Example:
146 130

  
147 131
Pierre  NP
......
166 150
  this parameter to 1.
167 151
* -dtg <min. decision tree gain>: Threshold - If the information gain at a 
168 152
  leaf node of the decision tree is below this threshold, the node is deleted.
169
* -sw <weight>: A smoothing parameter, which determines how much the
170
  probability distribution of some decision tree node is smoothed with the
171
  probability distribution of the parent node.
153
  The default value is 0.7.
172 154
* -ecw <eq. class weight>: weight of the equivalence class based probability
173
  estimates.
155
  estimates. The default is 0.15.
174 156
* -atg <affix tree gain>: Threshold - If the information gain at a leaf of an
175 157
  affix tree is below this threshold, it is deleted. The default is 1.2.
176 158

  
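The lexicon format described above (a word form, a tab, then tag-lemma pairs) can be read with a few lines of Python. This is a minimal sketch under stated assumptions: `parse_lexicon_line` is a hypothetical helper, and it assumes that lemmas contain no blanks, so that any whitespace separates tags from lemmas. Under that assumption it accepts both layouts shown in the example, with pairs separated by tabs or by blanks.

```python
from typing import List, Tuple


def parse_lexicon_line(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a lexicon line into the word form and its (tag, lemma) pairs.

    Assumption: lemmas contain no blanks, so the fields after the first
    tab can be split on any whitespace and paired up in order.
    """
    word, sep, rest = line.rstrip("\n").partition("\t")
    if not sep or not word:
        raise ValueError("expected a word form followed by a tab")
    fields = rest.split()
    if not fields or len(fields) % 2 != 0:
        raise ValueError("expected one or more tag-lemma pairs")
    return word, list(zip(fields[0::2], fields[1::2]))


# Both layouts from the example yield the same structure.
blank_separated = parse_lexicon_line("abandoned\tJJ abandoned VBD abandon VBN abandon")
tab_separated = parse_lexicon_line("abandon\tVB abandon\tVBP abandon")
```

The word form is taken as everything before the first tab and is not stripped, so a stray blank before the tab stays part of the word.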
177
The accuracy of the TreeTagger usually improves, if different settings
178
of the above parameters are tested and the best combination is chosen.
179

  
180

  
181
Caveat: Make sure that the lexicon and the training corpus contain no
182
extra blanks. If the word form, for instance, is followed by a blank
183
and a tab character, the blank will be considered part of the word.
184

  
159
The accuracy of the TreeTagger is usually slightly improved, if different
160
settings of the above parameters are tested and the best combination is
161
chosen.
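Before training, it can help to check the tagged input against the format rules given above: one token and one tag per line, separated by a tab, with no empty lines and no stray blanks next to the tab. The following Python sketch does that. The function name `check_training_lines` and the wording of the messages are assumptions made for illustration; the training program may accept or reject input differently.

```python
from typing import Iterable, List, Tuple


def check_training_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, problem) pairs for lines that break the format."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            problems.append((number, "empty line"))
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            problems.append((number, "expected exactly one tab between token and tag"))
            continue
        token, tag = parts
        if not token or not tag:
            problems.append((number, "missing token or tag"))
        elif token != token.strip() or tag != tag.strip():
            # A blank before the tab would be treated as part of the word.
            problems.append((number, "extra blank around token or tag"))
    return problems


# Illustrative lines: the first is well-formed, the other three are not.
report = check_training_lines([
    "Pierre\tNP\n",
    "Vinken \tNP\n",
    "\n",
    "old JJ\n",
])
```

Blanks inside a token are not reported, because tokens are allowed to contain them. Only leading and trailing blanks are flagged.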
