Revision 2854
tmp/org.txm.treetagger.core.macosx/res/macosx/doc/sigdat95.ps (revision 2854) | ||
---|---|---|
1826 | 1826 |
/col7 {1.000 1.000 1.000 srgb} bind def |
1827 | 1827 |
/col8 {0.000 0.000 0.560 srgb} bind def |
1828 | 1828 |
/col9 {0.000 0.000 0.690 srgb} bind def |
1829 |
/col10 {0.000 0.000 0.8.020 srgb} bind def
|
|
1830 |
/col11 {0.530 0.8.010 1.000 srgb} bind def
|
|
1829 |
/col10 {0.000 0.000 0.820 srgb} bind def |
|
1830 |
/col11 {0.530 0.810 1.000 srgb} bind def |
|
1831 | 1831 |
/col12 {0.000 0.560 0.000 srgb} bind def |
1832 | 1832 |
/col13 {0.000 0.690 0.000 srgb} bind def |
1833 |
/col14 {0.000 0.8.020 0.000 srgb} bind def
|
|
1833 |
/col14 {0.000 0.820 0.000 srgb} bind def |
|
1834 | 1834 |
/col15 {0.000 0.560 0.560 srgb} bind def |
1835 | 1835 |
/col16 {0.000 0.690 0.690 srgb} bind def |
1836 |
/col17 {0.000 0.8.020 0.8.020 srgb} bind def
|
|
1836 |
/col17 {0.000 0.820 0.820 srgb} bind def
|
|
1837 | 1837 |
/col18 {0.560 0.000 0.000 srgb} bind def |
1838 | 1838 |
/col19 {0.690 0.000 0.000 srgb} bind def |
1839 |
/col20 {0.8.020 0.000 0.000 srgb} bind def
|
|
1839 |
/col20 {0.820 0.000 0.000 srgb} bind def |
|
1840 | 1840 |
/col21 {0.560 0.000 0.560 srgb} bind def |
1841 | 1841 |
/col22 {0.690 0.000 0.690 srgb} bind def |
1842 |
/col23 {0.8.020 0.000 0.8.020 srgb} bind def
|
|
1842 |
/col23 {0.820 0.000 0.820 srgb} bind def
|
|
1843 | 1843 |
/col24 {0.500 0.190 0.000 srgb} bind def |
1844 | 1844 |
/col25 {0.630 0.250 0.000 srgb} bind def |
1845 | 1845 |
/col26 {0.750 0.380 0.000 srgb} bind def |
1846 | 1846 |
/col27 {1.000 0.500 0.500 srgb} bind def |
1847 | 1847 |
/col28 {1.000 0.630 0.630 srgb} bind def |
1848 | 1848 |
/col29 {1.000 0.750 0.750 srgb} bind def |
1849 |
/col30 {1.000 0.8.080 0.8.080 srgb} bind def
|
|
1850 |
/col31 {1.000 0.8.040 0.000 srgb} bind def
|
|
1849 |
/col30 {1.000 0.880 0.880 srgb} bind def
|
|
1850 |
/col31 {1.000 0.840 0.000 srgb} bind def |
|
1851 | 1851 |
|
1852 | 1852 |
end |
1853 | 1853 |
save |
... | ... | |
2065 | 2065 |
/col7 {1.000 1.000 1.000 srgb} bind def |
2066 | 2066 |
/col8 {0.000 0.000 0.560 srgb} bind def |
2067 | 2067 |
/col9 {0.000 0.000 0.690 srgb} bind def |
2068 |
/col10 {0.000 0.000 0.8.020 srgb} bind def
|
|
2069 |
/col11 {0.530 0.8.010 1.000 srgb} bind def
|
|
2068 |
/col10 {0.000 0.000 0.820 srgb} bind def |
|
2069 |
/col11 {0.530 0.810 1.000 srgb} bind def |
|
2070 | 2070 |
/col12 {0.000 0.560 0.000 srgb} bind def |
2071 | 2071 |
/col13 {0.000 0.690 0.000 srgb} bind def |
2072 |
/col14 {0.000 0.8.020 0.000 srgb} bind def
|
|
2072 |
/col14 {0.000 0.820 0.000 srgb} bind def |
|
2073 | 2073 |
/col15 {0.000 0.560 0.560 srgb} bind def |
2074 | 2074 |
/col16 {0.000 0.690 0.690 srgb} bind def |
2075 |
/col17 {0.000 0.8.020 0.8.020 srgb} bind def
|
|
2075 |
/col17 {0.000 0.820 0.820 srgb} bind def
|
|
2076 | 2076 |
/col18 {0.560 0.000 0.000 srgb} bind def |
2077 | 2077 |
/col19 {0.690 0.000 0.000 srgb} bind def |
2078 |
/col20 {0.8.020 0.000 0.000 srgb} bind def
|
|
2078 |
/col20 {0.820 0.000 0.000 srgb} bind def |
|
2079 | 2079 |
/col21 {0.560 0.000 0.560 srgb} bind def |
2080 | 2080 |
/col22 {0.690 0.000 0.690 srgb} bind def |
2081 |
/col23 {0.8.020 0.000 0.8.020 srgb} bind def
|
|
2081 |
/col23 {0.820 0.000 0.820 srgb} bind def
|
|
2082 | 2082 |
/col24 {0.500 0.190 0.000 srgb} bind def |
2083 | 2083 |
/col25 {0.630 0.250 0.000 srgb} bind def |
2084 | 2084 |
/col26 {0.750 0.380 0.000 srgb} bind def |
2085 | 2085 |
/col27 {1.000 0.500 0.500 srgb} bind def |
2086 | 2086 |
/col28 {1.000 0.630 0.630 srgb} bind def |
2087 | 2087 |
/col29 {1.000 0.750 0.750 srgb} bind def |
2088 |
/col30 {1.000 0.8.080 0.8.080 srgb} bind def
|
|
2089 |
/col31 {1.000 0.8.040 0.000 srgb} bind def
|
|
2088 |
/col30 {1.000 0.880 0.880 srgb} bind def
|
|
2089 |
/col31 {1.000 0.840 0.000 srgb} bind def |
|
2090 | 2090 |
|
2091 | 2091 |
end |
2092 | 2092 |
save |
... | ... | |
2400 | 2400 |
gs 1 -1 sc (ty \(NN:0.45, JJ:0.35, NP:0.2\)) col-1 sh gr |
2401 | 2401 |
/Times-Roman ff 600.00 scf sf |
2402 | 2402 |
6870 9945 m |
2403 |
gs 1 -1 sc (son \(NP:0.8.0, NN:0.1, JJ:0.1\)) col-1 sh gr
|
|
2403 |
gs 1 -1 sc (son \(NP:0.8, NN:0.1, JJ:0.1\)) col-1 sh gr |
|
2404 | 2404 |
/Times-Roman ff 600.00 scf sf |
2405 | 2405 |
6840 11550 m |
2406 |
gs 1 -1 sc (man \(NP:0.8.0, NN:0.2\)) col-1 sh gr
|
|
2406 |
gs 1 -1 sc (man \(NP:0.8, NN:0.2\)) col-1 sh gr |
|
2407 | 2407 |
/Times-Roman ff 600.00 scf sf |
2408 | 2408 |
6870 10725 m |
2409 | 2409 |
gs 1 -1 sc (ton \(NP:0.9, NN:0.05, JJ:0.05\)) col-1 sh gr |
tmp/org.txm.treetagger.core.macosx/res/macosx/COPYRIGHT (revision 2854) | ||
---|---|---|
1 |
|
|
2 |
************************ |
|
3 |
* License Conditions * |
|
4 |
************************ |
|
5 |
|
|
6 |
concerning the use and distribution of the program system 'TreeTagger'. |
|
7 |
|
|
8 |
The license is granted by |
|
9 |
Helmut Schmid, Markusstraße 8, 72760 Reutlingen, Germany |
|
10 |
Email schmid@cis.lmu.de |
|
11 |
|
|
12 |
1. The user can freely use TreeTagger for evaluation, research, and |
|
13 |
teaching purposes. Any commercial usage is forbidden without a |
|
14 |
separate commercial license available from the licensor. |
|
15 |
|
|
16 |
2. The user is not allowed to distribute or sell the system to third |
|
17 |
parties without written permission from the licensor. |
|
18 |
|
|
19 |
NO WARRANTY |
|
20 |
|
|
21 |
3. BECAUSE THE SYSTEM IS LICENSED FREE OF CHARGE, WE PROVIDE |
|
22 |
ABSOLUTELY NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE |
|
23 |
LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE LICENSOR PROVIDES THE |
|
24 |
SYSTEM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR |
|
25 |
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF |
|
26 |
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK |
|
27 |
AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH THE USER. |
|
28 |
SHOULD THE SYSTEM PROVE DEFECTIVE, THE USER ASSUMES THE COST OF ALL |
|
29 |
NECESSARY SERVICING, REPAIR OR CORRECTION. |
|
30 |
|
|
31 |
4. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL THE LICENSOR BE |
|
32 |
LIABLE TO THE USER FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST |
|
33 |
MONIES, OR OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING |
|
34 |
OUT OF THE USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS |
|
35 |
OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD |
|
36 |
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAM) |
|
37 |
THE PROGRAM, EVEN IF THE USER HAS BEEN ADVISED OF THE POSSIBILITY OF |
|
38 |
SUCH DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY. |
|
39 |
|
|
40 |
|
|
41 |
The wording of this license agreement has been adapted from the |
|
42 |
license of the ALF system by Michael Hanus, Max-Planck-Institut |
|
43 |
Saarbruecken and the GnuEmacs General Public License (c) 1991 Free |
|
44 |
Software Foundation. |
tmp/org.txm.treetagger.core.macosx/res/macosx/README (revision 2854) | ||
---|---|---|
1 | 1 |
|
2 |
/***************************************************************************/ |
|
3 |
/* How to use the TreeTagger */ |
|
4 |
/* Author: Helmut Schmid, University of Stuttgart, Germany */ |
|
5 |
/***************************************************************************/ |
|
2 |
/*****************************************************************************/ |
|
3 |
/* How to use the TreeTagger */ |
|
4 |
/*****************************************************************************/ |
|
6 | 5 |
|
7 | 6 |
|
8 | 7 |
The TreeTagger consists of two programs: train-tree-tagger is used to |
... | ... | |
24 | 23 |
input will be read from stdin. If neither an input file nor an output file |
25 | 24 |
is specified, the tagger will print to stdout. |
26 | 25 |
|
27 |
tree-tagger {-options-} <parameter file> {<input file> {<output file>}} |
|
26 |
tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>} |
|
27 |
{-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>} |
|
28 | 28 |
|
29 | 29 |
Description of the command line arguments: |
30 | 30 |
|
31 | 31 |
* <parameter file>: Name of a parameter file which was created with the |
32 | 32 |
train-tree-tagger program. |
33 | 33 |
* <input file>: Name of the file which is to be tagged. Each token in this |
34 |
file has to be on a separate line. Tokens may contain blanks. It is possible
|
|
34 |
file must be on a separate line. Tokens may contain blanks. It is possible
|
|
35 | 35 |
to override the lexical information contained in the parameter file of the |
36 | 36 |
tagger by specifying a list of possible tags after a token. This list has |
37 |
to be preceded by a tab character and the elements are separated by tab |
|
38 |
characters. This pretagging feature could be used e.g. to ensure that |
|
37 |
to be preceded by a tab character. The tags are optionally followed by a |
|
38 |
floating point value to specify the probability of the tag. Adding such |
|
39 |
tag information in the tagger's input is sometimes useful to ensure that |
|
39 | 40 |
certain text-specific expressions are tagged properly. |
40 | 41 |
Punctuation marks must be on separate lines as well. Clitics (like "'s", |
41 | 42 |
"'re", and "'d" in English or "-la" and "-t-elle" in French) should be |
42 | 43 |
separated if they were separated in the training data. (The French and |
43 |
English parameter files available by ftp expect separation of clitics). |
|
44 |
English parameter files available by ftp, expect separation of clitics).
|
|
44 | 45 |
Sample input file: |
45 | 46 |
He |
46 | 47 |
moved |
47 | 48 |
to |
48 |
New York City NP |
|
49 |
New York City NP 1.0
|
|
49 | 50 |
. |
50 | 51 |
* <output file>: Name of the file to which the tagger should write its output. |
51 | 52 |
|
... | ... | |
55 | 56 |
* -lemma: tells the tagger to print the lemmas of the words also. |
56 | 57 |
* -sgml: tells the tagger to ignore tokens starting with '<' and ending |
57 | 58 |
with '>' (SGML tags). |
58 |
- -no-unknown: If an unknown word is encountered, emit the word form |
|
59 |
as lemma. This was previously the default behaviour. Now, the default |
|
60 |
behaviour is to print "<unknown>" as lemma. |
|
61 |
- -threshold <p>: This option tells the tagger to print all tags of a |
|
62 |
word with a probability higher than <p> times the largest probability. |
|
63 |
(The tagger will use a different algorithm in this case and the set of |
|
64 |
best tags might be different from the tags generated without this |
|
65 |
option.) |
|
66 |
- -prob: Print tag probabilities (in combination with option -threshold) |
|
67 |
- -pt-with-prob: If this option is specified, then each pretagging tag |
|
68 |
(see above) has to be followed by a whitespace and a tag probability |
|
69 |
value. |
|
70 |
- -pt-with-lemma: If this option is specified, then each pretagging tag |
|
71 |
(see above) has to be followed by a whitespace and a lemma. Lemmas may |
|
72 |
contain blanks. |
|
73 |
If both -pt-with-prob and -pt-with-lemma have been specified, then each |
|
74 |
pretagging tag is followed by a probability and a lemma in that order. |
|
75 | 59 |
|
76 |
The options below are for advanced users. Please, read the papers on the
|
|
77 |
TreeTagger to fully understand their meaning.
|
|
60 |
The options below are for advanced users. Read the papers on the TreeTagger
|
|
61 |
to fully understand their meaning. |
|
78 | 62 |
|
79 | 63 |
* -proto: If this option is specified, the tagger creates a file named |
80 | 64 |
"lexicon-protocol.txt", which contains information about the degree of |
... | ... | |
85 | 69 |
hyphen has been found in the fullform lexicon. |
86 | 70 |
* -eps <epsilon>: Value which is used to replace zero lexical frequencies. |
87 | 71 |
This is the case if a word/tag pair is contained in the lexicon but not |
88 |
in the training corpus. The choice of this parameter has only minor |
|
89 |
influence on the tagging accuracy. |
|
72 |
in the training corpus. The default is 0.1. The choice of this parameter |
|
73 |
has some minor influence on tagging accuracy. |
|
74 |
* -beam <threshold>: If the tagger is slow, this option can be used to speed it up. |
|
75 |
Good values for <threshold> are in the range 0.001-0.00001. |
|
90 | 76 |
* -base: If this option is specified, only lexical information is used |
91 | 77 |
for tagging but no contextual information about the preceding tags. |
92 | 78 |
This option is only useful in order to obtain a baseline result |
93 | 79 |
to which to compare the actual tagger output. |
94 | 80 |
|
81 |
There is another tagger program called "tree-tagger-flush" which |
|
82 |
flushes the output after reading an empty line. It expects a parameter |
|
83 |
file as argument and reads from stdin and writes to stdout. No command |
|
84 |
line options are supported. This program might be useful for |
|
85 |
implementing wrappers. |
|
95 | 86 |
|
96 | 87 |
|
88 |
|
|
97 | 89 |
Training |
98 | 90 |
-------- |
99 | 91 |
|
100 | 92 |
Training is done with the *train-tree-tagger* program. It expects at least |
101 | 93 |
four command line arguments which are described below. |
102 | 94 |
|
103 |
train-tree-tagger {options} <lexicon> <open class file> <input file> <output file> |
|
95 |
train-tree-tagger <lexicon> <open class file> <input file> <output file> |
|
96 |
{-cl <context length>} {-dtg <min. decision tree gain>} |
|
97 |
{-ecw <eq. class weight>} {-atg <affix tree gain>} {-st <sent. tag>} |
|
104 | 98 |
|
105 | 99 |
Description of the command line arguments: |
106 | 100 |
|
107 | 101 |
* <lexicon>: name of a file which contains the fullform lexicon. Each line |
108 | 102 |
of the lexicon corresponds to one word form and contains the word form |
109 |
and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
|
|
110 |
and each lemma is preceded by a blank or tab character.
|
|
103 |
itself followed by a Tab character and a sequence of tag-lemma pairs.
|
|
104 |
The tags and lemmata are separated by whitespace.
|
|
111 | 105 |
Example: |
112 | 106 |
|
113 | 107 |
aback RB aback |
114 | 108 |
abacuses NNS abacus |
115 |
abandon VB abandon VBP abandon
|
|
116 |
abandoned JJ abandoned VBD abandon VBN abandon
|
|
109 |
abandon VB abandon VBP abandon
|
|
110 |
abandoned JJ abandoned VBD abandon VBN abandon
|
|
117 | 111 |
abandoning VBG abandon |
118 | 112 |
|
119 |
Attention: Ordinal and cardinal numbers which consist of digits |
|
120 |
(like 1, 13, 1278 or 2. and 75.) should not be included in the |
|
121 |
lexicon. Otherwise, the tagger will not be able to learn how to tag |
|
122 |
numbers which are not listed in the lexicon. Numbers with unusual |
|
123 |
tags should be added to the lexicon, however. If the training |
|
124 |
program reports an error because the POS tag used for numbers is |
|
125 |
unknown, you should add a lexicon entry for one number. |
|
113 |
Remark: The tagger doesn't need the lemmata actually. If you do not have |
|
114 |
the lemma information or if you do not plan to annotate corpora with |
|
115 |
lemmas, you can replace the lemma with a dummy value, e.g. "-". |
|
126 | 116 |
|
127 |
Remark: The tagger doesn't need the lemmata for tagging actually. If |
|
128 |
you do not have the lemma information or if you do not plan to |
|
129 |
annotate corpora with lemmas, you can replace the lemma with a dummy |
|
130 |
value, e.g. "-". |
|
131 |
|
|
132 |
* <open class file>: name of a file which contains a list of open class tags |
|
133 |
i.e. possible tags of unknown word forms separated by whitespace. |
|
117 |
* <open class file>: name of a file which contains a list of open class |
|
118 |
tags i.e. possible tags of unknown word forms separated by whitespace. |
|
134 | 119 |
The tagger will use this information when it encounters unknown words, |
135 | 120 |
i.e. words which are not contained in the lexicon. |
136 | 121 |
Example: (for Penn Treebank tagset) |
... | ... | |
140 | 125 |
* <input file>: name of a file which contains tagged training data. The data |
141 | 126 |
must be in one-word-per-line format. This means that each line contains |
142 | 127 |
one token and one tag in that order separated by a tabulator. |
143 |
Punctuation marks are considered as tokens and must be tagged as well. |
|
144 |
The file should neither contain empty lines nor untagged SGML markup. |
|
128 |
Punctuation marks are considered as tokens and must have been tagged as well. |
|
145 | 129 |
Example: |
146 | 130 |
|
147 | 131 |
Pierre NP |
... | ... | |
166 | 150 |
this parameter to 1. |
167 | 151 |
* -dtg <min. decision tree gain>: Threshold - If the information gain at a |
168 | 152 |
leaf node of the decision tree is below this threshold, the node is deleted. |
169 |
* -sw <weight>: A smoothing parameter, which determines how much the |
|
170 |
probability distribution of some decision tree node is smoothed with the |
|
171 |
probability distribution of the parent node. |
|
153 |
The default value is 0.7. |
|
172 | 154 |
* -ecw <eq. class weight>: weight of the equivalence class based probability |
173 |
estimates. |
|
155 |
estimates. The default is 0.15.
|
|
174 | 156 |
* -atg <affix tree gain> Threshold - If the information gain at a leaf of an |
175 | 157 |
affix tree is below this threshold, it is deleted. The default is 1.2. |
176 | 158 |
|
177 |
The accuracy of the TreeTagger usually improves, if different settings |
|
178 |
of the above parameters are tested and the best combination is chosen. |
|
179 |
|
|
180 |
|
|
181 |
Caveat: Make sure that the lexicon and the training corpus contain no |
|
182 |
extra blanks. If the word form, for instance, is followed by a blank |
|
183 |
and a tab character, the blank will be considered part of the word. |
|
184 |
|
|
159 |
The accuracy of the TreeTagger is usually slightly improved, if different |
|
160 |
settings of the above parameters are tested and the best combination is |
|
161 |
chosen. |
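
The lexicon format described in the README diff above (a word form, a Tab character, then tag-lemma pairs separated by whitespace) can be checked before training with a small script. The following is only a sketch and not part of the TreeTagger distribution: the function name `parse_lexicon_line` is hypothetical, and it assumes that lemmas contain no blanks and that everything before the first tab is the word form.

```python
def parse_lexicon_line(line):
    """Split one lexicon line into (word_form, [(tag, lemma), ...]).

    Assumptions (not confirmed by the README): lemmas contain no blanks,
    and the word form is everything before the first tab character.
    """
    word, tab, rest = line.rstrip("\n").partition("\t")
    fields = rest.split()  # tags and lemmata are separated by whitespace
    if not word or not tab or not fields or len(fields) % 2:
        raise ValueError(f"malformed lexicon line: {line!r}")
    # Pair every tag (even positions) with the lemma that follows it.
    return word, list(zip(fields[0::2], fields[1::2]))


if __name__ == "__main__":
    sample = "abandoned\tJJ abandoned\tVBD abandon\tVBN abandon"
    print(parse_lexicon_line(sample))
```

The word form is deliberately not stripped: a blank between the word form and the tab would stay part of the word form, so such lines are easier to spot in the parsed output.
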
tmp/org.txm.treetagger.core.macosx/res/macosx/FILES (revision 2854) | ||
---|---|---|
1 | 1 |
|
2 | 2 |
This package contains the TreeTagger, a probabilistic part-of-speech |
3 |
tagger developed by Helmut Schmid. All rights are reserved by the |
|
4 |
Institute for Computational Linguistics at the University of Stuttgart. |
|
5 |
The programs have been compiled for Sun Sparcstations with SunOS operating |
|
6 |
system version 5.6 or higher. |
|
3 |
tagger developed by Helmut Schmid. All rights are reserved by Helmut |
|
4 |
Schmid. |
|
7 | 5 |
|
8 | 6 |
Files contained in this package: |
9 | 7 |
|
... | ... | |
12 | 10 |
- README How to use the tagger |
13 | 11 |
- bin/train-tree-tagger training program |
14 | 12 |
- bin/tree-tagger tagger programm |
13 |
- bin/separate-punctuation program for tokenization (used by the shell scripts) |
|
15 | 14 |
- cmd/lookup.perl Perl script for pretagging |
16 | 15 |
- doc/nemlap94.ps paper describing the TreeTagger |
17 | 16 |
- doc/sigdat95.ps paper describing the TreeTagger |
18 | 17 |
|
19 | 18 |
This package can be downloaded at |
20 |
http://www.ims.uni-stuttgart.de/Tools/DecisionTreeTagger.html
|
|
19 |
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
|
|
21 | 20 |
|
22 | 21 |
Also available at this URL: |
23 | 22 |
- parameter files |