## root / tmp / org.txm.treetagger.core.macosx / res / macosx / README @ 1120

/***************************************************************************/
/* How to use the TreeTagger                                               */
/* Author: Helmut Schmid, University of Stuttgart, Germany                 */
/***************************************************************************/


The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a hand-tagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.

If either of the programs is called without arguments, it will print
information about its usage.


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If neither an input file nor an output file
is specified, the tagger will print to stdout.

tree-tagger {-options-} <parameter file> {<input file> {<output file>}}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file has to be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token.
  This list has to be preceded by a tab character and the elements are
  separated by tab characters. This pretagging feature could be used e.g. to
  ensure that certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data. (The French and
  English parameter files available by ftp expect separation of clitics.)
  Sample input file:
    He
    moved
    to
    New York City	NP
    .
* <output file>: Name of the file to which the tagger should write its output.

Further optional command line arguments:

* -token: tells the tagger to print the words also.
* -lemma: tells the tagger to print the lemmas of the words also.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).
* -no-unknown: If an unknown word is encountered, emit the word form
  as lemma. This was previously the default behaviour. Now, the default
  behaviour is to print "<unknown>" as lemma.
* -threshold <p>: This option tells the tagger to print all tags of a
  word with a probability higher than <p> times the largest probability.
  (The tagger will use a different algorithm in this case and the set of
  best tags might be different from the tags generated without this
  option.)
* -prob: Print tag probabilities (in combination with option -threshold).
* -pt-with-prob: If this option is specified, then each pretagging tag
  (see above) has to be followed by a whitespace and a tag probability
  value.
* -pt-with-lemma: If this option is specified, then each pretagging tag
  (see above) has to be followed by a whitespace and a lemma. Lemmas may
  contain blanks.
  If both -pt-with-prob and -pt-with-lemma have been specified, then each
  pretagging tag is followed by a probability and a lemma in that order.

The options below are for advanced users. Please read the papers on the
TreeTagger to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form has been found is also indicated. 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  This is the case if a word/tag pair is contained in the lexicon but not
  in the training corpus. The choice of this parameter has only minor
  influence on the tagging accuracy.
* -base: If this option is specified, only lexical information is used
  for tagging but no contextual information about the preceding tags.
  This option is only useful in order to obtain a baseline result
  to which to compare the actual tagger output.



Training
--------

Training is done with the train-tree-tagger program. It expects at least
four command line arguments which are described below.

train-tree-tagger {options} <lexicon> <open class file> <input file> <output file>

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  and a sequence of tag-lemma pairs. Each tag is preceded by a tab character
  and each lemma is preceded by a blank or tab character.
  Example:

aback	RB aback
abacuses	NNS abacus
abandon	VB abandon	VBP abandon
abandoned	JJ abandoned	VBD abandon	VBN abandon
abandoning	VBG abandon

  Attention: Ordinal and cardinal numbers which consist of digits
  (like 1, 13, 1278 or 2. and 75.) should not be included in the
  lexicon. Otherwise, the tagger will not be able to learn how to tag
  numbers which are not listed in the lexicon. Numbers with unusual
  tags should be added to the lexicon, however. If the training
  program reports an error because the POS tag used for numbers is
  unknown, you should add a lexicon entry for one number.

  Remark: The tagger does not actually need the lemmata for tagging.
  If you do not have the lemma information or if you do not plan to
  annotate corpora with lemmas, you can replace the lemma with a dummy
  value, e.g. "-".

* <open class file>: name of a file which contains a list of open class tags,
  i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger will use this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example: (for Penn Treebank tagset)

FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ

* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag in that order separated by a tab character.
  Punctuation marks are considered as tokens and must be tagged as well.
  The file should contain neither empty lines nor untagged SGML markup.
  Example:

Pierre	NP
Vinken	NP
,	,
61	CD
years	NNS

* <output file>: name of the file in which the resulting tagger parameters
  are stored.

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT". It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2 which corresponds to a trigram context.
  For small training corpora and/or large tagsets, it could be useful to
  reduce this parameter to 1.
* -dtg <min. decision tree gain>: Threshold. If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
* -sw <weight>: A smoothing parameter, which determines how much the
  probability distribution of some decision tree node is smoothed with the
  probability distribution of the parent node.
* -ecw <weight>: weight of the equivalence class based probability
  estimates.
* -atg <min. affix tree gain>: Threshold. If the information gain at a leaf
  of an affix tree is below this threshold, it is deleted. The default is 1.2.

The accuracy of the TreeTagger usually improves if different settings
of the above parameters are tested and the best combination is chosen.


Caveat: Make sure that the lexicon and the training corpus contain no
extra blanks. If the word form, for instance, is followed by a blank
and a tab character, the blank will be considered part of the word.
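

Example (illustrative sketch, not part of the TreeTagger distribution)
-----------------------------------------------------------------------

The input format from the Tagging section and the caveat about extra blanks
can be sketched in Python as shown below. This is a minimal sketch under
stated assumptions: the helper names (format_tagger_input,
build_tagger_command, find_extra_blanks) and the parameter file name
"english.par" are hypothetical and chosen only for illustration. The binary
path ./bin/tree-tagger follows the default location mentioned at the top of
this file. The code only builds the input text and the argument list; it does
not call the tagger.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def format_tagger_input(tokens: Iterable[Tuple[str, Sequence[str]]]) -> str:
    """Render one token per line; pretagging tags follow, each preceded by a tab."""
    lines = []
    for text, tags in tokens:
        # Strip surrounding blanks so that none ends up as part of the word form.
        fields = [text.strip()] + [tag.strip() for tag in tags]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


def build_tagger_command(
    parameter_file: str,
    input_file: Optional[str] = None,
    output_file: Optional[str] = None,
    options: Sequence[str] = ("-token", "-lemma"),
    binary: str = "./bin/tree-tagger",  # default location per this README
) -> List[str]:
    """Assemble the argument list: options, parameter file, then optional files."""
    if output_file is not None and input_file is None:
        raise ValueError("an output file can only follow an input file")
    command = [binary, *options, parameter_file]
    if input_file is not None:
        command.append(input_file)
    if output_file is not None:
        command.append(output_file)
    return command


def find_extra_blanks(lines: Iterable[str]) -> List[int]:
    """Return 1-based numbers of lines whose word form has a leading or trailing blank."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        # The word form is everything before the first tab character.
        word = line.rstrip("\n").split("\t", 1)[0]
        if word != word.strip(" "):
            flagged.append(number)
    return flagged


if __name__ == "__main__":
    # The sample input file from the Tagging section, with one pretagged token.
    sample = [("He", []), ("moved", []), ("to", []),
              ("New York City", ["NP"]), (".", [])]
    print(format_tagger_input(sample), end="")
    # "english.par" is a hypothetical parameter file name.
    print(build_tagger_command("english.par", "input.txt", "output.txt"))
```

The list returned by build_tagger_command could be handed to subprocess.run
once a real parameter file is available. Running find_extra_blanks over a
lexicon or training corpus before training is one way to catch the "blank
followed by a tab" problem described in the caveat.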