## root / tmp / org.txm.treetagger.core.macosx / res / macosx / README @ 2854

/*****************************************************************************/
/* How to use the TreeTagger                                                 */
/*****************************************************************************/

The TreeTagger consists of two programs: train-tree-tagger creates a
parameter file from a lexicon and a hand-tagged corpus. tree-tagger expects
a parameter file and a text file as arguments and annotates the text with
part-of-speech tags. The file formats are described below. By default, the
programs are located in the ./bin sub-directory.

If either of the programs is called without arguments, it prints
information about its usage.

Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input is read from stdin. If neither an input file nor an output file
is specified, the tagger prints to stdout.

tree-tagger {-eps <epsilon>} <parameter file> <input file> <output file>
  {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file must be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token. This list has
  to be preceded by a tab character. Each tag may be followed by a
  floating point value specifying the probability of the tag. Adding such
  tag information to the tagger's input is sometimes useful to ensure that
  certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data.
  (The French and English parameter files available by ftp expect
  separation of clitics.)
  Sample input file:
    He
    moved
    to
    New York City	NP 1.0
    .
* <output file>: Name of the file to which the tagger should write its output.

Further optional command line arguments:

* -token: tells the tagger to also print the words.
* -lemma: tells the tagger to also print the lemmas of the words.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).

The options below are for advanced users. Read the papers on the TreeTagger
to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form was found is also indicated: 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen was found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  A zero frequency occurs when a word/tag pair is contained in the lexicon
  but not in the training corpus. The default is 0.1. The choice of this
  parameter has some minor influence on tagging accuracy.
* -beam <threshold>: If the tagger is slow, this option can be used to
  speed it up. Good values for <threshold> are in the range 0.001-0.00001.
* -base: If this option is specified, only lexical information is used
  for tagging, without contextual information about the preceding tags.
  This option is only useful for obtaining a baseline result
  against which to compare the actual tagger output.

There is another tagger program called "tree-tagger-flush" which
flushes the output after reading an empty line. It expects a parameter
file as argument, reads from stdin, and writes to stdout. No command
line options are supported.
This program might be useful for implementing wrappers.

Training
--------

Training is done with the train-tree-tagger program. It expects at least
four command line arguments, which are described below.

train-tree-tagger <lexicon> <open class file> <input file> <output file>
  {-cl <context length>} {-dtg <min. decision tree gain>}
  {-ecw <eq. class weight>} {-atg <min. affix tree gain>} {-st <sent. tag>}

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  itself followed by a tab character and a sequence of tag-lemma pairs.
  The tags and lemmata are separated by whitespace.
  Example:
    aback	RB aback
    abacuses	NNS abacus
    abandon	VB abandon VBP abandon
    abandoned	JJ abandoned VBD abandon VBN abandon
    abandoning	VBG abandon
  Remark: The tagger does not actually need the lemmata. If you do not have
  the lemma information or if you do not plan to annotate corpora with
  lemmas, you can replace the lemma with a dummy value, e.g. "-".
* <open class file>: name of a file which contains a list of open class
  tags, i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger uses this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example (for the Penn Treebank tagset):
    FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ
* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag, in that order, separated by a tab character.
  Punctuation marks are considered tokens and must be tagged as well.
  Example:
    Pierre	NP
    Vinken	NP
    ,	,
    61	CD
    years	NNS
* <output file>: name of the file in which the resulting tagger parameters
  are stored.

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT".
  It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2, which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it can be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: Threshold. If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
  The default value is 0.7.
* -ecw <eq. class weight>: weight of the equivalence class based probability
  estimates. The default is 0.15.
* -atg <min. affix tree gain>: Threshold. If the information gain at a leaf
  of an affix tree is below this threshold, it is deleted. The default is 1.2.

The accuracy of the TreeTagger usually improves slightly if different
settings of the above parameters are tested and the best combination is
chosen.
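
The file formats described above are simple enough to generate and check from a script. The following is a minimal Python sketch of a wrapper, not part of the TreeTagger distribution. The helper names (format_token, build_command, parse_lexicon_line, tag_lines) are hypothetical, the binary is assumed to be reachable as "tree-tagger" on the PATH, and the argument order simply follows the usage line in the Tagging section.

```python
import subprocess

# Assumption: the binary is reachable under this name. The README only says
# that the programs live in the ./bin sub-directory by default.
TAGGER = "tree-tagger"


def format_token(token, tags=None):
    """Return one line of tagger input.

    tags is an optional list of (tag, probability) pairs; probability may be
    None. The override list is separated from the token by a tab character.
    """
    if not tags:
        return token
    overrides = " ".join(
        tag if prob is None else f"{tag} {prob}" for tag, prob in tags
    )
    return f"{token}\t{overrides}"


def build_command(param_file, input_file=None, output_file=None,
                  token=True, lemma=True, sgml=False):
    """Assemble the argument list: parameter file, files, then options."""
    if output_file and not input_file:
        # The output file is the second positional file argument.
        raise ValueError("an output file requires an input file")
    cmd = [TAGGER, param_file]
    cmd += [f for f in (input_file, output_file) if f]
    for enabled, flag in ((token, "-token"), (lemma, "-lemma"), (sgml, "-sgml")):
        if enabled:
            cmd.append(flag)
    return cmd


def parse_lexicon_line(line):
    """Split a fullform lexicon line into (word form, [(tag, lemma), ...]).

    Assumes that neither tags nor lemmas contain whitespace.
    """
    word, _, rest = line.rstrip("\n").partition("\t")
    fields = rest.split()
    if not word or not fields or len(fields) % 2:
        raise ValueError(f"malformed lexicon line: {line!r}")
    return word, list(zip(fields[0::2], fields[1::2]))


def tag_lines(param_file, lines):
    """Pipe input lines through the tagger via stdin and stdout.

    Requires an installed tagger and an existing parameter file.
    """
    result = subprocess.run(
        build_command(param_file),
        input="\n".join(lines) + "\n",
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

For example, format_token("New York City", [("NP", 1.0)]) produces the override line from the sample input file, and parse_lexicon_line can be run over every line of a lexicon to catch missing tabs or unpaired tags before training. A call such as tag_lines("english.par", ["He", "moved", "."]) would run the tagger itself; "english.par" is a placeholder name for whatever parameter file is available.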