/*****************************************************************************/
/* How to use the TreeTagger                                                 */
/*****************************************************************************/


The TreeTagger consists of two programs: train-tree-tagger is used to
create a parameter file from a lexicon and a hand-tagged corpus.
tree-tagger expects a parameter file and a text file as arguments and
annotates the text with part-of-speech tags. The file formats are
described below. By default, the programs are located in the ./bin
sub-directory.

If either of the programs is called without arguments, it will print
information about its usage.


Tagging
-------

Tagging is done with the tree-tagger program. It requires at least one
command line argument, the parameter file. If no input file is specified,
input will be read from stdin. If neither an input file nor an output file
is specified, the tagger will print to stdout.

tree-tagger <parameter file> <input file> <output file> {-eps <epsilon>}
   {-base} {-proto} {-sgml} {-token} {-lemma} {-beam <threshold>}

Description of the command line arguments:

* <parameter file>: Name of a parameter file which was created with the
  train-tree-tagger program.
* <input file>: Name of the file which is to be tagged. Each token in this
  file must be on a separate line. Tokens may contain blanks. It is possible
  to override the lexical information contained in the parameter file of the
  tagger by specifying a list of possible tags after a token. This list has
  to be preceded by a tab character.
  The tags are optionally followed by a
  floating point value to specify the probability of the tag. Adding such
  tag information in the tagger's input is sometimes useful to ensure that
  certain text-specific expressions are tagged properly.
  Punctuation marks must be on separate lines as well. Clitics (like "'s",
  "'re", and "'d" in English or "-la" and "-t-elle" in French) should be
  separated if they were separated in the training data. (The French and
  English parameter files available by ftp expect separation of clitics.)
  Sample input file (a tab character separates "New York City" from "NP 1.0"):
    He
    moved
    to
    New York City	NP 1.0
    .
* <output file>: Name of the file to which the tagger should write its output.

Further optional command line arguments:

* -token: tells the tagger to print the words also.
* -lemma: tells the tagger to print the lemmas of the words also.
* -sgml: tells the tagger to ignore tokens starting with '<' and ending
  with '>' (SGML tags).

The options below are for advanced users. Read the papers on the TreeTagger
to fully understand their meaning.

* -proto: If this option is specified, the tagger creates a file named
  "lexicon-protocol.txt", which contains information about the degree of
  ambiguity and about the other possible tags of a word form. The part of
  the lexicon in which the word form has been found is also indicated. 'f'
  means fullform lexicon and 's' means affix lexicon. 'h' means that the
  word contains a hyphen and that the part of the word following the
  hyphen has been found in the fullform lexicon.
* -eps <epsilon>: Value which is used to replace zero lexical frequencies.
  A zero frequency arises if a word/tag pair is contained in the lexicon
  but not in the training corpus. The default is 0.1. The choice of this
  parameter has some minor influence on tagging accuracy.
* -beam <threshold>: If the tagger is slow, this option can be used to
  speed it up. Good values for <threshold> are in the range 0.001-0.00001.
* -base: If this option is specified, only lexical information is used
  for tagging but no contextual information about the preceding tags.
  This option is only useful in order to obtain a baseline result
  to which to compare the actual tagger output.

There is another tagger program called "tree-tagger-flush" which
flushes the output after reading an empty line. It expects a parameter
file as argument and reads from stdin and writes to stdout. No command
line options are supported. This program might be useful for
implementing wrappers.


Training
--------

Training is done with the train-tree-tagger program. It expects at least
four command line arguments which are described below.

train-tree-tagger <lexicon> <open class file> <input file> <output file>
   {-cl <context length>} {-dtg <min. decision tree gain>}
   {-ecw <eq. class weight>} {-atg <min. affix tree gain>} {-st <sent. tag>}

Description of the command line arguments:

* <lexicon>: name of a file which contains the fullform lexicon. Each line
  of the lexicon corresponds to one word form and contains the word form
  itself followed by a Tab character and a sequence of tag-lemma pairs.
  The tags and lemmata are separated by whitespace.
  Example:

    aback	RB aback
    abacuses	NNS abacus
    abandon	VB abandon VBP abandon
    abandoned	JJ abandoned VBD abandon VBN abandon
    abandoning	VBG abandon

  Remark: The tagger does not actually need the lemmata. If you do not have
  the lemma information or if you do not plan to annotate corpora with
  lemmas, you can replace the lemma with a dummy value, e.g. "-".

* <open class file>: name of a file which contains a list of open class
  tags, i.e. possible tags of unknown word forms, separated by whitespace.
  The tagger will use this information when it encounters unknown words,
  i.e. words which are not contained in the lexicon.
  Example: (for Penn Treebank tagset)

    FW JJ JJR JJS NN NNS NP NPS RB RBR RBS VB VBD VBG VBN VBP VBZ

* <input file>: name of a file which contains tagged training data. The data
  must be in one-word-per-line format. This means that each line contains
  one token and one tag, in that order, separated by a tab character.
  Punctuation marks are considered as tokens and must have been tagged as well.
  Example:

    Pierre	NP
    Vinken	NP
    ,	,
    61	CD
    years	NNS

* <output file>: name of the file in which the resulting tagger parameters
  are stored.

The following parameters are optional. Read the papers on the TreeTagger to
fully understand their meaning.

* -st <sent. tag>: the end-of-sentence part-of-speech tag, i.e. the tag which
  is assigned to sentence punctuation like ".", "!", "?".
  Default is "SENT".
  It is important to set this option properly if your
  tag for sentence punctuation is not "SENT".
* -cl <context length>: number of preceding words forming the statistical
  context. The default is 2, which corresponds to a trigram context. For
  small training corpora and/or large tagsets, it could be useful to reduce
  this parameter to 1.
* -dtg <min. decision tree gain>: Threshold. If the information gain at a
  leaf node of the decision tree is below this threshold, the node is deleted.
  The default value is 0.7.
* -ecw <eq. class weight>: weight of the equivalence class based probability
  estimates. The default is 0.15.
* -atg <min. affix tree gain>: Threshold. If the information gain at a leaf
  of an affix tree is below this threshold, it is deleted. The default is 1.2.

Tagging accuracy usually improves slightly if different settings of the
above parameters are tested and the best combination is chosen.
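
The training file formats and options described above can be sketched in
Python. This is only an illustration and not part of the TreeTagger
distribution: the function names (parse_lexicon_line, parse_training_line,
train_command) are hypothetical, the binary path assumes the default ./bin
location, and the argument order (the four file names first, options after
them) is an assumption. The sketch checks single lines of the lexicon and of
the training data and builds, but does not run, a train-tree-tagger command
line.

```python
def parse_lexicon_line(line):
    """Split a lexicon line into (word form, [(tag, lemma), ...]).

    Expected layout: word form, a tab, then whitespace-separated
    tag-lemma pairs.
    """
    form, sep, rest = line.rstrip("\r\n").partition("\t")
    fields = rest.split()
    if not sep or not form or not fields or len(fields) % 2 != 0:
        raise ValueError(f"malformed lexicon line: {line!r}")
    return form, list(zip(fields[0::2], fields[1::2]))


def parse_training_line(line):
    """Split a training line into (token, tag); a tab separates the two."""
    token, sep, tag = line.rstrip("\r\n").partition("\t")
    if not sep or not token or tag.split() != [tag]:
        raise ValueError(f"malformed training line: {line!r}")
    return token, tag


def train_command(lexicon, open_class_file, training_file, output_file, *,
                  sent_tag="SENT", context_length=2, dt_gain=0.7,
                  eq_class_weight=0.15, affix_gain=1.2,
                  binary="./bin/train-tree-tagger"):
    """Build an argument list for train-tree-tagger.

    The keyword defaults repeat the defaults stated in the text.
    Assumption: file names come first, options follow.
    """
    return [
        binary, lexicon, open_class_file, training_file, output_file,
        "-st", sent_tag,
        "-cl", str(context_length),
        "-dtg", str(dt_gain),
        "-ecw", str(eq_class_weight),
        "-atg", str(affix_gain),
    ]


if __name__ == "__main__":
    print(parse_lexicon_line("abandon\tVB abandon VBP abandon"))
    print(parse_training_line("Pierre\tNP"))
    # Hypothetical file names, with the context reduced to one preceding word.
    print(train_command("lexicon.txt", "open-class.txt", "train.txt",
                        "my.par", context_length=1))
```

The returned list can be handed to subprocess.run(cmd, check=True). Testing
different settings of -cl, -dtg, -ecw and -atg then amounts to calling
train_command in a loop with different keyword arguments and comparing the
accuracy of the resulting parameter files on held-out data.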