TreeTagger - a language independent part-of-speech tagger


The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed within the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been used successfully to tag German, English, French, Italian, Spanish, Bulgarian, Greek and Old French texts, and it is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.

Sample output:
word        pos   lemma
The         DT    the
TreeTagger  NP    TreeTagger
is          VBZ   be
easy        JJ    easy
to          TO    to
use         VB    use
.           SENT  .
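
Output in this three-column form is easy to post-process with standard Unix tools. The following is a minimal sketch, assuming the columns are separated by tabs. The helper names sample_output, extract_lemmas and count_tags are made up for this illustration and are not part of the TreeTagger distribution. The sentence-final SENT row is left out of the sample data.

```shell
#!/bin/sh
# Reproduce the sample rows shown above as tab-separated text.
# Assumption: columns are word, POS tag, lemma, separated by tabs.
sample_output() {
  printf 'The\tDT\tthe\n'
  printf 'TreeTagger\tNP\tTreeTagger\n'
  printf 'is\tVBZ\tbe\n'
  printf 'easy\tJJ\teasy\n'
  printf 'to\tTO\tto\n'
  printf 'use\tVB\tuse\n'
}

# Print only the lemma column, one lemma per line.
# Rows that do not have exactly three fields are skipped.
extract_lemmas() {
  awk -F '\t' 'NF == 3 { print $3 }'
}

# Count how often each POS tag occurs and print "TAG COUNT" lines sorted by tag.
count_tags() {
  awk -F '\t' 'NF == 3 { n[$2]++ } END { for (t in n) print t, n[t] }' | LC_ALL=C sort
}

sample_output | extract_lemmas
sample_output | count_tags
```

In practice the sample_output call would be replaced by the output of one of the tagging scripts, piped into the same helper functions.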

The tagger is described in the following two papers:


Download

Executable code for Sparc workstations, Linux and Windows PCs, and Macs, as well as parameter files for English, German, Italian, Spanish, Bulgarian, French and Old French, can be downloaded via the links below.

The French and the Italian parameter files are provided by Achim Stein.

The English parameter file was trained on the Penn Treebank and uses the English morphological database created by Karp, Schabes, Zaidel and Egedi.

The Spanish parameter file was trained on the Spanish CRATER corpus and uses the Spanish lexicon of the CALLHOME corpus of the LDC.

The Bulgarian parameter file was trained by Julien Nioche on the Bulgarian Treebank. It uses a UTF-8 encoding.

This software is freely available for research, education and evaluation. For commercial licenses and for licenses for the C programming interface, please contact Helmut Schmid (at FirstName.LastName@ims.uni-stuttgart.de).

Please read the license terms before you download the software! By downloading the software, you agree to the terms stated there.

The following steps are necessary to install the TreeTagger (see below for the Windows version):

  1. Download the tagger package for your system (Sparc-Solaris, PC-Linux, Mac OS-X).
  2. Download the tagging scripts into the same directory.
  3. Download the parameter files for your system (Sparc-Solaris, PC, Mac).
  4. Download the installation script install-tagger.sh.
  5. Open a terminal window and run the installation script in the directory where you downloaded the files:
    sh install-tagger.sh
  6. Run a test, e.g.
    echo 'Hello world!' | cmd/tree-tagger-english
    or
    echo 'Das ist ein Test.' | cmd/tagger-chunker-german

If you have difficulties with the installation, see the installation hints (kindly provided by Joachim Wagner).
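
Before running the test commands from step 6, it can help to confirm that the tagging scripts are actually in place. The following is a minimal sketch, assuming the installation directory contains a cmd subdirectory holding the scripts named in step 6. The function check_tagger_script is a hypothetical helper written for this illustration and is not shipped with the TreeTagger.

```shell
#!/bin/sh
# Report whether a tagging script exists and is executable under DIR/cmd.
# $1: installation directory (assumed to contain a cmd subdirectory)
# $2: script name, e.g. tree-tagger-english
check_tagger_script() {
  dir=$1
  name=$2
  if [ -x "$dir/cmd/$name" ]; then
    echo "ok: $dir/cmd/$name"
    return 0
  fi
  echo "missing or not executable: $dir/cmd/$name" >&2
  return 1
}

# Check the two scripts used in the test step, relative to the current directory.
# "|| true" keeps the script going so that both results are reported.
check_tagger_script . tree-tagger-english || true
check_tagger_script . tagger-chunker-german || true
```

If a script is reported as missing, the installation script was most likely run in a different directory from the one holding the downloaded files.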

Parameter files for Sparc-Solaris and Mac OS-X (Latin1 character set)

Parameter files for PC (Linux and Windows, Latin1 character set)

A Windows version of the TreeTagger is also available. The parameter files have to be downloaded separately.

Tagsets

Here is some information about the tagsets used in the parameter files:


Links