Turku Dependency Treebank (TDT)


TDT is a broad-coverage dependency-annotated treebank of general Finnish. The treebank is annotated in the Universal Dependencies scheme, to which it was converted from its original annotation in a minor revision of the Stanford dependency scheme (de Marneffe et al. [1,2]). The primary purpose of the treebank is to support Finnish NLP.

News

January 2015
The treebank has been fully integrated into the Universal Dependencies project. The authoritative version of TDT is now there.
September 27, 2013
An updated version of the syntax annotation manual for the latest treebank release is now available.
July 29, 2013
A new release of the treebank is now available for download. It is accompanied by a paper in the Language Resources and Evaluation journal.

Schedule and milestones

The release currently available for download through the Universal Dependencies project consists of 181K tokens and 13.5K sentences. A further 20K tokens are kept as a secret test set and will be used for a shared task at a later stage.

Treebank text and license

The treebank consists of a number of different sections. Copyright of the texts remains with their authors (see the list of text sources). All sections of the treebank, as well as the annotation, are released under the Creative Commons Attribution-Share Alike license. Note that this license requires you to credit the original source of the treebank. This is best achieved by linking to this page in all online derivative works and citing the paper by Haverinen et al. (2013) listed below.

Download

The treebank should be downloaded from the Universal Dependencies project page. The original download files are kept here for historical interest only.
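Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line after each sentence. The following is a minimal sketch of reading such a file and counting its sentences and tokens, not an official loader. The names `Token`, `read_conllu`, and `count_sentences_and_tokens` are hypothetical, and the sample sentence is invented for illustration rather than taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    """A subset of the ten CoNLL-U columns (hypothetical helper type)."""
    id: str
    form: str
    lemma: str
    upos: str
    head: str
    deprel: str


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from CoNLL-U formatted lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(
            Token(cols[0], cols[1], cols[2], cols[3], cols[6], cols[7])
        )
    if sentence:
        yield sentence  # file did not end with a blank line


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of sentences, number of syntactic tokens)."""
    n_sentences = 0
    n_tokens = 0
    for sentence in read_conllu(lines):
        n_sentences += 1
        n_tokens += len(sentence)
    return n_sentences, n_tokens


# Invented example sentence, not taken from the treebank.
_ROWS = [
    ("1", "Minä", "minä", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "näen", "nähdä", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "koiran", "koira", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = (
    "# text = Minä näen koiran.\n"
    + "\n".join("\t".join(row) for row in _ROWS)
    + "\n\n"
)

if __name__ == "__main__":
    print(count_sentences_and_tokens(SAMPLE.splitlines()))
```

To run the same count on a downloaded file, pass an open file handle instead of the sample, for example `open(path, encoding="utf-8")`, where `path` points to a local `.conllu` file.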

Browse and query the treebank online

The treebank can be queried online using the SETS system.

Contact

Any inquiries related to the treebank should be directed to Filip Ginter (ginter@cs.utu.fi).

Publications

  • Main reference: Haverinen, K.; Nyblom, J.; Viljanen, T.; Laippala, V.; Kohonen, S.; Missilä, A.; Ojala, S.; Salakoski, T.; Ginter, F.: Building the essential resources for Finnish: the Turku Dependency Treebank. Language Resources and Evaluation. 2013. DOI: 10.1007/s10579-013-9244-1
  • Haverinen, K.; Ginter, F.; Laippala, V.; Kohonen, S.; Viljanen, T.; Nyblom, J. & Salakoski, T.: A Dependency-based Analysis of Treebank Annotation Errors. Proceedings of International Conference on Dependency Linguistics (Depling'11), Barcelona, Spain, pp. 115-124. 2011. [PDF]
  • Haverinen, K.; Viljanen, T.; Laippala, V.; Kohonen, S.; Ginter, F. & Salakoski, T.: Treebanking Finnish. Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. 2010. [PDF]
  • Haverinen, K.; Ginter, F.; Laippala, V.; Viljanen, T. & Salakoski, T.: Dependency Annotation of Wikipedia: First Steps towards a Finnish Treebank. Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8). 2009. [PDF]

Documentation

The annotation scheme of the treebank is described in detail in the following technical report:

Haverinen, K.: Syntax Annotation Guidelines for the Turku Dependency Treebank - 2nd edition, revised for the treebank release of July 2013. Technical report 1034, Turku Centre for Computer Science. January 2012. [PDF]

The scheme is also extensively documented in the Finnish section of the Universal Dependencies project.

Acknowledgments

We are grateful for the funding received so far from:

We thank all the authors who kindly allowed us to include their texts in the treebank, either by explicit permission or by releasing their text under an open license in the first place.