Turku Dependency Treebank (TDT)

We are building a broad-coverage dependency-annotated treebank of general Finnish. The treebank is annotated in a minor revision of the Stanford dependency scheme (de Marneffe et al. [1,2]). The primary purpose of the treebank is to support Finnish NLP.

News

Feb 23, 2012
A new release of the treebank is now available, containing the data used in our DepLing'11 paper. A technical report documenting the annotation scheme is also available.
May 23, 2011
TDT is now also available for download in the CoNLL format.
Dec 1, 2010
A new release of the treebank is now available, containing 4307 sentences (58k tokens, 313 documents).
Jun 18, 2010
The treebank can now be queried online.
Jan 14, 2010
The software for viewing, searching, and editing the treebank was released. See this link.

Schedule and milestones

The release currently available for download (as of February 2012) comprises 415 documents in the publicly available set. The treebank annotation is now complete and we are preparing the final release.

Treebank text and license

The treebank consists of several sections. Copyright of the texts remains with their authors (see the list of text sources). All sections of the treebank, as well as the annotation, are released under the Creative Commons Attribution-Share Alike license. Note that this license requires you to attribute the original source of the treebank. This is best achieved by linking to this page in all online derivative works and by citing the paper by Haverinen et al. (2011) below.

Download

The treebank can be downloaded here in an XML format as well as the CoNLL-X format.
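
For readers who want to work with the CoNLL-X files, the following is a minimal Python sketch of a reader, not part of the treebank distribution. It assumes the standard CoNLL-X layout (ten tab-separated columns per token, a blank line between sentences). The function names `read_conllx` and `dependencies` are hypothetical, and the sample sentence and its labels are invented for illustration rather than taken from TDT.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten column names of the CoNLL-X format, in order.
CONLLX_FIELDS = [
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
]

Token = Dict[str, str]


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        sentence.append(dict(zip(CONLLX_FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def dependencies(sentence: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head form, dependency type, dependent form) triples.

    A head value of "0" marks the root and is shown as "ROOT".
    """
    forms = {token["id"]: token["form"] for token in sentence}
    return [
        (forms.get(token["head"], "ROOT"), token["deprel"], token["form"])
        for token in sentence
    ]


# Invented example sentence ("The dog sleeps."); the labels are illustrative only.
SAMPLE = (
    "1\tKoira\tkoira\t_\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnukkuu\tnukkua\t_\t_\t_\t0\tROOT\t_\t_\n"
    "3\t.\t.\t_\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        for head, deprel, dependent in dependencies(sent):
            print(f"{deprel}({head}, {dependent})")
```

To read a downloaded file instead of the sample, pass an open file object to `read_conllx`, for example `open(path, encoding="utf-8")`, where `path` is wherever you saved the release.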

Browse and query the treebank online

The treebank can be queried online. A static browsable version of the treebank is available here.

Contact

Any inquiries related to the treebank should be directed to Katri Haverinen (kahave@utu.fi) and Filip Ginter (ginter@cs.utu.fi).

Publications

  • Haverinen, K.; Ginter, F.; Laippala, V.; Kohonen, S.; Viljanen, T.; Nyblom, J. & Salakoski, T.: A Dependency-based Analysis of Treebank Annotation Errors. Proceedings of International Conference on Dependency Linguistics (DepLing'11), Barcelona, Spain, pp. 115-124. 2011. [PDF]
  • Haverinen, K.; Viljanen, T.; Laippala, V.; Kohonen, S.; Ginter, F. & Salakoski, T.: Treebanking Finnish. Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. 2010. [PDF]
  • Haverinen, K.; Ginter, F.; Laippala, V.; Viljanen, T. & Salakoski, T.: Dependency Annotation of Wikipedia: First Steps towards a Finnish Treebank. Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8). 2009. [PDF]

Documentation

The annotation scheme of the treebank is described in detail in the following technical report:

Haverinen, K.: Syntax Annotation Guidelines for the Turku Dependency Treebank. Technical report 1034, Turku Centre for Computer Science. January 2012. [PDF]

Acknowledgments

We are grateful for the funding received so far from:

We thank all the authors who kindly allowed us to include their texts in the treebank, either by explicit permission or by releasing their text under an open license in the first place. Full list here. Kiitos (thank you)!