The Turku Dependency Treebank (TDT) is a broad-coverage dependency-annotated treebank of general Finnish. The treebank is annotated in a minor revision of the Stanford dependency scheme (de Marneffe et al. [1,2]). The primary purpose of the treebank is to support Finnish NLP.
The release currently available for download (as of July 2013) comprises 678 documents in the publicly available set and 76 in the held-out test set. The syntax annotation is complete with this release. PropBank-style annotation of TDT is currently in progress.
The treebank consists of a number of different sections. The copyright of the texts remains with their authors (see the list of text sources). All sections of the treebank, as well as the annotation, are released under the Creative Commons Attribution-Share Alike license. Note that this license requires you to credit the original source of the treebank. This is best achieved by linking to this page in all online derivative works and by citing the paper by Haverinen et al. (2013) below.
The treebank can be downloaded here in both an XML format and the CoNLL-X format.
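For readers who have not worked with CoNLL-X files before, the following is a minimal Python sketch of how such a file could be read. It relies only on the general CoNLL-X conventions (ten tab-separated columns per token, sentences separated by blank lines) and has not been checked against the TDT files themselves. The names `Token` and `read_conllx` are hypothetical, and the sample sentence with its tags is invented for illustration; it is not taken from the treebank.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Token(NamedTuple):
    """One token line of a CoNLL-X file (first eight of the ten columns)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: Optional[int]  # None when the head is withheld ("_")
    deprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        head = int(cols[6]) if cols[6] != "_" else None
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], head, cols[7]))
    # Emit the last sentence if the input does not end with a blank line.
    if sentence:
        yield sentence


# Invented example sentence; the tags and labels are assumptions.
SAMPLE = (
    "1\tKoira\tkoira\tN\tN\t_\t2\tnsubj\t_\t_\n"
    "2\tnukkuu\tnukkua\tV\tV\t_\t0\tROOT\t_\t_\n"
    "3\t.\t.\tPunct\tPunct\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sent in read_conllx(SAMPLE.splitlines()):
        for tok in sent:
            print(tok.id, tok.form, tok.head, tok.deprel)
```

To read a downloaded file instead of the sample, an open file object can be passed directly, for example `read_conllx(open(path, encoding="utf-8"))`, where `path` is wherever the file was saved.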
To facilitate testing and result comparison, the annotation of the TDT test set is held out. Use this service to test your parser output.
The annotation scheme of the treebank is described in detail in the following technical report:
Haverinen, K.: Syntax Annotation Guidelines for the Turku Dependency Treebank - 2nd edition, revised for the treebank release of July 2013. Technical report 1034, Turku Centre for Computer Science. January 2012. [PDF]
We are grateful for the funding received so far from:
We thank all the authors who kindly allowed us to include their texts in the treebank, either by explicit permission or by releasing their text under an open license in the first place. Full list here. Kiitos!