Turku Clinical TreeBank and PropBank

We have developed a dependency-annotated treebank of Finnish Intensive Care Nursing Narratives. The treebank is annotated in a minor revision of the Stanford dependency scheme (de Marneffe et al. [1,2]). A PropBank-style predicate argument annotation is built on top of the syntactic annotation, covering 90% of all verb occurrences in the corpus. The argument annotation is tightly bound to the syntax, requiring arguments to be governed by the verb. The verb framesets are defined here.
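
The constraint that arguments must be governed by the verb can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `Dependency` class, the `invalid_arguments` helper, the English toy sentence and the argument labels are hypothetical and do not reflect the corpus XML format or its framesets. The sketch also assumes that "governed by the verb" means being a direct dependent of the verb.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Dependency:
    """One typed dependency (hypothetical in-memory form, not the corpus format)."""
    governor: int   # token index of the governing word
    dependent: int  # token index of the dependent word
    dep_type: str   # Stanford-style type name, e.g. "nsubj"


def governed_by(verb: int, deps: List[Dependency]) -> Set[int]:
    """Return the token indices that are direct dependents of the verb."""
    return {d.dependent for d in deps if d.governor == verb}


def invalid_arguments(verb: int, arguments: Dict[str, int],
                      deps: List[Dependency]) -> Dict[str, int]:
    """Return the argument labels whose token is NOT a direct dependent of the verb."""
    allowed = governed_by(verb, deps)
    return {label: tok for label, tok in arguments.items() if tok not in allowed}


# Invented toy sentence: "Nurse gave patient pain medication"
# token indices:           0     1    2       3    4
deps = [
    Dependency(1, 0, "nsubj"),  # gave -> Nurse
    Dependency(1, 2, "iobj"),   # gave -> patient
    Dependency(1, 4, "dobj"),   # gave -> medication
    Dependency(4, 3, "nn"),     # medication -> pain
]

# All three arguments attach directly to the verb, so none is flagged.
ok = invalid_arguments(1, {"Arg0": 0, "Arg1": 4, "Arg2": 2}, deps)

# "pain" (token 3) is governed by "medication", not by the verb, so it is flagged.
bad = invalid_arguments(1, {"Arg0": 0, "Arg1": 3}, deps)
```

Here `ok` is an empty dictionary, while `bad` contains the single entry for `Arg1`, because that token hangs off the noun rather than the verb.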

The corpus text is automatically POS-tagged using the Lingsoft TWOL morphological analyzer and the Lingsoft CG constraint grammar parser (Lingsoft, Inc.), both adapted for clinical language.

Underlying text corpus

The text of the corpus consists of nursing notes for eight patients, amounting to roughly 2800 sentences (17000 tokens). Many of these sentences are very short, often repeated fragments. The number of unique sentences in the corpus is thus about 2000. The data was gathered within the Louhi project and is described, for example, in this paper by Suominen et al.

The corpus has been manually anonymized and contains no private patient or staff information. All names in the corpus have been changed, as have any other statements that could in any way be linked to an individual person. The anonymization was performed independently by two nursing science researchers.

License

The corpus and its annotation are released under the Creative Commons Attribution-Share Alike license. Note that this license requires you to refer to the original source of the corpus. This is best achieved by linking to this page in all online derivative works, and citing the paper by Haverinen et al. (2010) below.

Copyright

  • Corpus annotation: Copyright © 2009-2010 Filip Ginter, Katri Haverinen, Veronika Laippala, Timo Viljanen, BioNLP Group, University of Turku
  • POS tagging: Copyright © 2010, Lingsoft, Inc.
  • Corpus text: Copyright © 2005-2010, Department of Nursing Sciences (prof. Sanna Salanterä), University of Turku

Download

The corpus can be downloaded here in an easy-to-process XML format. The frameset files, in an XML format, can be downloaded here.

Query and browse the corpus online

The corpus can also be queried online here and browsed here.

Contact

Any inquiries related to the Turku Clinical TreeBank and PropBank should be directed to Katri Haverinen (kahave@utu.fi) and Filip Ginter (ginter@cs.utu.fi).

Publications

  • Haverinen, K.; Ginter, F.; Laippala, V.; Viljanen, T. & Salakoski, T.: Dependency-based PropBanking of Clinical Finnish. In Proceedings of The Fourth Linguistic Annotation Workshop (LAW IV) held at ACL2010, Uppsala, Sweden. 2010. (to appear)
  • Haverinen, K.; Ginter, F.; Laippala, V. & Salakoski, T.: Parsing Clinical Finnish: Experiments with Rule-Based and Statistical Dependency Parsers. In Proceedings of NODALIDA'09, Odense, Denmark, pp. 65-72. 2009.

Acknowledgements

  • We are grateful to Heljä Lundgrén-Laine, Riitta Danielsson-Ojala and prof. Sanna Salanterä for their assistance in the anonymization of the corpus.
  • The corpus text was gathered within the Louhi project.
  • We are grateful to Lingsoft, Inc. for their collaboration on automated POS tagging of the corpus text.

Literature

For details regarding the various layers of annotation in the corpus, we refer you to the following papers and online resources:

Syntactic annotation

PropBank scheme

  • PropBank: Palmer et al. (2005) (Computational Linguistics)
  • Clinical PropBank: Haverinen et al. (2010) (LAW IV 2010, to appear)