- Finnish NLP
- Clinical NLP
We have developed a dependency-annotated treebank of Finnish Intensive Care Nursing Narratives. The treebank is annotated using a slightly revised version of the Stanford dependency scheme (de Marneffe et al. [1,2]). A PropBank-style predicate-argument annotation is built on top of the syntactic annotation, covering 90% of all verb occurrences in the corpus. The argument annotation is tightly bound to the syntax: every argument must be governed by its verb in the dependency tree. The verb framesets are defined here.
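The requirement that arguments be governed by the verb can be illustrated with a minimal sketch. The `Token` class, the `ungoverned_arguments` helper, the English toy sentence, and the argument labels below are all hypothetical and chosen only for illustration; they are not the corpus file format or its tooling. The sketch reads "governed by the verb" as "a direct dependent of the verb", which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Token:
    idx: int     # 1-based position in the sentence
    form: str    # surface form
    head: int    # idx of the governing token, 0 for the root
    deprel: str  # dependency type (Stanford-style label)


def ungoverned_arguments(tokens, verb_idx, arguments):
    """Return labels of arguments whose head token is not a dependent of the verb.

    `arguments` maps an argument label to the idx of its head token.
    An empty result means the annotation satisfies the constraint.
    """
    by_idx = {t.idx: t for t in tokens}
    return [label for label, tok_idx in arguments.items()
            if by_idx[tok_idx].head != verb_idx]


# Invented toy sentence (not from the corpus): "Patient slept very well"
tokens = [
    Token(1, "Patient", 2, "nsubj"),
    Token(2, "slept", 0, "root"),
    Token(3, "very", 4, "advmod"),
    Token(4, "well", 2, "advmod"),
]

# Both arguments attach directly to the verb "slept" (idx 2).
ok = ungoverned_arguments(tokens, 2, {"Arg0": 1, "ArgM-MNR": 4})

# "very" is governed by "well", not by the verb, so it is flagged.
bad = ungoverned_arguments(tokens, 2, {"ArgM-MNR": 3})
```

A check of this kind is only possible because the argument layer shares its tokens and heads with the dependency layer, so no separate span alignment is needed.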
The corpus text is automatically POS-tagged using the TWOL morphological analyzer and the CG constraint grammar parser by Lingsoft, Inc., both adapted for clinical language.
The text of the corpus consists of nursing notes for eight patients, amounting to roughly 2800 sentences (17000 tokens). Many of these sentences are very short fragments that are often repeated, so the number of unique sentences in the corpus is about 2000. The data was gathered within the Louhi project and is described, for example, in this paper by Suominen et al.
The corpus has been manually anonymized and contains no private patient or staff information. All names in the corpus have been changed, as have any other statements that could in any way be linked to a single person. The anonymization was performed independently by two nursing science researchers.
The corpus and its annotation are released under the Creative Commons Attribution-Share Alike license. Note that this license requires you to refer to the original source of the corpus. This is best achieved by linking to this page in all online derivative works, and citing the paper by Haverinen et al. (2010) below.
For details regarding the various layers of annotation in the corpus, we refer you to the following papers and online resources: