Clinical Finnish parser demo

We provide an online demo of the statistical parser of clinical Finnish discussed in this paper by Haverinen et al. (2009). The dependency types in the parser output are explained here. Before using the demo, you can also browse the clinical treebank and the parser's output on its test set here to see the kind of language the parser was developed for and the kind of analysis it is expected to produce.

Go to the demo

About the parser

The underlying parser is a statistical full dependency parser induced using the MaltParser system developed at Växjö University and Uppsala University, Sweden. The parser is trained on 2600 sentences (15500 tokens) from the anonymized clinical treebank.

The dependency parser relies on POS tags and morphological information from Lingsoft TWOL and Lingsoft CG, a Finnish morphological analyzer and constraint grammar parser adapted to clinical language, developed and kindly licensed to us by Lingsoft, Inc. These tools tokenize the input text and assign each word its baseform and a set of morphological tags. The baseforms and tags are then used as features in the dependency parser.
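To make the hand-off between morphological analysis and parsing more concrete, here is a minimal sketch of how one analyzed token could be written as a row in the tab-separated CoNLL-X format, which MaltParser can read. Whether this demo uses that format internally is an assumption, and the `Token` class, the `to_conllx_line` function, the example word and its tag names are all hypothetical illustrations rather than actual Lingsoft output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One analyzed token (hypothetical structure, not Lingsoft's own)."""
    index: int             # 1-based position in the sentence
    form: str              # surface form as it appears in the text
    baseform: str          # baseform (lemma) from morphological analysis
    pos: str               # part-of-speech tag
    morph_tags: List[str]  # remaining morphological tags


def to_conllx_line(tok: Token) -> str:
    """Serialize a token as one CoNLL-X row with ten tab-separated columns.

    Columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL,
    PHEAD, PDEPREL. The last four are left as "_" because they are
    what the dependency parser is expected to predict.
    """
    feats = "|".join(tok.morph_tags) if tok.morph_tags else "_"
    columns = [
        str(tok.index), tok.form, tok.baseform,
        tok.pos, tok.pos, feats,
        "_", "_", "_", "_",
    ]
    return "\t".join(columns)


# Illustrative token; the tag names are placeholders, not the real tag set.
example = Token(index=1, form="potilaalla", baseform="potilas",
                pos="N", morph_tags=["SG", "ADE"])
line = to_conllx_line(example)
print(line)
```

The baseform and the morphological tags end up in the LEMMA and FEATS columns, which is where a feature model for the parser would typically look them up.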

Parser performance

On a held-out test set of roughly 200 sentences that was used neither to train the parser nor to optimize its parameters, the parser currently achieves a labeled attachment score of 81%. The labeled attachment score is the proportion of tokens that are assigned both the correct head word and the correct dependency type. The parser output on the test set, with parsing errors highlighted, is available here. It gives an idea of the general usability of the parser and helps you interpret the 81% labeled attachment score.
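As a rough illustration of the metric, the following sketch computes a labeled attachment score from gold-standard and predicted analyses. It is not the evaluation script used for the reported figure; the function name, the data layout and the dependency type names in the example are assumptions made for illustration.

```python
from typing import List, Tuple

# One analysis per token: (index of the head word, dependency type).
Arc = Tuple[int, str]


def labeled_attachment_score(gold: List[Arc], predicted: List[Arc]) -> float:
    """Return the fraction of tokens whose head AND dependency type are correct."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty token sequence")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Four tokens; head index 0 stands for the artificial root.
# The type names below are placeholders, not the treebank's own scheme.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (2, "advmod")]
predicted = [(2, "nsubj"), (0, "root"), (2, "iobj"), (3, "advmod")]

score = labeled_attachment_score(gold, predicted)
print(score)  # 0.5: two of the four tokens match in both head and type
```

In the example, the third token has the right head but the wrong type and the fourth has the right type but the wrong head, so neither counts as correct under the labeled metric.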

Current status and future plans

The parser is currently very much a research prototype, intended to illustrate the possibilities of statistical parsing in a specialized domain, starting from a very small treebank of only about 15000 tokens. At least the following improvements are currently planned:

  • Automatic insertion of null verbs. The parser cannot yet automatically insert the null verbs that the treebank uses to represent omitted main verbs, and expects them to be present in the input.
  • Expanding the clinical treebank. Parsing performance could be improved by annotating more text to train the parser.

Acknowledgments

Morphological Analysis and Constraint Grammar - Powered by Lingsoft