The Finnish Internet Parsebank project

The Finnish Internet Parsebank is a joint project with the School of Languages and Translation Studies. It aims to produce a large-scale corpus of Internet Finnish using automatic syntactic analysis and document classification.

The project has three aims:

  • The creation of a language resource with automatic morphological and syntactic analyses
  • The classification of the entire Parsebank into coherent subcorpora
  • The creation of an online user interface

Online access to the Parsebank data and other demos based on the data

The Parsebank data can be queried online in several ways:

○ Lexical search (NoSketchEngine)

An online interface based on NoSketchEngine is available here. Log in with the user name "guest" and the password "voikukka".

○ Syntax-based search (SETS)

The syntactic trees can be queried using the SETS system here. The search currently covers two samples of the Parsebank data: one of 1 million trees (Fi-Parsebank-1M) and one of 50 million trees (Fi-Parsebank-50M).

○ Semantic similarity of words (word2vec)

An online demo lets you query semantically similar words using a word2vec model trained on the Parsebank data.
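As a rough illustration of what such a similarity query computes, the sketch below ranks words by the cosine similarity of their vectors. It is not the demo's actual code: the function names `cosine` and `most_similar` and the three-dimensional toy vectors are invented for this example, whereas real word2vec models use vectors with hundreds of dimensions learned from the corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # do not return the query word itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for illustration only (not taken from the Parsebank models).
toy_vectors = {
    "kissa": [0.9, 0.1, 0.0],
    "koira": [0.8, 0.2, 0.1],
    "auto": [0.0, 0.9, 0.4],
    "juna": [0.1, 0.8, 0.5],
}

neighbours = most_similar("kissa", toy_vectors, topn=2)
print(neighbours)
```

With these toy values, "koira" is ranked closest to "kissa" because their vectors point in nearly the same direction.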

Downloadable resources based on the Parsebank data

Several resources based on the Parsebank data can be downloaded for offline use:

○ Raw data

The raw Parsebank data cannot be distributed as-is due to limitations in Finnish copyright law. For research purposes, we can provide the Parsebank with its sentences randomly shuffled, i.e. without the document structure. This data is still useful for a large number of today's NLP tasks. You can request access by emailing Filip Ginter and Veronika Laippala.
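To make "randomly shuffled sentences" concrete, here is a minimal sketch of the idea, assuming each document is simply a list of sentence strings. It is not the project's actual pipeline, and the function name `shuffle_sentences` and the sample documents are hypothetical.

```python
import random


def shuffle_sentences(documents, seed=None):
    """Pool the sentences of all documents and shuffle them.

    The result keeps every sentence intact but no longer records
    which document a sentence came from or in what order it appeared.
    """
    sentences = [sentence for doc in documents for sentence in doc]
    random.Random(seed).shuffle(sentences)
    return sentences


# Hypothetical input: two tiny documents of two sentences each.
documents = [
    ["Kissa nukkuu.", "Se herää illalla."],
    ["Juna lähtee kello kuusi.", "Liput ostetaan asemalta."],
]

shuffled = shuffle_sentences(documents, seed=0)
```

Sentence-level tasks such as parsing or training word vectors are unaffected by this, while anything that depends on the order of sentences within a document is not possible with the shuffled data.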

○ N-gram data

We have generated both flat and syntactic n-gram collections from the first version of the Finnish Internet Parsebank (1.5 billion tokens). The data can be downloaded here.

The n-grams are described in this paper. See also the README file of Google's original English n-gram collection for further information about the format of the syntactic n-grams.
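For orientation, the sketch below parses one line in the layout used by Google's English syntactic n-grams: tab-separated fields for the head word, the n-gram and the total count, with each token written as word/pos-tag/dep-label/head-index. Whether the Finnish files match this layout field for field is an assumption here, so check the paper and README before relying on it. The sample line, its tags and its count are invented.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    pos: str
    deprel: str
    head: int  # 1-based index of the head token in the n-gram, 0 for the root


def parse_syntactic_ngram(line: str) -> Tuple[str, List[Token], int]:
    """Parse one tab-separated syntactic n-gram line.

    Assumed fields: head word, n-gram, total count. Any further
    fields (e.g. per-year counts in the English data) are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    head_word, ngram, total = fields[0], fields[1], int(fields[2])
    tokens = []
    for item in ngram.split(" "):
        # Split from the right so that a "/" inside the word itself survives.
        word, pos, deprel, head = item.rsplit("/", 3)
        tokens.append(Token(word, pos, deprel, int(head)))
    return head_word, tokens, total


# Invented sample line; tags and count are illustrative only.
sample_line = "juoksee\tkoira/N/nsubj/2 juoksee/V/ROOT/0\t42\n"
head_word, tokens, total = parse_syntactic_ngram(sample_line)
```

Here `tokens[0].head` is 2, meaning that "koira" depends on the second token, "juoksee", which is the root of the fragment.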

Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled to 1 MB/s per connection. Please do not bypass this limit by using multiple connections.

○ Vector space embeddings of words

We have also used the data to induce distributional vector space representations of the lexicon with the word2vec software. These models are available here.
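The word2vec software can write models in a plain-text format: a header line with the vocabulary size and the vector dimension, followed by one line per word with its vector components. The sketch below reads that format using only the standard library. Whether the downloadable models are in text or binary form is not stated here, so this is an assumption; the function name `read_word2vec_text` and the sample data are invented.

```python
import io


def read_word2vec_text(handle):
    """Read a word2vec text-format model into a dict of word -> list of floats."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if parts == [""]:
            continue  # skip blank lines
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError(f"header promised {vocab_size} words, found {len(vectors)}")
    return vectors


# Invented two-word model with three dimensions, standing in for a real file.
sample = io.StringIO("2 3\nkissa 0.9 0.1 0.0\nkoira 0.8 0.2 0.1\n")
vectors = read_word2vec_text(sample)
```

For a real model file, the same function can be given a file object opened with `open(path, encoding="utf-8")`. Large models are better handled with a dedicated library that stores the vectors in arrays rather than Python lists.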


  • Kanerva, Jenna; Luotolahti, Juhani; Laippala, Veronika; Ginter, Filip: Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish. Proceedings of the Sixth International Conference Baltic HLT. 2014.


The project is funded by the Kone Foundation as part of its Language Programme (2014-2016).


The n-gram collections and vector space models are released under the Creative Commons Attribution-ShareAlike 4.0 International License.

