Finnish Internet Parsebank

The Finnish Internet Parsebank project

The Finnish Internet Parsebank is a joint project with the School of Languages and Translation Studies. It aims to produce a mass-scale corpus of Internet Finnish using automatic syntactic analysis and document classification.

The project has three aims:

  • The creation of a language resource with automatic morphological and syntactic analyses
  • The classification of the entire Parsebank into coherent subcorpora
  • The creation of an online user interface

Online access to the Parsebank data and other demos based on the data

The Parsebank data can be queried online in several ways:

○ Lexical search (NoSketchEngine)

An online interface based on the NoSketchEngine is available here. Login name: guest Password: voikukka.

○ Syntax-based search (SETS)

The syntactic trees can be queried using the SETS system here. The search currently covers two samples of the Parsebank data: one of 1 million trees (Fi-Parsebank-1M) and one of 50 million trees (Fi-Parsebank-50M).

○ Semantic similarity of words (word2vec)

An online demo lets you query semantically similar words using a word2vec model trained on the Parsebank data.
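As a rough illustration of what a similarity query over word vectors involves, the following sketch ranks words by cosine similarity to a query word. It is not the code behind the demo: the function names and the tiny three-dimensional vectors are invented for illustration, and a real word2vec model would have far more words and dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return up to topn (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, made up for this example (not taken from the Parsebank models).
toy_vectors = {
    "kissa": [0.9, 0.1, 0.0],
    "koira": [0.8, 0.2, 0.1],
    "auto": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    for neighbour, score in most_similar("kissa", toy_vectors):
        print(neighbour, round(score, 3))
```

With these toy vectors, "koira" ranks above "auto" as a neighbour of "kissa" because its vector points in nearly the same direction.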

Downloadable resources based on the Parsebank data

Several resources based on the Parsebank data can be downloaded for offline use:

○ Raw data

The raw Parsebank data cannot be distributed as-is due to limitations in Finnish copyright law. For research purposes, we can provide the Parsebank with its sentences randomly shuffled, i.e. without the document structure. This data is still useful for a large number of today's NLP tasks. You can request access by emailing Filip Ginter and Veronika Laippala: ginter@cs.utu.fi and mavela@utu.fi.
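To make concrete what "randomly shuffled sentences" means, here is a minimal sketch that pools the sentences of several documents and shuffles them, so that document boundaries and sentence order are no longer recoverable. The function name, the list-of-lists input and the example sentences are assumptions made for illustration; the project's actual shuffling procedure is not described here.

```python
import random


def shuffle_sentences(documents, seed=None):
    """Pool all sentences from all documents and return them in random order.

    documents: a list of documents, each a list of sentence strings.
    seed: optional seed, only so that the example is reproducible.
    """
    sentences = [sentence for doc in documents for sentence in doc]
    rng = random.Random(seed)
    rng.shuffle(sentences)
    return sentences


# Hypothetical input: two short documents.
docs = [
    ["Tämä on ensimmäinen virke.", "Tämä on toinen virke."],
    ["Toinen dokumentti alkaa tästä.", "Ja päättyy tähän."],
]

if __name__ == "__main__":
    for sentence in shuffle_sentences(docs, seed=0):
        print(sentence)
```

Every sentence is preserved, so sentence-level analyses remain possible; anything that depends on context beyond a single sentence does not.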

○ N-gram data

We have generated both flat and syntactic n-gram collections from the first version of the Finnish Internet Parsebank (1.5 billion tokens). The data can be downloaded here.

The n-grams are described in this paper. See also the README file from Google's original English n-gram collection for further information about the format of the syntactic n-grams.
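For readers unfamiliar with the term, a flat n-gram is simply a run of n consecutive tokens. The sketch below counts flat n-grams over tokenized sentences; it is an illustration only, with hypothetical function names and toy input, and it does not reproduce the project's extraction pipeline or the file format of the released collections. Syntactic n-grams, which follow dependency tree structure instead of surface order, are not covered.

```python
from collections import Counter


def flat_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, n):
    """Count flat n-grams over an iterable of tokenized sentences.

    N-grams never cross sentence boundaries in this sketch.
    """
    counts = Counter()
    for tokens in sentences:
        counts.update(flat_ngrams(tokens, n))
    return counts


# Toy tokenized sentences, invented for this example.
sentences = [
    ["minä", "olen", "täällä"],
    ["minä", "olen", "kotona"],
]

if __name__ == "__main__":
    for ngram, count in count_ngrams(sentences, 2).most_common():
        print(" ".join(ngram), count)
```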

Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled to 1 MB/s per connection. Please do not bypass this limit by using multiple connections.

○ Vector space embeddings of words

We have also used the data to induce distributional vector space representations of the lexicon with the word2vec software. These models are available here.

Publications

  • Kanerva, Jenna; Luotolahti, Juhani; Laippala, Veronika; Ginter, Filip: Syntactic N-gram Collection from a Large-Scale Corpus of Internet Finnish. Proceedings of the Sixth International Conference Baltic HLT. 2014. paper

Funding

The project is funded by the Kone Foundation as part of its Language Programme (2014-2016).

License

The n-gram collections and vector space models are released under the Creative Commons Attribution-ShareAlike 4.0 International License.
