The Finnish Internet Parsebank is a joint project with the School of Languages and Translation Studies. It aims to produce a mass-scale corpus of Internet Finnish using automatic syntactic analysis and document classification.
The project has three aims:
The Parsebank data can be queried online in several ways:
An online interface based on the NoSketchEngine is available here. Login name: guest Password: voikukka.
The syntactic trees can be queried using the SETS system here. The search currently covers two samples of the Parsebank data: 1 million trees (Fi-Parsebank-1M) and 50 million trees (Fi-Parsebank-50M).
An online demo lets you query semantically similar words using a word2vec model trained on the Parsebank data.
Several resources based on the Parsebank data can be downloaded for offline use:
The raw Parsebank data cannot be distributed as-is because of restrictions in Finnish copyright law. For research purposes, we can provide the Parsebank with its sentences randomly shuffled, i.e. without the document structure. This data is still useful for many current NLP tasks. You can request access by emailing Filip Ginter and Veronika Laippala: ginter@cs.utu.fi and mavela@utu.fi.
We have generated both flat and syntactic n-gram collections from the first version of the Finnish Internet Parsebank (1.5 billion tokens). The data can be downloaded here.
The n-grams are described in this paper. See also the README file from Google's original English n-gram collection for further information about the format of the syntactic n-grams.
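As a rough illustration of that format, the sketch below reads one syntactic n-gram line in Python. It assumes the layout described in Google's README: tab-separated fields (head word, n-gram, total count), with each token in the n-gram written as word/POS/dependency-label/head-index. Whether the Finnish files follow this layout exactly is an assumption to check against the paper. The function name, the `Token` type and the sample line are invented for the example and are not taken from the Parsebank data.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    word: str
    pos: str
    dep: str
    head: int  # 1-based position of the head inside the n-gram; 0 marks the root


def parse_syntactic_ngram(line: str) -> Tuple[str, List[Token], int]:
    """Split one line into (head word, tokens, total count).

    Assumed layout: head_word <TAB> ngram <TAB> total_count [<TAB> ...]
    Any further fields (per-year counts in Google's data) are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    head_word, ngram, total = fields[0], fields[1], int(fields[2])
    tokens = []
    for item in ngram.split(" "):
        # Split from the right, because the word itself may contain "/".
        word, pos, dep, head = item.rsplit("/", 3)
        tokens.append(Token(word, pos, dep, int(head)))
    return head_word, tokens, total


# Invented sample line, for illustration only.
sample = "koira\tiso/A/amod/2 koira/N/ROOT/0\t42"
head_word, tokens, total = parse_syntactic_ngram(sample)
for tok in tokens:
    print(tok.word, tok.pos, tok.dep, tok.head)
```

Splitting each token from the right keeps words such as "1/2" intact, since only the last three slashes are treated as separators.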
Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled to 1 MB/s per connection. Please do not bypass this limit by using multiple connections.
We have also used the data to induce distributional vector space representations of the lexicon with the word2vec software. These models are available here.
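To give an idea of how such vector models are typically used, the following Python sketch ranks words by cosine similarity, the measure commonly used for finding semantically similar words in a word2vec space. It is a minimal illustration, not the project's own code: the three-dimensional vectors are made-up toy values, and `cosine` and `most_similar` are hypothetical helper names. With the real models, the vectors would first have to be read from the downloaded files with a suitable library.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word: str,
                 vectors: Dict[str, Sequence[float]],
                 topn: int = 2) -> List[Tuple[str, float]]:
    """Return the topn other words, ordered by similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors invented for the example; real models have hundreds of dimensions.
toy_vectors = {
    "koira": [0.9, 0.1, 0.0],  # "dog"
    "kissa": [0.8, 0.2, 0.1],  # "cat"
    "auto": [0.0, 0.1, 0.9],   # "car"
}

neighbours = most_similar("koira", toy_vectors)
for other, score in neighbours:
    print(other, round(score, 3))
```

In this toy space "kissa" ranks above "auto" as a neighbour of "koira", because its vector points in nearly the same direction.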
The project is funded by the Kone Foundation as part of its Language Programme (2014-2016).
N-gram collections and vector space models are released under the Creative Commons Attribution-ShareAlike 4.0 International License.