The Finnish Internet Parsebank is a joint project with the School of Languages and Translation Studies. It aims to produce a mass-scale corpus of Internet Finnish using automatic syntactic analysis and document classification.
The project has three aims:
The Parsebank data can be queried online in several ways:
An online interface based on NoSketchEngine is available here. Login name: guest. Password: voikukka.
The syntactic trees can be queried using the SETS system here. The search currently covers two samples of the Parsebank data: one of 1 million trees (Fi-Parsebank-1M) and one of 50 million trees (Fi-Parsebank-50M).
An online demo lets you query semantically similar words using a word2vec model trained on the Parsebank data.
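To illustrate what "semantically similar words" means in a vector space model, here is a minimal sketch of a nearest-neighbour lookup by cosine similarity. It is not the code behind the demo: the function names and the tiny three-dimensional vectors are invented for illustration, whereas a real word2vec model has hundreds of dimensions learned from the corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Return up to `topn` (word, similarity) pairs, most similar first."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy vectors (assumption: made up, not taken from the real model).
TOY_VECTORS = {
    "kissa": [0.9, 0.1, 0.0],  # cat
    "koira": [0.8, 0.2, 0.1],  # dog
    "auto": [0.0, 0.9, 0.4],   # car
    "bussi": [0.1, 0.8, 0.5],  # bus
}

if __name__ == "__main__":
    for neighbour, score in most_similar("kissa", TOY_VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

With these toy vectors, "koira" ranks as the closest neighbour of "kissa" because the two vectors point in nearly the same direction.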
Several resources based on the Parsebank data can be downloaded for offline use:
The raw Parsebank data cannot be distributed as-is due to limitations in Finnish copyright law. For research purposes, we can provide the Parsebank with randomly shuffled sentences, i.e. without the document structure. This data is still useful for a large number of today's NLP tasks. You can request access by emailing Filip Ginter and Veronika Laippala: firstname.lastname@example.org and email@example.com.
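The idea of distributing sentences without document structure can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual pipeline: the function name `shuffle_sentences` and the representation of a document as a list of sentence strings are hypothetical.

```python
import random


def shuffle_sentences(documents, seed=None):
    """Pool the sentences of all documents and return them in random order.

    `documents` is a list of documents, each a list of sentence strings.
    The result keeps every sentence but no longer records which document
    a sentence came from or what order the sentences appeared in.
    """
    sentences = [sentence for doc in documents for sentence in doc]
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(sentences)
    return sentences


if __name__ == "__main__":
    docs = [
        ["Sentence A1.", "Sentence A2."],
        ["Sentence B1.", "Sentence B2.", "Sentence B3."],
    ]
    for sentence in shuffle_sentences(docs, seed=0):
        print(sentence)
```

Sentence-level tasks such as parsing or training word vectors are unaffected by this step, while anything that needs cross-sentence context is not possible on the shuffled data.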
We have generated both flat and syntactic n-gram collections from the first version of the Finnish Internet Parsebank (1.5 billion tokens). The data can be downloaded here.
Hosting for this data is generously provided by the EVEX project at the University of Turku. Download speeds are throttled to 1 MB/s per connection. Please do not bypass this limit by using multiple connections.
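The difference between the flat and syntactic n-grams mentioned above can be shown with a small sketch. Flat n-grams are runs of adjacent tokens, while syntactic n-grams follow the arcs of the dependency tree. The code below covers only the simplest syntactic case, a single head-dependent arc. The function names, the example sentence, and the relation labels are assumptions made for illustration, and the released files may use a different format and label set.

```python
def flat_ngrams(tokens, n):
    """All runs of n adjacent tokens, in sentence order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def syntactic_bigrams(tokens, heads, deprels):
    """One (head word, relation, dependent word) triple per dependency arc.

    heads[i] is the 1-based position of the head of token i, or 0 if
    token i is the root of the tree.
    """
    arcs = []
    for i, (head, rel) in enumerate(zip(heads, deprels)):
        if head == 0:
            continue  # the root has no head word
        arcs.append((tokens[head - 1], rel, tokens[i]))
    return arcs


if __name__ == "__main__":
    # "Koira näki pienen kissan" (The dog saw a small cat), analysed by hand.
    tokens = ["Koira", "näki", "pienen", "kissan"]
    heads = [2, 0, 4, 2]
    deprels = ["nsubj", "root", "amod", "obj"]

    print(flat_ngrams(tokens, 2))
    print(syntactic_bigrams(tokens, heads, deprels))
```

In this example the verb "näki" and its object "kissan" form a syntactic bigram even though they are not adjacent, so the pair never appears among the flat bigrams.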
The project is funded by the Kone Foundation as part of its Language Programme (2014-2016).
N-gram collections and vector space models are released under the Creative Commons Attribution-ShareAlike 4.0 International License.