Query language

This page documents the search expression language used to query the online versions of the Turku Dependency Treebank, the treebanks in the Universal Dependencies collections and the Finnish Internet Parsebank (search online), as well as any other treebank indexed using the dependency tree search by the University of Turku (download sources).

The query language is loosely inspired by TGrep2 and TRegex, but is specifically designed for querying general dependency graphs rather than constituency trees. In particular, the underlying search engine handles non-tree structures, including directed cycles. The language also supports queries over rich morphological tagsets. The basic target of a query is a word, with optional restrictions on its dependents and governors, which can in turn be restricted recursively.

All expression examples below are links that search through either the English or the Finnish treebank in the Universal Dependencies collection.

Token specification

Tokens with a particular word form are searched by typing the token text as-is. If the searched text conflicts with a known morphological tag, the text is interpreted to mean the tag. To search for the actual text instead, the text must be written in quotation marks:
  • "Person" searches for the literal text Person and not the tag Person
The base form (lemma) is given with the L= prefix.
  • L=olla searches for all forms of the Finnish verb olla
POS and other morphological tags can be specified by writing the tags as-is. If the tags are of the form Category=Tag, only the tag part needs to be written. However, if multiple categories have the same tag value, the tag is mapped to the most frequent category. To search the other possible categories, the category name must be specified as well.
  • NOUN searches for all tokens with the POS tag NOUN
  • Par searches for all tokens in partitive case (Note: Par is interpreted to mean Case=Par)
  • VerbForm=Inf searches for all infinitives
  • Past searches for all past tense verbs (Note: Past is interpreted to mean Tense=Past. The other possible category for Past is PartForm; to search for past participles, PartForm=Past must be typed.)
Whole categories can also be searched. This is done by typing just the plain category name, in the same way as the tag values are used.
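The mapping of a bare tag to a category can be illustrated with a short Python sketch. This is only a model of the rule described above, not the search engine's code: the function name resolve_tag and the counts are hypothetical, and "most frequent" is assumed here to mean the Category=Tag pair seen most often in the corpus.

```python
from collections import Counter

# Hypothetical corpus counts of (category, tag) pairs; the numbers are made up.
tag_counts = Counter({
    ("Tense", "Past"): 5000,
    ("PartForm", "Past"): 1200,
    ("Case", "Par"): 8000,
})

def resolve_tag(spec, counts):
    """Return the (category, tag) pair a tag specification is taken to mean."""
    if "=" in spec:
        # An explicit Category=Tag is used as given.
        category, tag = spec.split("=", 1)
        return (category, tag)
    # A bare tag: collect every category that has this tag value ...
    candidates = {key: n for key, n in counts.items() if key[1] == spec}
    if not candidates:
        return None
    # ... and pick the most frequent one (assumed interpretation).
    return max(candidates, key=candidates.get)

past = resolve_tag("Past", tag_counts)
past_participle = resolve_tag("PartForm=Past", tag_counts)
```

With these made-up counts, the bare tag Past resolves to the Tense category, while PartForm=Past keeps the category that was written out.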
  • PartForm searches for all participles: present (PartForm=Pres), past (PartForm=Past), agentive (PartForm=Agt) and negative (PartForm=Neg)
The full set of categories and tags used in any supported corpus can be found under the Show types link on the main page (see e.g. English and Czech).

It is also possible to combine all of the above token specifications with the AND (&) and OR (|) operators. Word forms, lemmas and tags can also be negated by typing the negation operator ! before the property.
  • can&!AUX searches for the word can when it is not an auxiliary
  • !Tra searches for words which are not in translative case
  • voi&!L=voida searches for the word voi when the lemma is not voida
A token can be left unspecified by typing an underscore character (_).
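The token restrictions above can be thought of as predicates over tokens. The following Python sketch models that idea; it is not the search engine's implementation. The Token class, the helper names and the exact, case-sensitive string comparison are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    tags: frozenset  # e.g. frozenset({"AUX", "VerbForm=Fin"})

# Each helper returns a predicate: a function from Token to bool.
def has_form(text):
    return lambda tok: tok.form == text

def has_lemma(lemma):          # models L=lemma
    return lambda tok: tok.lemma == lemma

def has_tag(tag):              # models a POS or Category=Tag restriction
    return lambda tok: tag in tok.tags

def NOT(pred):                 # models !
    return lambda tok: not pred(tok)

def AND(*preds):               # models &
    return lambda tok: all(p(tok) for p in preds)

def OR(*preds):                # models |
    return lambda tok: any(p(tok) for p in preds)

def ANY(tok):                  # models the underscore _
    return True

tokens = [
    Token("can", "can", frozenset({"AUX"})),
    Token("can", "can", frozenset({"NOUN"})),
    Token("walk", "walk", frozenset({"VERB"})),
]

# Model of the query can&!AUX
can_not_aux = AND(has_form("can"), NOT(has_tag("AUX")))
matches = [tok for tok in tokens if can_not_aux(tok)]
```

Here matches contains only the second token, the can that is tagged NOUN; ANY accepts every token, as the underscore does.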

Dependency specification

Dependencies are expressed using the < and > operators. These operators mimic "arrows" in the dependency graph.
  • token1 < token2: token1 is governed by token2
  • token1 > token2: token1 governs token2
The underscore character _ stands for any token, that is, a token on which we place no particular restrictions. Here are simple examples of basic search expressions that restrict dependency structures:

  • walk < _ searches for all cases of walk which are governed by some word
  • walk > _ searches for all cases of walk which govern a word
  • _ < walk searches for any token governed by walk
Note that the left-most token in the expression is always the target of the search and is the token identified in the search results. While the queries _ <nsubj _ and _ >nsubj _ match the exact same graphs, the returned tokens differ.
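The difference between the two operators can be sketched on a toy graph in Python. The edge representation (governor, dependent, type) and the function names are assumptions for illustration only, not part of the search tool.

```python
# Toy sentence "She walks home"; token positions are list indices.
words = ["She", "walks", "home"]

# Hypothetical analysis: each edge is (governor, dependent, dependency type).
edges = [(1, 0, "nsubj"), (1, 2, "advmod")]

def governed(edges, deptype=None):
    """Targets of `_ <type _`: tokens that are governed by some token."""
    return sorted({d for g, d, t in edges if deptype is None or t == deptype})

def governing(edges, deptype=None):
    """Targets of `_ >type _`: tokens that govern some token."""
    return sorted({g for g, d, t in edges if deptype is None or t == deptype})

subjects = [words[i] for i in governed(edges, "nsubj")]        # _ <nsubj _
subject_heads = [words[i] for i in governing(edges, "nsubj")]  # _ >nsubj _
```

Both queries are satisfied by the same nsubj edge, but the first returns the dependent (She) and the second returns the governor (walks).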

The dependency type can be specified by typing it right after the dependency operator: _ <type _ or _ >type _. The | character denotes a logical or, so any of the given dependency relations will match.
  • _ <cop _ searches for all copula verbs (tokens governed through a cop dependency)
  • _ >nsubj:cop _ searches for all words serving as a predicative in copula structures (govern a copula subject)
  • _ <nsubj|<nsubj:cop _ searches for all words serving as a subject - both standard and copula subject
You can specify a number of dependency restrictions at a time by chaining the operators. Priority is marked using parentheses:

  • _ >nmod _ >nmod _ searches for words that govern two distinct nominal modifiers (two nmod dependencies in parallel)
  • _ >nmod (_ >nmod _) searches for words that govern a nominal modifier which, in turn, governs another nominal modifier (a chain of two nmod dependencies)
  • NOUN >amod (_ >amod|>acl _) searches for nouns that govern an adjectival modifier, where the adjectival modifier itself governs either another adjectival modifier or a participial modifier
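The contrast between parallel and nested restrictions can be sketched in Python as follows. The edge format (governor, dependent, type) and the function names are hypothetical and serve only to illustrate how the two query shapes differ.

```python
def dependents(edges, governor, deptype):
    """Dependents attached to `governor` through a dependency of `deptype`."""
    return [d for g, d, t in edges if g == governor and t == deptype]

def parallel_nmod(edges, tok):
    """Model of `_ >nmod _ >nmod _`: two distinct nmod dependents."""
    return len(set(dependents(edges, tok, "nmod"))) >= 2

def chained_nmod(edges, tok):
    """Model of `_ >nmod (_ >nmod _)`: an nmod dependent with its own nmod."""
    return any(dependents(edges, d, "nmod")
               for d in dependents(edges, tok, "nmod"))

# Token 0 governs tokens 1 and 2 directly.
parallel_graph = [(0, 1, "nmod"), (0, 2, "nmod")]
# Token 0 governs token 1, which governs token 2.
chain_graph = [(0, 1, "nmod"), (1, 2, "nmod")]

results = {
    "parallel query, parallel graph": parallel_nmod(parallel_graph, 0),
    "chained query, parallel graph": chained_nmod(parallel_graph, 0),
    "parallel query, chain graph": parallel_nmod(chain_graph, 0),
    "chained query, chain graph": chained_nmod(chain_graph, 0),
}
```

Each query shape matches only the graph with the corresponding structure, which is why the parentheses change the meaning of the expression.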
Negation is marked using the negation operator !. Currently, you can negate the < and > operators as well as dependency types as follows:
  • _ >nsubj:cop _ !>cop _ searches for all copula predicatives (governors of nsubj:cop dependents) that do not have a copula verb (do not govern a cop dependent)
  • _ <advcl _ !>mark _ searches for heads of unmarked adverbial clauses (governed by advcl but not governing mark)
  • _ <nsubj _ !(>amod|>acl) _ searches for subjects which do not govern adjectival or participial modifiers
  • _ <nsubj _ >!amod _ searches for subjects which govern something that is not an adjectival modifier (governed by nsubj and governing something through a dependency other than amod)
Note that in _ !>amod _ it is accepted that the token does not have any dependent, whereas in _ >!amod _ the token must have at least one dependent which is not amod.
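The two placements of the negation operator can be compared in a small Python sketch. The edge format (governor, dependent, type) and the function names are assumptions for illustration only.

```python
def not_governs_amod(edges, tok):
    """Model of `_ !>amod _`: the token has no amod dependent at all."""
    return not any(g == tok and t == "amod" for g, d, t in edges)

def governs_non_amod(edges, tok):
    """Model of `_ >!amod _`: the token has a dependent that is not amod."""
    return any(g == tok and t != "amod" for g, d, t in edges)

# Four toy situations for token 0, as (governor, dependent, type) edges.
no_dependents = []
only_amod = [(0, 1, "amod")]
amod_and_det = [(0, 1, "amod"), (0, 2, "det")]
only_det = [(0, 1, "det")]

table = {
    name: (not_governs_amod(edges, 0), governs_non_amod(edges, 0))
    for name, edges in [
        ("no dependents", no_dependents),
        ("only amod", only_amod),
        ("amod and det", amod_and_det),
        ("only det", only_det),
    ]
}
```

In table, a token with no dependents satisfies only the first form, and a token with both an amod and a det dependent satisfies only the second.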

The direction of a dependency relation can be specified with the operators @R and @L, which require the right-most token of the expression to be to the right or to the left of the other token in the sentence, respectively.
  • VERB >nsubj@R _ searches for verbs which have an nsubj dependent to the right
  • _ >amod@L _ >amod@R _ searches for words that have two distinct adjectival modifiers (two amod dependencies in parallel), one must be at the left side, the other at the right side
  • _ <case@R _ searches for case markers where the governor token is at the right side, i.e. prepositions (as compared to postpositions)
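One way to read the direction operators is as a comparison of token positions in the sentence. The Python sketch below follows that reading; the function name and the use of 0-based positions are assumptions, not part of the query language.

```python
def direction_ok(first_pos, second_pos, direction):
    """Check @R / @L for an expression `first OP@direction second`.

    `second_pos` is the position of the right-most token of the expression,
    `first_pos` the position of the token it is related to.
    """
    if direction == "R":
        return second_pos > first_pos
    if direction == "L":
        return second_pos < first_pos
    raise ValueError("direction must be 'R' or 'L'")

# "in the house": the case marker "in" (0) is governed by "house" (2).
# Model of `_ <case@R _`: the governor lies to the right, so this is a preposition.
preposition = direction_ok(first_pos=0, second_pos=2, direction="R")

# A postposition follows its governor: case marker at 1, governor at 0.
postposition = direction_ok(first_pos=1, second_pos=0, direction="R")
```

Under this reading, preposition is True and postposition is False, which matches the description of _ <case@R _ as a search for prepositions.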

Combining queries

Several queries can be combined with the + operator. A query of the form query1 + query2 + query3 returns all trees which independently satisfy all three queries.
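The effect of + can be sketched in Python as a conjunction over whole trees. Everything here is a simplified assumption: trees are lists of (governor, dependent, type) edges, and each toy query only checks whether some dependency of a given type exists.

```python
def has_dep(deptype):
    """Toy query: the tree contains at least one dependency of this type."""
    return lambda tree: any(t == deptype for _, _, t in tree)

def satisfies_all(tree, queries):
    """Model of `query1 + query2 + ...`: every query must match the tree.

    The queries are checked independently, so they may be satisfied by
    different tokens of the same tree.
    """
    return all(query(tree) for query in queries)

trees = {
    "tree_a": [(1, 0, "nsubj"), (1, 2, "obj")],
    "tree_b": [(1, 0, "nsubj")],
}

# Roughly models: _ >nsubj _ + _ >obj _
queries = [has_dep("nsubj"), has_dep("obj")]
hits = [name for name, tree in trees.items() if satisfies_all(tree, queries)]
```

Only tree_a is returned, because tree_b satisfies the first query but not the second.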

Universal quantification

An expression of the form _ -> NOUN means "every token (_) must be a NOUN", or in other words, being a token implies being a NOUN. This is a form of universal quantification. Full tree specifications are allowed on both the left and the right side of the expression, so for example NOUN -> NOUN <acl:relcl _ means "all nouns are governed by acl:relcl".
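The implication reading of -> can be written down in a few lines of Python. This is a sketch under assumptions: tokens are plain dictionaries, the quantification is taken to range over the tokens of one tree, and a tree with no token matching the left side counts as satisfying the expression (the usual logical convention, which the search tool may or may not follow).

```python
def holds_for_all(tokens, left, right):
    """Model of `left -> right`: every token matching left also matches right."""
    return all(right(tok) for tok in tokens if left(tok))

def is_noun(tok):
    return tok["pos"] == "NOUN"

def noun_under_relcl(tok):
    # Models `NOUN <acl:relcl _`: a noun governed through acl:relcl.
    return tok["pos"] == "NOUN" and "acl:relcl" in tok["incoming"]

tree_ok = [
    {"form": "dog", "pos": "NOUN", "incoming": {"acl:relcl"}},
    {"form": "runs", "pos": "VERB", "incoming": set()},
]
tree_bad = [
    {"form": "dog", "pos": "NOUN", "incoming": {"acl:relcl"}},
    {"form": "cat", "pos": "NOUN", "incoming": {"nsubj"}},
]
tree_no_nouns = [
    {"form": "runs", "pos": "VERB", "incoming": set()},
]

# Model of: NOUN -> NOUN <acl:relcl _
checks = [holds_for_all(t, is_noun, noun_under_relcl)
          for t in (tree_ok, tree_bad, tree_no_nouns)]
```

The first tree passes, the second fails because of the noun governed through nsubj, and the third passes vacuously under the stated assumption.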

Examples

Here we give some additional examples, all of which are "clickable" and search for the given expression in the currently released version of the general Finnish treebank.