MySQL Format of the 2010 EVEX dataset

On this page, we explain the MySQL structure of the EVEX database in detail. A first set of tables represent the text-bound data as previously distributed in the Shared Task format. Secondly, we provide 3 distinct ways of generalizing the events and their underlying network. First, we have determined canonical strings for each gene/gene product (ggp) symbol automatically extracted by BANNER, by applying affix-stripping rules and accounting for lexical variation. Then, we have defined two ways of establishing gene families (Ensembl homologous clusters and HomoloGene families) and we have linked the corresponding canonical symbols to these families, further generalizing the data and enabling homology-based event retrieval.

For more background information, we refer to the BioNLP'11 publication. Example MySQL queries can be found at the bottom of this page.

A. Event occurrences

This set of tables stores the tagged entities and events straight from the PubMed abstracts. It is a faithful representation of the content of the .a1 and .a2 files in the Shared Task format of the data.

B. Event generalizations

This set of tables stores the generalized events, i.e., abstraction from event occurrences. There are three different manners in which we generalize the events, corresponding to three different ways in which equality of ggp symbols is established: Ensembl homologous clusters, HomoloGene families, and simple canonical form equivalence. Each of these three manners has one set of the tables below. Thus the database contains tables ensembl_event, ensembl_eventargument_event,… and homologene_event, homologene_eventargument_events,… etc.

  • event: one row per generalized event
  • eventargument_ggp: one row per argument of an event, with the argument itself being a ggp (referred to by canonical symbol or by gene family ID, depending on the level of abstraction)
  • eventargument_event: one row per argument which is recursively an event
  • event_occmap: links generalized events with event occurrences
  • deriv_event_stats: provides aggregate statistics for a generalized event
  • deriv_event_ggp_depth_and_path: lists for each generalized event all GGPs that are, even recursively, involved in the event

The following two tables are only relevant for generalized events built on top of the canonical forms of gene symbols (thus not using gene families for further generalization)

Finally, both the Ensembl and the HomoloGene based-generalization link to gene family ID's that can be linked to specific Entrez Gene identifiers and a corresponding set of non-ambiguous canonical forms.

MySQL scheme

More details on each field of each table are given here.

Example queries

Retrieve all Binding events and their arguments as in the original a1/a2 release for PubMed abstract ID 14985235.

 oee.a2id as event_a2,
 oet.a2id as trigger_a2,
 oet.string as trigger_word,
 og.a1id as arg_a1,
 oeg.role as arg_role
 occurrence_event_trigger oet,
 occurrence_ggp og,
 occurrence_eventargument_ggp oeg
 (SELECT * FROM occurrence_event oe WHERE oe.pmid='14985235' AND oe.type='Binding') AS oee
 ON oeg.occurrence_event_id =
WHERE = oee.occurrence_event_trigger_id
 oeg.occurrence_ggp_id =
ORDER BY event_a2 ASC;

Retrieve all original full GGP symbols extracted by BANNER that were canonicalized to the same canonical form as "Esr-1"

 occurrence_ggp og,
  canonical_ggp cg,
  canonical_ggp_occmap cgo
 WHERE cg.canonical_form = "esr1"
 AND = cgo.canonical_ggp_id)
 as canforms
 canforms.occurrence_ggp_id =;

Retrieve all canonicalized forms linked to the same HomoloGene family as "Esr-1"

 homologene h,
 (SELECT id FROM homologene WHERE canonical_form="esr1") as i

Retrieve all HomoloGene-based events that involve Esr-1 (or one of its synonyms) as a cause

SELECT as homologene_event_id
 homologene_event he
JOIN homologene_eventargument_ggp heg on heg.homologene_event_id =
 (SELECT FROM homologene WHERE canonical_form='esr1')

Retrieve all canonical gene symbols that are a Theme in an event that has Esr-1 (or one of its synonyms) as a Cause, plus the event type, relying on the HomoloGene-based generalization

SELECT as homologene_id,
 homologene_event he
JOIN (homologene_eventargument_ggp heg1, homologene_eventargument_ggp heg2, homologene h)
ON (heg1.homologene_event_id = AND AND
 heg1.homologene_id=(select from homologene where canonical_form='esr1')
GROUP BY homologene_id, type;