MySQL Format of the 2010 EVEX dataset

On this page, we explain the MySQL structure of the EVEX database in detail. A first set of tables represent the text-bound data as previously distributed in the Shared Task format. Secondly, we provide 3 distinct ways of generalizing the events and their underlying network. First, we have determined canonical strings for each gene/gene product (ggp) symbol automatically extracted by BANNER, by applying affix-stripping rules and accounting for lexical variation. Then, we have defined two ways of establishing gene families (Ensembl homologous clusters and HomoloGene families) and we have linked the corresponding canonical symbols to these families, further generalizing the data and enabling homology-based event retrieval.

For more background information, we refer to the BioNLP'11 publication. Example MySQL queries can be found at the bottom of this page.

A. Event occurrences

This set of tables stores the tagged entities and events straight from the PubMed abstracts. It is a faithful representation of the content of the .a1 and .a2 files in the Shared Task format of the data.

B. Event generalizations

This set of tables stores the generalized events, i.e., abstraction from event occurrences. There are three different manners in which we generalize the events, corresponding to three different ways in which equality of ggp symbols is established: Ensembl homologous clusters, HomoloGene families, and simple canonical form equivalence. Each of these three manners has one set of the tables below. Thus the database contains tables ensembl_event, ensembl_eventargument_event,… and homologene_event, homologene_eventargument_events,… etc.

  • event: one row per generalized event
  • eventargument_ggp: one row per argument of an event, with the argument itself being a ggp (referred to by canonical symbol or by gene family ID, depending on the level of abstraction)
  • eventargument_event: one row per argument which is recursively an event
  • event_occmap: links generalized events with event occurrences
  • deriv_event_stats: provides aggregate statistics for a generalized event
  • deriv_event_ggp_depth_and_path: lists for each generalized event all GGPs that are, even recursively, involved in the event

The following two tables are only relevant for generalized events built on top of the canonical forms of gene symbols (thus not using gene families for further generalization)

Finally, both the Ensembl and the HomoloGene based-generalization link to gene family ID's that can be linked to specific Entrez Gene identifiers and a corresponding set of non-ambiguous canonical forms.

MySQL scheme

More details on each field of each table are given here.

Example queries

Retrieve all Binding events and their arguments as in the original a1/a2 release for PubMed abstract ID 14985235.

SELECT
 oee.a2id as event_a2,
 oet.a2id as trigger_a2,
 oet.string as trigger_word,
 og.a1id as arg_a1,
 oeg.role as arg_role
FROM
 occurrence_event_trigger oet,
 occurrence_ggp og,
 occurrence_eventargument_ggp oeg
RIGHT JOIN
 (SELECT * FROM occurrence_event oe WHERE oe.pmid='14985235' AND oe.type='Binding') AS oee
 ON oeg.occurrence_event_id = oee.id
WHERE
 oet.id = oee.occurrence_event_trigger_id
AND
 oeg.occurrence_ggp_id = og.id
ORDER BY event_a2 ASC;

Retrieve all original full GGP symbols extracted by BANNER that were canonicalized to the same canonical form as "Esr-1"

SELECT
 distinct(string)
FROM
 occurrence_ggp og,
 (SELECT
  occurrence_ggp_id
 FROM
  canonical_ggp cg,
  canonical_ggp_occmap cgo
 WHERE cg.canonical_form = "esr1"
 AND cg.id = cgo.canonical_ggp_id)
 as canforms
WHERE
 canforms.occurrence_ggp_id = og.id;

Retrieve all canonicalized forms linked to the same HomoloGene family as "Esr-1"

SELECT
 canonical_form
FROM
 homologene h,
 (SELECT id FROM homologene WHERE canonical_form="esr1") as i
WHERE
 h.id = i.id;

Retrieve all HomoloGene-based events that involve Esr-1 (or one of its synonyms) as a cause

SELECT
 he.id as homologene_event_id
FROM
 homologene_event he
JOIN homologene_eventargument_ggp heg on heg.homologene_event_id = he.id
WHERE
 heg.homologene_id=
 (SELECT homologene.id FROM homologene WHERE canonical_form='esr1')
AND
 heg.role="Cause";

Retrieve all canonical gene symbols that are a Theme in an event that has Esr-1 (or one of its synonyms) as a Cause, plus the event type, relying on the HomoloGene-based generalization

SELECT
 h.id as homologene_id,
 he.type
FROM
 homologene_event he
JOIN (homologene_eventargument_ggp heg1, homologene_eventargument_ggp heg2, homologene h)
ON (heg1.homologene_event_id = he.id AND heg2.homologene_event_id=he.id AND h.id=heg2.homologene_id)
WHERE
 heg1.homologene_id=(select homologene.id from homologene where canonical_form='esr1')
AND
 heg1.role="Cause"
AND
 heg2.role="Theme"
GROUP BY homologene_id, type;