- Finnish NLP
- Clinical NLP
On this page, we explain the MySQL structure of the EVEX database in detail. A first set of tables represent the text-bound data as previously distributed in the Shared Task format. Secondly, we provide 3 distinct ways of generalizing the events and their underlying network. First, we have determined canonical strings for each gene/gene product (ggp) symbol automatically extracted by BANNER, by applying affix-stripping rules and accounting for lexical variation. Then, we have defined two ways of establishing gene families (Ensembl homologous clusters and HomoloGene families) and we have linked the corresponding canonical symbols to these families, further generalizing the data and enabling homology-based event retrieval.
For more background information, we refer to the BioNLP'11 publication. Example MySQL queries can be found at the bottom of this page.
This set of tables stores the tagged entities and events straight from the PubMed abstracts. It is a faithful representation of the content of the .a1 and .a2 files in the Shared Task format of the data.
This set of tables stores the generalized events, i.e., abstraction from event occurrences. There are three different manners in which we generalize the events, corresponding to three different ways in which equality of ggp symbols is established: Ensembl homologous clusters, HomoloGene families, and simple canonical form equivalence. Each of these three manners has one set of the tables below. Thus the database contains tables ensembl_event, ensembl_eventargument_event,… and homologene_event, homologene_eventargument_events,… etc.
The following two tables are only relevant for generalized events built on top of the canonical forms of gene symbols (thus not using gene families for further generalization)
Finally, both the Ensembl and the HomoloGene based-generalization link to gene family ID's that can be linked to specific Entrez Gene identifiers and a corresponding set of non-ambiguous canonical forms.
More details on each field of each table are given here.
Retrieve all Binding events and their arguments as in the original a1/a2 release for PubMed abstract ID 14985235.
oee.a2id as event_a2,
oet.a2id as trigger_a2,
oet.string as trigger_word,
og.a1id as arg_a1,
oeg.role as arg_role
(SELECT * FROM occurrence_event oe WHERE oe.pmid='14985235' AND oe.type='Binding') AS oee
ON oeg.occurrence_event_id = oee.id
oet.id = oee.occurrence_event_trigger_id
oeg.occurrence_ggp_id = og.id
ORDER BY event_a2 ASC;
Retrieve all original full GGP symbols extracted by BANNER that were canonicalized to the same canonical form as "Esr-1"
WHERE cg.canonical_form = "esr1"
AND cg.id = cgo.canonical_ggp_id)
canforms.occurrence_ggp_id = og.id;
Retrieve all canonicalized forms linked to the same HomoloGene family as "Esr-1"
(SELECT id FROM homologene WHERE canonical_form="esr1") as i
h.id = i.id;
Retrieve all HomoloGene-based events that involve Esr-1 (or one of its synonyms) as a cause
he.id as homologene_event_id
JOIN homologene_eventargument_ggp heg on heg.homologene_event_id = he.id
(SELECT homologene.id FROM homologene WHERE canonical_form='esr1')
Retrieve all canonical gene symbols that are a Theme in an event that has Esr-1 (or one of its synonyms) as a Cause, plus the event type, relying on the HomoloGene-based generalization
h.id as homologene_id,
JOIN (homologene_eventargument_ggp heg1, homologene_eventargument_ggp heg2, homologene h)
ON (heg1.homologene_event_id = he.id AND heg2.homologene_event_id=he.id AND h.id=heg2.homologene_id)
heg1.homologene_id=(select homologene.id from homologene where canonical_form='esr1')
GROUP BY homologene_id, type;