2011 release of the EVEX dataset

This release builds on the data from the 2010 release (PubMed abstracts up to 2009) and adds several features. First, the data is distributed as a MySQL database, which allows easy and fast access. Second, three event generalizations are added: one based on canonical gene symbols, and two based on gene families as defined by either Ensembl or HomoloGene. As a result, 11.2 million of the original biomolecular events (58%) could be linked to well-defined gene families. The canonicalization algorithm extends the methods used to produce the XML format of the 2010 release, further reducing the original set of 19.2 million event occurrences to 3.2 million unique ones.
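To make the idea of generalization concrete, the following is a minimal Python sketch of how event occurrences can collapse into fewer unique events once gene mentions are mapped to a canonical symbol or to a gene family. It is an illustration only and not the EVEX algorithm: the canonicalization rule, the family identifiers, and the toy events are all hypothetical, and nothing here reflects the actual schema of the MySQL database.

```python
from collections import Counter

# Hypothetical mapping from canonical symbol to gene family.
# In EVEX the families come from Ensembl or HomoloGene; these IDs are made up.
SYMBOL_TO_FAMILY = {
    "esr1": "family_1",
    "esr2": "family_1",
    "tp53": "family_2",
}


def canonical_symbol(mention):
    """Toy canonicalization (an assumption): lowercase, keep alphanumerics only."""
    return "".join(ch for ch in mention.lower() if ch.isalnum())


def gene_family(mention):
    """Map a mention to its (hypothetical) family, or None if it is unknown."""
    return SYMBOL_TO_FAMILY.get(canonical_symbol(mention))


def generalize(occurrences, key):
    """Group event occurrences by event type and generalized arguments.

    Returns a Counter mapping each unique generalized event to the number
    of original occurrences that collapse onto it.
    """
    counts = Counter()
    for event_type, args in occurrences:
        counts[(event_type, tuple(key(arg) for arg in args))] += 1
    return counts


# Toy event occurrences: (event type, argument mentions as found in text).
occurrences = [
    ("Binding", ("ESR1", "TP53")),
    ("Binding", ("Esr-1", "tp53")),
    ("Binding", ("ESR2", "TP53")),
    ("Regulation", ("TP53", "ESR1")),
]

by_symbol = generalize(occurrences, canonical_symbol)
by_family = generalize(occurrences, gene_family)

print(len(occurrences), "occurrences")
print(len(by_symbol), "unique events at the canonical symbol level")
print(len(by_family), "unique events at the gene family level")
```

In this toy example the two spelling variants of the same gene merge at the symbol level, and the two related genes merge at the family level, so each coarser generalization yields fewer unique events.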


We strongly recommend subscribing to our low-traffic, announcement-only mailing list, where we announce major releases and updates of the EVEX dataset. To subscribe, simply send an empty email to evexdb+subscribe@googlegroups.com.

Proceed to the download page of the 2011 release of the data.


The extracted events are licensed under the Creative Commons Attribution-ShareAlike license, which allows free use of the data.
PubMed information contained in the event output, such as the abstract texts, is covered by the National Library of Medicine's terms.


Van Landeghem S, Ginter F, Van de Peer Y, Salakoski T (2011)
EVEX: A PubMed-scale resource for homology-based generalization of text mining predictions.
Proceedings of BioNLP 2011, Portland, Oregon. pp 28-37. ACL.