The primary data format of the 2010 release of the EVEX dataset is the BioNLP'09 Shared Task format. For a definition of this format, we refer to the official website of the Shared Task. Essentially, the .a1 and .a2 files are tab-delimited text files that define the entities and events extracted from the text, located by their character offsets in the original abstracts. In addition to the .a1 and .a2 files, the release contains .scores files, which give the confidence scores of the event trigger and argument predictions. Each .scores file is identical to its corresponding .a2 file except for the third column, which holds the trigger/argument scores for each class. Note that overlapping triggers/arguments can be predicted through merged classes, in which two class names are joined with "---".
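As a rough illustration of the layout described above, the following Python sketch reads one entity line of an .a1 file and splits a merged class label. It assumes the standard BioNLP'09 entity line layout (id, TAB, type and start/end offsets separated by spaces, TAB, covered text) with an exclusive end offset. The names `Entity`, `parse_a1_line` and `split_merged_class` are hypothetical helpers, not part of the release, and the sample line and class labels are made up for the example.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """One entity line from an .a1 file (hypothetical container)."""
    id: str      # T-id, e.g. "T1"
    type: str    # entity class, e.g. "Protein"
    start: int   # character offset of the first character
    end: int     # character offset one past the last character (assumed)
    text: str    # covered text as stored in the file


def parse_a1_line(line: str) -> Entity:
    """Parse a line of the form 'T1<TAB>Protein 0 5<TAB>STAT3'."""
    ent_id, type_and_span, text = line.rstrip("\n").split("\t")
    ent_type, start, end = type_and_span.split(" ")
    return Entity(ent_id, ent_type, int(start), int(end), text)


def split_merged_class(label: str) -> List[str]:
    """Split a merged class label on '---'; plain labels come back as a 1-item list."""
    return label.split("---")


# Made-up abstract text and entity line, for illustration only.
abstract = "STAT3 is phosphorylated."
entity = parse_a1_line("T1\tProtein 0 5\tSTAT3\n")

# The offsets index into the abstract as a string of characters.
assert abstract[entity.start:entity.end] == entity.text
```

The layout of the score column in the .scores files is not covered here; only the "---" convention for merged class names is taken from the description above.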
The parses (20M sentences in total) are released in a largely self-explanatory XML format here. They were produced with the Charniak-Johnson parser, using the improved self-trained biomedical parsing model published by David McClosky. A few notes on the XML files:
- Sentences and tokens in the XML files are located by character offsets: sentence offsets refer into the .txt files in the release, and token offsets refer into the sentence text. Note that these are character offsets into Unicode strings, as opposed to byte offsets into their UTF-8 encoding. It is thus absolutely crucial that you work with the strings at the character level, not the byte level, because some characters take two or more bytes in the UTF-8 encoding.
- Only sentences with at least one entity are parsed. A tiny fraction of these could not be parsed successfully and have either no parse or an empty parse.
- McCloskyPenn parses are the original Penn Treebank-formatted parses, produced with the parser's built-in tokenizer.
- split-McClosky parses are the McCloskyPenn parses processed with the Stanford Dependency conversion tools (collapsed + cc-processed). Subsequently, tokens that partially overlap with an entity are split in two at the entity boundary, and dummy dependencies are introduced into the dependency parse.
- id of an entity refers to its T-id in the corresponding .a1 file.
- headOffset of an entity identifies the head token of the entity, as chosen by a simple heuristic. This information is used by the event extraction system.
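The difference between character and byte offsets noted above can be seen in a small Python sketch. It assumes Python 3, where `str` is a sequence of Unicode characters, and it assumes exclusive end offsets. The document text, the offset values and the helper `char_to_byte_offset` are made up for the example and are not taken from the release.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Byte offset in the UTF-8 encoding that corresponds to a character offset."""
    return len(text[:char_offset].encode("utf-8"))


# Made-up document text; "α" occupies two bytes in UTF-8.
doc = "Background. TNF-α induces IL-6."

# Hypothetical offsets: the sentence is located in the document text,
# the token is located in the sentence text.
sent_start, sent_end = 12, 31
tok_start, tok_end = 14, 18

sentence = doc[sent_start:sent_end]   # "TNF-α induces IL-6."
token = sentence[tok_start:tok_end]   # "IL-6"

# Position of the same token relative to the whole document.
abs_start, abs_end = sent_start + tok_start, sent_start + tok_end

# Applying the character offsets to the raw bytes lands one byte too early,
# because the two-byte "α" precedes the token.
raw = doc.encode("utf-8")
wrong = raw[abs_start:abs_end].decode("utf-8")

# Converting the offsets first recovers the right span from the bytes.
right = raw[char_to_byte_offset(doc, abs_start):
            char_to_byte_offset(doc, abs_end)].decode("utf-8")

print(repr(token), repr(wrong), repr(right))
# prints: 'IL-6' ' IL-' 'IL-6'
```

In practice the simplest route is to decode each .txt file as UTF-8 once when reading it (for example with `open(path, encoding="utf-8")`) and to slice only the decoded strings.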