Data

Data Use Agreements

All notes in the Clinical TempEval corpus have been de-identified as much as possible while still leaving the dates intact. To access the source texts for the corpus, participants must agree to handle the data appropriately by signing a data use agreement with the Mayo Clinic. To request a data use agreement, please follow the instructions at the THYME website.

Please apply for a data use agreement as soon as possible! The process may take some time.

Please read the data use agreement (DUA) carefully before agreeing to it. Among other things, you will be agreeing:

  • to keep the data secure using restricted passwords and encryption (e.g., on a secure server, not on personal computers)
  • not to attempt to re-identify the data
  • not to redistribute the data to anyone else for any purpose

Note: if you have already completed the data use agreement process (e.g., for Clinical TempEval 2015), you do not need to complete it again for Clinical TempEval 2016. You already have the train.zip, dev.zip and test.zip files that you will need for the shared task.

Releases of Source Text

Once you have obtained a data use agreement, you will be given the THYME corpus, containing three zip archives: train.zip, dev.zip and test.zip. These contain the source texts for Clinical TempEval. The zip files are password protected, and the passwords are released on the Clinical TempEval Google Group. The train.zip archive contains the training data, and the dev.zip archive contains the testing data. The train.zip and dev.zip passwords have been posted to the group. The test.zip password will be released when the evaluation begins.
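As a minimal sketch of unpacking one of these archives once its password has been posted, the Python standard library can be used as shown below. The function name extract_archive, the destination directory and the password placeholder are all hypothetical. Note also that Python's zipfile module only decrypts the legacy zip encryption scheme; which scheme the THYME archives use is not stated here, so a dedicated unzip tool may be needed instead. In keeping with the data use agreement, this should only be run on a secure server.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, password, dest_dir):
    """Extract a password-protected zip archive and return its member names.

    Hypothetical helper: assumes the archive uses the legacy zip encryption
    that the standard library can decrypt (AES-encrypted zips are not supported).
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # zipfile expects the password as bytes, not str
        archive.extractall(path=dest, pwd=password.encode("utf-8"))
        return archive.namelist()
```

A call might look like extract_archive("train.zip", "<password from the Google Group>", "/secure/thyme/train"), where the destination path is only an example.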

Releases of Annotations

The annotations for the Clinical TempEval data will be released on the THYME data GitHub account. The following releases are currently available:

  • Trial data: 1.1.0+coloncancer-dev. This is the training and test set from Clinical TempEval 2015.
  • Test data, EVENTs and TIMEX3s only: 1.2.0+coloncancer-test-event-time. This is the input for phase 2 of Clinical TempEval 2016, including only the EVENT and TIMEX3 annotations, and excluding the EVENT DocTimeRel annotations.
  • Test data, full: 1.3.0+coloncancer-test. This is the complete test set for Clinical TempEval 2016, including EVENT, TIMEX3, and TLINK annotations.

Corpus Annotation

The Clinical TempEval corpus has been annotated using the guidelines for the annotation of times, events and temporal relations in clinical notes, an extension of ISO TimeML developed by the THYME project. A corpus of clinical notes and pathology reports from the Mayo Clinic was annotated as follows:

  1. Annotators identified time and event expressions, along with their attributes (except normalized values for time expressions)
  2. Adjudicators revised and finalized the time and event expressions and their attributes
  3. Annotators identified temporal relations between pairs of events and between events and times
  4. Adjudicators revised and finalized the temporal relations
  5. The Pheme project annotated normalized values for each of the time expressions

Data Format

The Clinical TempEval data is in Anafora format. This means that each document in the corpus has its own directory, which contains a plain text file and an XML file. The XML file contains stand-off annotations indicating where each event, time and temporal relation has been identified in the text. For example, given the text:

The XML file will contain something like the Anafora file here.
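Because the annotations are stand-off, each one refers to the plain text by character offsets rather than wrapping the text itself. The sketch below shows one way to read such a file in Python. It assumes the general shape of Anafora XML (entity and relation elements with id, span, type and properties children, and spans written as "start,end" with ";" between discontinuous pieces). The sample sentence, ids, offsets and property values are invented for illustration and are not taken from the corpus, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample text and annotations, for illustration only.
SAMPLE_TEXT = "The surgery on March 5, 2010 went well."
SAMPLE_XML = """<data>
  <annotations>
    <entity>
      <id>1@e@doc@gold</id>
      <span>4,11</span>
      <type>EVENT</type>
      <properties><DocTimeRel>BEFORE</DocTimeRel></properties>
    </entity>
    <entity>
      <id>2@e@doc@gold</id>
      <span>15,28</span>
      <type>TIMEX3</type>
      <properties><Class>DATE</Class></properties>
    </entity>
    <relation>
      <id>3@r@doc@gold</id>
      <type>TLINK</type>
      <properties>
        <Source>2@e@doc@gold</Source>
        <Type>CONTAINS</Type>
        <Target>1@e@doc@gold</Target>
      </properties>
    </relation>
  </annotations>
</data>"""


def parse_spans(span_text):
    """Turn '4,11' or '4,11;20,25' into a list of (start, end) tuples."""
    return [tuple(int(n) for n in part.split(",")) for part in span_text.split(";")]


def read_properties(element):
    """Collect the children of <properties> as a tag -> text dictionary."""
    props = element.find("properties")
    return {p.tag: p.text for p in props} if props is not None else {}


def read_annotations(xml_string):
    """Return (entities keyed by id, list of relations) from Anafora-style XML."""
    root = ET.fromstring(xml_string)
    entities = {}
    for entity in root.iter("entity"):
        entities[entity.findtext("id")] = {
            "type": entity.findtext("type"),
            "spans": parse_spans(entity.findtext("span")),
            "properties": read_properties(entity),
        }
    relations = [
        {"type": rel.findtext("type"), "properties": read_properties(rel)}
        for rel in root.iter("relation")
    ]
    return entities, relations


def covered_text(text, spans):
    """Recover the annotated text from character offsets (end is exclusive)."""
    return " ".join(text[start:end] for start, end in spans)


if __name__ == "__main__":
    entities, relations = read_annotations(SAMPLE_XML)
    for entity_id, entity in entities.items():
        print(entity_id, entity["type"], covered_text(SAMPLE_TEXT, entity["spans"]))
    for relation in relations:
        print(relation["type"], relation["properties"])
```

For a real document, the XML file in the document's directory can be read with ET.parse(path).getroot() in place of ET.fromstring, and the offsets resolved against the plain text file stored alongside it.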
