Data

Data Use Agreements

All notes that are part of the Clinical TempEval corpus have been de-identified as much as possible while still leaving the dates intact. To access the source texts for the Clinical TempEval corpus, participants must agree to handle the data appropriately by signing a data use agreement (DUA) with the Mayo Clinic. To request a data use agreement, please follow the instructions at the THYME data website.

Please apply for a data use agreement as soon as possible! The process may take some time.

Please read the DUA carefully before agreeing to it. Among other things, you will be agreeing:

  • to keep the data secure using restricted passwords and encryption (e.g., on a secure server, not on personal computers)
  • not to attempt to re-identify the data
  • not to redistribute the data to anyone else for any purpose

Releases of Source Text

Once you have obtained a data use agreement, you will be given the THYME corpus, which consists of three zip archives: train.zip, dev.zip and test.zip. These contain the source texts for Clinical TempEval. The zip files are password protected, and the passwords will be released on the Clinical TempEval Google Group. For this year's Clinical TempEval, train.zip contains the training data and dev.zip contains the testing data. The train.zip password has already been posted to the group. The dev.zip password will be released when the evaluation begins. The test.zip files will not be used in this year's Clinical TempEval.
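As a rough illustration of unpacking one of these archives, here is a minimal Python sketch. The function name extract_archive, the target directory and the password handling are assumptions made for the example, not part of the official release. Python's zipfile module can only decrypt the legacy ZIP encryption scheme, so if the archives use AES encryption a different tool will be needed. Extract only to a location that satisfies the data use agreement.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, password, target_dir):
    """Extract a password-protected zip archive and return its member names.

    Hypothetical helper: the paths and the password are supplied by the caller.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # zipfile expects the password as bytes, not str.
        archive.extractall(path=target, pwd=password.encode("utf-8"))
        return archive.namelist()
```

For example, extract_archive("train.zip", password, "/secure/thyme/train") would unpack the training texts, where the password is the one posted to the group and the target path is a placeholder for a secure location.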

Releases of Annotations

The annotations for the Clinical TempEval data will be released on the THYME data GitHub account. The following releases are currently available:

Corpus Annotation

The Clinical TempEval corpus has been annotated using the guidelines for the annotation of times, events and temporal relations in clinical notes, an extension of ISO TimeML developed by the THYME project. A corpus of clinical notes and pathology reports from the Mayo Clinic was annotated as follows:

  1. Annotators identified time and event expressions, along with their attributes (except normalized values for time expressions)
  2. Adjudicators revised and finalized the time and event expressions and their attributes
  3. Annotators identified temporal relations between pairs of events and between events and times
  4. Adjudicators revised and finalized the temporal relations
  5. The Pheme project annotated normalized values for each of the time expressions

Data Format

The Clinical TempEval data is in Anafora format. This means that each document in the corpus has its own directory, which contains a plain text file and an XML file. The XML file contains stand-off annotations indicating where each event, time and temporal relation has been identified in the text. For example, given the text:

The XML file will contain something like the Anafora file here.
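To make the stand-off layout concrete, here is a minimal Python sketch for reading one directory per document. It rests on several assumptions that should be checked against the released files: that annotations appear as entity and relation elements with id, type, span and properties children, that spans are character offsets written as "start,end" (with ";" separating the pieces of a discontinuous span), and that each directory holds exactly one XML file and one non-XML text file. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def _properties(element):
    """Collect the child elements of <properties> into a dict (last tag wins)."""
    props = element.find("properties")
    if props is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in props}


def parse_annotations(root):
    """Return (entities, relations) from an Anafora-style XML root element."""
    entities, relations = [], []
    for entity in root.iter("entity"):
        # Assumed span syntax: "4,11" or "4,11;15,21" for discontinuous spans.
        raw_span = (entity.findtext("span", default="") or "").strip()
        spans = [
            tuple(int(offset) for offset in part.split(","))
            for part in raw_span.split(";")
            if part
        ]
        entities.append({
            "id": (entity.findtext("id", default="") or "").strip(),
            "type": (entity.findtext("type", default="") or "").strip(),
            "spans": spans,
            "properties": _properties(entity),
        })
    for relation in root.iter("relation"):
        relations.append({
            "id": (relation.findtext("id", default="") or "").strip(),
            "type": (relation.findtext("type", default="") or "").strip(),
            "properties": _properties(relation),
        })
    return entities, relations


def read_anafora_xml(xml_text):
    """Parse annotations from an XML string."""
    return parse_annotations(ET.fromstring(xml_text))


def covered_text(text, spans):
    """Return the source text covered by a list of (start, end) offsets."""
    return " ".join(text[start:end] for start, end in spans)


def iter_documents(corpus_dir):
    """Yield (name, text, entities, relations) for each document directory."""
    for doc_dir in sorted(Path(corpus_dir).iterdir()):
        if not doc_dir.is_dir():
            continue
        files = [p for p in sorted(doc_dir.iterdir()) if p.is_file()]
        xml_paths = [p for p in files if p.suffix == ".xml"]
        text_paths = [p for p in files if p.suffix != ".xml"]
        if len(xml_paths) != 1 or len(text_paths) != 1:
            continue  # skip directories that do not match the assumed layout
        text = text_paths[0].read_text(encoding="utf-8")
        entities, relations = parse_annotations(ET.parse(xml_paths[0]).getroot())
        yield doc_dir.name, text, entities, relations
```

Because the annotations are stand-off, the text file is never modified: each entity is resolved by slicing the text with its offsets, as covered_text does, and relations refer to entities by id through their properties.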