Evaluation
Participants in Clinical TempEval may take part in any or all of the six tasks (TS, ES, TA, EA, DR, CR). Clinical TempEval will also have a two-phase evaluation, so that systems can either start from the plain text or incorporate some manual annotations. Specifically, the phases will be:
- Only the plain text is given
- Manually annotated event and time expression spans and attributes are given (i.e., manual TS, ES, TA and EA)
The evaluation metrics that will be applied for each of these phases are:
- Only the plain text is given
  - TS, ES: precision, recall and F1
  - TA, EA: precision, recall and F1 for each attribute, and an overall precision, recall and F1 where a time/event is marked correct only if all attributes are correct
  - DR: precision, recall and F1
  - CR: precision, recall and F1, and closure-based precision, recall and F1, where temporal closure is run to infer additional relations on both the system and the reference relations and scores are calculated on the post-closure relations.
- Manually annotated event and time expression spans and attributes are given
  - DR: accuracy
  - CR: precision, recall and F1, and closure-based precision, recall and F1.
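The metrics listed above can be illustrated with a minimal sketch. It is not the official evaluation script. The data layout (spans as character-offset pairs, attributes as dictionaries), the attribute names `type` and `polarity`, the function names, and the use of a single transitive relation type are all assumptions made for illustration. The closure here applies only a transitivity rule, and the official scorer may compute closure-based scores differently.

```python
from itertools import product


def prf(system, reference):
    """Precision, recall and F1 over two collections of hashable items."""
    system, reference = set(system), set(reference)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def attribute_items(annotations, name=None):
    """Turn {span: {attribute: value}} into items that prf() can compare.

    With a name, each item is (span, value) for that one attribute.
    Without a name, each item carries every attribute, so an annotation
    only matches when all of its attributes are correct.
    """
    if name is None:
        return {(span, tuple(sorted(attrs.items())))
                for span, attrs in annotations.items()}
    return {(span, attrs.get(name)) for span, attrs in annotations.items()}


def transitive_closure(pairs):
    """Closure of (source, target) pairs under transitivity only.

    Assumes every pair has the same transitive relation type.
    """
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and a != d and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def closure_prf(system, reference):
    """Score relations after running closure on both sides."""
    return prf(transitive_closure(system), transitive_closure(reference))


# Hypothetical spans as (start, end) character offsets.
ref_spans = {(0, 5), (10, 18), (30, 34), (60, 66)}
sys_spans = {(0, 5), (10, 18), (40, 44), (50, 52)}

# Hypothetical attribute values for two events.
ref_events = {(0, 5): {"type": "N/A", "polarity": "POS"},
              (10, 18): {"type": "N/A", "polarity": "NEG"}}
sys_events = {(0, 5): {"type": "N/A", "polarity": "POS"},
              (10, 18): {"type": "N/A", "polarity": "POS"}}

# Hypothetical relations of one transitive type between events A, B and C.
ref_relations = {("A", "B"), ("B", "C")}
sys_relations = {("A", "C")}
```

With this toy data, `prf(sys_spans, ref_spans)` gives 0.5 for all three scores. The per-attribute score for `type` is perfect, while the overall attribute score drops to 0.5 because one event has a wrong `polarity`. The system relation `("A", "C")` matches nothing before closure, but after closure it is found among the inferred reference relations, so `closure_prf(sys_relations, ref_relations)` gives a precision of 1.0 and a recall of one third.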
The evaluation period is 05 Dec 2014 (UTC+12) - 22 Dec 2014 (UTC-12). Participants may choose to download the test data at any time during the evaluation period, but once they have downloaded it, they will have only 5 days to submit their results.
Phase 1 Submissions
If you have completed the data use agreement process with the Mayo Clinic and received the THYME corpus, then you already have the dev.zip file, which contains the raw text we will be using as a test set this year. To obtain the password for this dev.zip file, submit the form at:
https://docs.google.com/forms/d/163I0lqgjs_mt0SE91E4Hi8v_-joeIHt_r01gkSTzvVk/viewform
You will receive an email after filling out this form giving you instructions on how to:
- Download the "test data" (in our case, just a text file containing the password for dev.zip)
- Submit your system description
- Upload your system output
You will have 5 days, starting from the point at which you submit the form above, to upload your system output on the text files from dev.zip.
Phase 2 Submissions
The phase 2 data includes EVENT and TIMEX3 annotations corresponding to texts from the dev.zip file. Even if you do not intend to participate in Phase 1, you must still submit the form for Phase 1 to obtain the password for the dev.zip file. To get the additional EVENT and TIMEX3 annotations, submit the form at:
https://docs.google.com/forms/d/1hbMOzFkJ0xVZu_LNNzvrQ1CBWCMwoDBcpYir3ZqveoI/viewform
As with phase 1, you will receive an email after filling out this form giving you instructions on how to download the data, submit your system description, and upload your system output. You will have 5 days, starting from the point at which you submit the form above, to upload your system output on the text files from dev.zip.
If you plan to participate in both phase 1 and phase 2, you must submit your phase 1 results before downloading the phase 2 data; otherwise, your system will be disqualified.
System Output Format
The format of submissions is the same for both phase 1 and phase 2. Your system output should take the same format and organization as the Anafora XML files in the training data. Your directory structure should look like:
- SystemName-RunName
  - ID004_clinic_010
    - ID004_clinic_010.Temporal-Relation.system.completed.xml
  - ID004_clinic_012
    - ID004_clinic_012.Temporal-Relation.system.completed.xml
  - ID004_path_011
    - ID004_path_011.Temporal-Relation.system.completed.xml
  - ID005_clinic_013
    - ID005_clinic_013.Temporal-Relation.system.completed.xml
  - ...
Before uploading your results, please check that your Anafora XML files are valid and are read correctly by the evaluation script.
Note that you only need to submit system output on the ID* files as we will only be evaluating on colon cancer notes this year. However, there's no harm if you include system output for the other files - they will be automatically ignored during the official evaluation.
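As a rough pre-upload check, the sketch below walks a submission directory and confirms that each document folder contains a file with the expected name and that the file parses as well-formed XML. It is not a substitute for running the official evaluation script: it does not validate the Anafora schema, and the function name `check_submission` is a hypothetical helper introduced here.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# File name suffix taken from the directory layout shown above.
SUFFIX = ".Temporal-Relation.system.completed.xml"


def check_submission(root):
    """Return a list of problems found under a SystemName-RunName directory."""
    problems = []
    for doc_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        xml_path = doc_dir / (doc_dir.name + SUFFIX)
        if not xml_path.is_file():
            problems.append(f"missing {xml_path}")
            continue
        try:
            # Parsing only checks well-formedness, not Anafora validity.
            ET.parse(str(xml_path))
        except ET.ParseError as err:
            problems.append(f"not well-formed: {xml_path} ({err})")
    return problems
```

Calling `check_submission("SystemName-RunName")` returns an empty list when every document folder holds a parseable file with the expected name. Any entries in the returned list are worth fixing before trying the evaluation script.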