Data and Tools
Corpus
For Tasks 1 and 2 we shall use the ShARe corpus, which contains clinical notes from the MIMIC II database that were manually annotated for disorder mentions and normalized to a UMLS Concept Unique Identifier (CUI) when possible.
The corpus annotation guidelines contain more details and examples.
Data Format
The annotations use a pipe-delimited, stand-off, character-offset format. Each annotation line follows this template:
report name|disorder-span|cui|Norm_NI|Cue_NI|Norm_SC|Cue_SC|Norm_UI|Cue_UI|Norm_CC|Cue_CC|Norm_SV|Cue_SV|Norm_CO|Cue_CO|Norm_GC|Cue_GC|Norm_BL|Cue_BL|Norm_DT|Norm_TE|Cue_TE
System results should be submitted in the same format; an example line is shown below. To simplify scoring, we will also ask participants to arrange their submissions in a particular directory structure, because it is harder to score multiple systems automatically when they follow different layouts. We will make this structure available to you during the evaluation period. We will also provide a validation script that identifies common errors noticed in previous scoring runs, which should make the scoring process smoother.
09388-093839-DISCHARGE_SUMMARY.txt|30-36|C0040128|*no|*NULL|*patient|*NULL|*no|*NULL|*false|*NULL|*unmarked|*NULL|severe|*NULL|*false|*NULL|C0040132|*NULL|Before|*None|*NULL
Trial Data
The following tarball contains four documents along with their annotations:
semeval-2015-task-7-trial.tar.gz
Training Data
Clinical data, even in its de-identified form, is subject to various privacy controls. To obtain the annotations along with the associated clinical notes, participants must complete the following steps, which ensure that they understand the ethical aspects of handling human subjects data. The required training is free.
- Register to participate. This form is separate from the main SemEval registration form.
- Obtain a human subjects training certificate. If you do not have a certificate, you can take the CITI training course or the NIH training course.
- Go to the Physionet website.
- Click on the link for “creating a PhysioNetWorks account” (near the middle of the page) and follow the instructions.
- Once you have a login and password, go to MIMIC II and accept the terms of a Data Use Agreement (DUA).
- You will receive an email telling you to fill in your information on the DUA and email it back with your human subjects training certificate.
- Fill out the DUA using the word “SemEval-2015” in the description of the project and mail it back (pasted into the email) with your human subjects certificate attached.
- Once you are approved to use MIMIC II, go to your PhysioNetWorks Home.
- Once here, you will see a list of all projects.
- Select the link to "SemEval 2015 -- Analysis of Clinical Text".
- Apply for access to the data by clicking on the link labeled "here".
- Once you do this, the organizers will get a request to add you to the project.
- After the organizers give you access, you will get an email informing you that you can access the data.
UMLS Knowledge Source
If you do not already have access to the UMLS knowledge sources, you will have to register at the following NIH website and request a license:
https://uts.nlm.nih.gov/home.html
Once you have obtained the license you can download the UMLS release files from the following URL:
http://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html
Several versions are available there; NIH updates them about twice a year. The version used for the ShARe annotations is 2012AB. This resource contains mappings for many different terminologies. The ones relevant to this task are the UMLS CUIs for the SNOMED-CT terminology.
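One way to collect the CUIs that belong to SNOMED-CT is to scan the `MRCONSO.RRF` file from the UMLS release, which is pipe-delimited with the CUI in the first column and the source abbreviation (SAB) in the twelfth. The sketch below is only an illustration: the function name `cuis_for_source` is hypothetical, the default source abbreviation `"SNOMEDCT"` is an assumption that should be checked against the `MRSAB.RRF` file of the release you download, and the sample rows are synthetic placeholders rather than real UMLS content.

```python
import io
from typing import Iterable, Set

# Zero-based column positions in MRCONSO.RRF (pipe-delimited, trailing pipe).
CUI_COL = 0
SAB_COL = 11


def cuis_for_source(lines: Iterable[str], sab: str = "SNOMEDCT") -> Set[str]:
    """Return the set of CUIs with at least one row from the given source.

    The default `sab` value is an assumption; verify it for your release.
    """
    cuis = set()
    for line in lines:
        cols = line.rstrip("\n").split("|")
        if len(cols) > SAB_COL and cols[SAB_COL] == sab:
            cuis.add(cols[CUI_COL])
    return cuis


# Synthetic rows in the MRCONSO.RRF column layout (NOT real UMLS data).
sample = io.StringIO(
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||100001||SNOMEDCT|PT|100001|example disorder|9|N||\n"
    "C0000002|ENG|P|L0000002|PF|S0000002|Y|A0000002||D000002||MSH|MH|D000002|another example|0|N||\n"
    "C0000001|ENG|S|L0000003|PF|S0000003|Y|A0000003||100001||SNOMEDCT|SY|100001|example synonym|9|N||\n"
)

snomed_cuis = cuis_for_source(sample)
print(sorted(snomed_cuis))
# prints: ['C0000001']

# With a real release, the same function can read the file line by line:
# with open("MRCONSO.RRF", encoding="utf-8") as handle:
#     snomed_cuis = cuis_for_source(handle)
```

Reading the file as a stream keeps memory use low, which matters because `MRCONSO.RRF` contains several million rows.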