Data and Tools
Corpus
For Tasks A and B we shall use the ShARe corpus, which contains clinical notes from the MIMIC II database that have been manually annotated for disorder mentions and normalized to a UMLS Concept Unique Identifier (CUI) when possible.
The corpus annotation guidelines contain more details and examples.
Data Format
The annotations follow a pipe-delimited stand-off character-offset format.
report name || annotation type || cui || char start || char end
08100-027513-DISCHARGE_SUMMARY.txt||Disease_Disorder||c0332799||459||473
System results should be submitted in the same format. If an annotation contains disjoint (non-contiguous) spans, additional char start and char end values are appended. For example, in the sentence "Abdomen: no distention is noted.", the single annotation for "abdominal distention, C0235698" covers the spans 0-6 ("Abdomen") and 12-21 ("distention"), and is written as follows:
08100-027513-DISCHARGE_SUMMARY.txt||Disease_Disorder||c0332799||0||6||12||21
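The format above can be read and written with a few lines of code. The following is a minimal sketch, not an official tool: the names `Annotation`, `parse_annotation`, `format_annotation`, and `mention_text` are hypothetical, and the `end_inclusive` default is an assumption inferred from the "Abdomen" example (0-6 covering seven characters), which should be checked against the annotation guidelines.

```python
from dataclasses import dataclass
from typing import List, Tuple

DELIM = "||"


@dataclass
class Annotation:
    report: str                    # report (file) name
    ann_type: str                  # annotation type, e.g. Disease_Disorder
    cui: str                       # UMLS CUI as it appears in the file
    spans: List[Tuple[int, int]]   # one (char start, char end) pair per span


def parse_annotation(line: str) -> Annotation:
    """Parse one stand-off line; disjoint spans add extra offset pairs."""
    fields = line.strip().split(DELIM)
    # Three fixed fields followed by an even, non-zero number of offsets.
    if len(fields) < 5 or (len(fields) - 3) % 2 != 0:
        raise ValueError(f"unexpected field count: {line!r}")
    report, ann_type, cui = fields[:3]
    offsets = [int(x) for x in fields[3:]]
    spans = list(zip(offsets[0::2], offsets[1::2]))
    return Annotation(report, ann_type, cui, spans)


def format_annotation(ann: Annotation) -> str:
    """Serialize an annotation back into the pipe-delimited format."""
    parts = [ann.report, ann.ann_type, ann.cui]
    for start, end in ann.spans:
        parts += [str(start), str(end)]
    return DELIM.join(parts)


def mention_text(text: str, spans: List[Tuple[int, int]],
                 end_inclusive: bool = True) -> str:
    """Join the text covered by each span (end handling is an assumption)."""
    extra = 1 if end_inclusive else 0
    return " ".join(text[start:end + extra] for start, end in spans)
```

For instance, `parse_annotation` applied to the disjoint-span line above yields two spans, and `format_annotation` turns the result back into the same line, which is a convenient round-trip check before submitting system output.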
Trial Data
The following tarball contains four documents along with their annotations:
semeval-2014-task-7-trial.tar.gz
Training Data
Clinical data, even in its de-identified form, is subject to various privacy controls. To obtain the annotations along with the associated clinical notes, participants must complete the following steps, which ensure that they understand the ethical aspects of handling human subjects data. The required training is free.
- Register to participate. This form is separate from the main SemEval registration form.
- Obtain a human subjects training certificate. If you do not have a certificate, you can take the CITI training course or the NIH training course.
- Go to the PhysioNet website.
- Click on the link for “creating a PhysioNetWorks account” (near the middle of the page) and follow the instructions.
- Once you have a login and password, go to MIMIC II and accept the terms of the Data Use Agreement (DUA).
- You will receive an email telling you to fill in your information on the DUA and email it back with your human subjects training certificate.
- Fill out the DUA, using the word “SemEval-2014” in the description of the project, and mail it back (pasted into the email) with your human subjects certificate attached.
- Once you are approved to use MIMIC II, go to your PhysioNetWorks Home.
- Once there, you will see a list of all projects.
- Select the link to "SemEval 2014 -- Analysis of Clinical Text".
- Apply for access to the data by clicking on the link labeled "here".
- Once you do this, the organizers will get a request to add you to the project.
- After the organizers give you access, you will get an email informing you that you can access the data.
UMLS Knowledge Source
If you do not already have access to the UMLS knowledge sources, you will have to register at the following NIH website and request a license:
https://uts.nlm.nih.gov/home.html
Once you have obtained the license, you can download the UMLS release files from the following URL:
http://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html
Here you will find multiple versions; NIH updates them about twice a year. The version of the database used for the ShARe annotations is 2012AB. This resource contains mappings for many different terminologies. The ones relevant in our case are the UMLS CUIs for the SNOMED-CT terminology.
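Restricting the UMLS to SNOMED-CT CUIs could be sketched as follows. This is an illustration under stated assumptions, not part of the task tooling: it assumes the Rich Release Format file MRCONSO.RRF with its standard single-pipe layout (CUI in the first column, source abbreviation in the twelfth), the function name `cuis_for_source` is hypothetical, and the source abbreviation used for SNOMED-CT (`"SNOMEDCT"` here) should be verified against the 2012AB release you download.

```python
from typing import Iterable, Set

CUI_FIELD = 0    # first column of MRCONSO.RRF: concept unique identifier
SAB_FIELD = 11   # twelfth column of MRCONSO.RRF: source abbreviation


def cuis_for_source(rows: Iterable[str], source: str = "SNOMEDCT") -> Set[str]:
    """Collect the CUIs that have at least one atom from the given source.

    `source` is an assumption; check the abbreviation used in your release.
    """
    cuis: Set[str] = set()
    for row in rows:
        fields = row.rstrip("\n").split("|")
        # Skip malformed or truncated rows instead of failing.
        if len(fields) > SAB_FIELD and fields[SAB_FIELD] == source:
            cuis.add(fields[CUI_FIELD])
    return cuis
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `cuis_for_source(open("MRCONSO.RRF", encoding="utf-8"))`, so the file is streamed rather than loaded into memory.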