Data and Tools < SemEval-2015 Task 14

Data and Tools

Corpus

For Tasks 1 and 2 we shall use the ShARe corpus, which contains clinical notes from the MIMIC II database. The notes are manually annotated for disorder mentions, and each mention is normalized to a UMLS Concept Unique Identifier (CUI) when possible.
The corpus annotation guidelines contain more details and examples.

Data Format

The annotations follow a pipe-delimited, stand-off, character-offset format. Each annotation record follows this template:

report name|disorder-span|cui|Norm_NI|Cue_NI|Norm_SC|Cue_SC|Norm_UI|Cue_UI|
Norm_CC|Cue_CC|Norm_SV|Cue_SV|Norm_CO|Cue_CO|Norm_GC|Cue_GC|Norm_BL|Cue_BL|
Norm_DT|Norm_TE|Cue_TE

System results should be submitted in the same format, as in the example below. To simplify scoring, we will also ask participants to arrange their submissions in a particular directory structure, because it is harder to score multiple systems automatically when they follow different layouts. We will make this structure available to you during the evaluation period. We will also provide a validation script that identifies common errors noticed in previous scoring runs, which should make the scoring process smoother.

09388-093839-DISCHARGE_SUMMARY.txt|30-36|C0040128|*no|*NULL|*patient|*NULL|*no|
*NULL|*false|*NULL| *unmarked|*NULL|severe|*NULL|*false|*NULL|C0040132|*NULL|
Before|*None|*NULL
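
As a rough illustration of the format, the sketch below splits one record into named fields. It is not the official validation script. The field names are taken from the template above, but the function names (parse_annotation, parse_span) are hypothetical, and the sketch assumes that the line breaks in the template and example are display wrapping, that stray spaces around values can be ignored, and that the disorder span is a single start-end pair.

```python
# Field names in the order given by the annotation template.
FIELD_NAMES = [
    "report_name", "disorder_span", "cui",
    "Norm_NI", "Cue_NI", "Norm_SC", "Cue_SC", "Norm_UI", "Cue_UI",
    "Norm_CC", "Cue_CC", "Norm_SV", "Cue_SV", "Norm_CO", "Cue_CO",
    "Norm_GC", "Cue_GC", "Norm_BL", "Cue_BL",
    "Norm_DT", "Norm_TE", "Cue_TE",
]


def parse_annotation(record):
    """Split one pipe-delimited record into a dict keyed by field name.

    Assumption: a record may be wrapped over several lines for display,
    so the lines are joined before splitting on the pipe character.
    """
    joined = "".join(record.splitlines())
    values = [value.strip() for value in joined.split("|")]
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            "expected %d fields, found %d" % (len(FIELD_NAMES), len(values))
        )
    return dict(zip(FIELD_NAMES, values))


def parse_span(span):
    """Convert a span such as '30-36' into a (start, end) pair of integers.

    Assumption: the span is a single contiguous start-end pair.
    """
    start, end = span.split("-")
    return int(start), int(end)


# The example record from this page, wrapped as it is displayed.
EXAMPLE = (
    "09388-093839-DISCHARGE_SUMMARY.txt|30-36|C0040128|*no|*NULL|*patient|*NULL|*no|\n"
    "*NULL|*false|*NULL| *unmarked|*NULL|severe|*NULL|*false|*NULL|C0040132|*NULL|\n"
    "Before|*None|*NULL"
)

annotation = parse_annotation(EXAMPLE)
print(annotation["report_name"], parse_span(annotation["disorder_span"]))
```

Checking the field count before building the dictionary catches records with a missing or extra pipe, which is the kind of formatting slip that is easy to make when writing system output by hand.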

Trial Data

The following tarball contains four documents along with their annotations:

semeval-2015-task-7-trial.tar.gz

Training Data

Clinical data, even in its de-identified form, is subject to various privacy controls. To obtain the annotations along with the associated clinical notes, participants must complete the steps below, which ensure that they understand the ethical aspects of handling human subjects data. The required training is free.

Register to participate. This form is separate from the main SemEval registration form.
Obtain a human subjects training certificate. If you do not have a certificate, you can take the CITI training course or the NIH training course.
Go to the Physionet website.
Click on the link for “creating a PhysioNetWorks account” (near the middle of the page) and follow the instructions.
Once you have a login and password, go to MIMIC II and accept the terms of the Data Use Agreement (DUA).
You will receive an email telling you to fill in your information on the DUA and email it back with your human subjects training certificate.
Fill out the DUA using the word “SemEval-2015” in the description of the project and mail it back (pasted into the email) with your human subjects certificate attached.
Once you are approved to use MIMIC II, go to your PhysioNetWorks Home.
There you will see a list of all projects.
Select the link to "SemEval 2015 -- Analysis of Clinical Text".
Apply for access to the data by clicking on the link labeled "here".
Once you do this, the organizers will get a request to add you to the project.
After the organizers give you access, you will get an email informing you that you can access the data.

UMLS Knowledge Source

If you do not already have access to the UMLS knowledge sources, you will need to register at the following NIH website and request a license:

https://uts.nlm.nih.gov/home.html

Once you have obtained the license you can download the UMLS release files from the following URL:

http://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html

Here you will find several versions; NIH updates the release about twice a year. The version used for the ShARe annotations is 2012AB. This resource contains mappings for many different terminologies. The ones relevant in our case are the UMLS CUIs for the SNOMED-CT terminology.
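
One possible way to restrict a UMLS release to SNOMED-CT concepts is to filter the pipe-delimited MRCONSO.RRF file on its source abbreviation (SAB) column. The sketch below is only an illustration: the function name snomed_cuis is hypothetical, the sample rows are invented placeholders rather than real UMLS content, and the SAB value "SNOMEDCT" is an assumption that should be checked against the MRSAB.RRF file of the release you download.

```python
# Column positions in MRCONSO.RRF (zero-based): CUI, language (LAT),
# source abbreviation (SAB) and the term string (STR).
CUI_COL, LAT_COL, SAB_COL, STR_COL = 0, 1, 11, 14


def snomed_cuis(lines, sab="SNOMEDCT", language="ENG"):
    """Collect CUIs, with their term strings, for one source vocabulary.

    `lines` is any iterable of MRCONSO.RRF rows, for example an open file.
    The default `sab` value is an assumption; verify it for your release.
    """
    cui_to_terms = {}
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= STR_COL:
            continue  # skip blank or malformed rows
        if fields[SAB_COL] == sab and fields[LAT_COL] == language:
            cui_to_terms.setdefault(fields[CUI_COL], set()).add(fields[STR_COL])
    return cui_to_terms


# Invented placeholder rows that only mimic the column layout.
SAMPLE_ROWS = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||100001||SNOMEDCT|PT|100001|Example disorder|9|N||",
    "C0000002|ENG|P|L0000002|PF|S0000002|Y|A0000002||||MSH|MH|D000001|Another term|0|N||",
]

snomed_only = snomed_cuis(SAMPLE_ROWS)
print(sorted(snomed_only))
```

With a licensed release, the same function could be called on an open file handle for MRCONSO.RRF instead of the placeholder list.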
