SemEval-2015 Task 14: Analysis of Clinical Text

We describe two tasks, named entity recognition (Task 1) and template slot filling (Task 2), for clinical texts. The tasks leverage annotations from the ShARe corpus, which consists of clinical notes with annotated mentions of disorders, along with their normalization to a medical terminology and eight additional attributes. The purpose of these tasks was to identify advances in clinical named entity recognition and establish the state of the art in disorder template slot filling. Task 2 consisted of two subtasks: template slot filling given gold-standard disorder spans (Task 2a) and end-to-end disorder span identification together with template slot filling (Task 2b). For Task 1 (disorder span detection and normalization), 16 teams participated. The best system yielded a strict F1-score of 75.7, with a precision of 78.3 and recall of 73.2. For Task 2a (template slot filling given gold-standard disorder spans), six teams participated. The best system yielded a combined overall weighted accuracy for slot filling of 88.6. For Task 2b (disorder recognition and template slot filling), nine teams participated. The best system yielded a combined relaxed F (for span detection) and overall weighted accuracy of 80.8.


Introduction
Patient records are abundant with reports, narratives, discussions, and updates about patients. This unstructured part of the record is dense with mentions of clinical entities, such as conditions, anatomical sites, medications, and procedures. Identifying the different entities discussed in a patient record, their status towards the patient, and how they relate to each other is one of the core tasks of clinical natural language processing. Indeed, with robust systems to extract such mentions, along with their associated attributes in the text (e.g., presence of negation for a given entity mention), several high-level applications can be developed such as information extraction, question answering, and summarization.
In biomedicine, there are rich lexicons that can be leveraged for the task of named entity recognition and entity linking or normalization. The Unified Medical Language System (UMLS) represents over 130 lexicons/thesauri with terms from a variety of languages. The UMLS Metathesaurus integrates standard resources such as SNOMED-CT, ICD9, and RxNORM that are used worldwide in clinical care, public health, and epidemiology. In addition, the UMLS also provides a semantic network in which every concept in the Metathesaurus is represented by its Concept Unique Identifier (CUI) and is semantically typed (Bodenreider and McCray, 2003).
SemEval-2015 Task 14, Analysis of Clinical Text, is the newest iteration in a series of community challenges organized around the task of named entity recognition for clinical texts. In SemEval-2014 Task 7 (Pradhan et al., 2014) and the previous 2013 challenge (Pradhan et al., 2013), we focused on the task of named entity recognition for disorder mentions in clinical texts, along with normalization to UMLS CUIs. This year, we shift focus to the task of identifying a series of attributes describing a disorder mention. As in previous challenges, we use the ShARe corpus and introduce a new set of annotations for disorder attributes. In the remainder of this paper, we describe the dataset and the annotations provided to the task participants, the subtasks comprising the overall task, and the results of the teams that participated, along with notable approaches in their systems.
The dataset used is the ShARe corpus (Pradhan et al., 2015). As a whole, it consists of 531 deidentified clinical notes (a mix of discharge summaries and radiology reports) selected from the MIMIC II clinical database version 2.5 (Saeed et al., 2002). Part of the ShARe corpus was released as part of SemEval-2014 Task 7. To enable meaningful comparisons of system performance across years, the SemEval-2015 training set combines the 2014 training and development sets, while the SemEval-2015 development set consists of the 2014 test set. The 2015 test set is a previously unseen set of clinical notes from the ShARe corpus. Table 2 provides descriptive statistics about the different sets. In addition to the ShARe corpus annotations, task participants were provided with a large set of unlabeled deidentified clinical notes, also from MIMIC II (400,000+ notes).
The ShARe corpus contains gold-standard annotations of disorder mentions and a set of attributes, as described in Table 2. We refer to the nine attributes as a disorder template. The annotation schema for the template was derived from the established clinical element model. The complete guidelines for the ShARe annotations are available on the ShARe website. Here, we provide a few examples to illustrate what each attribute captures.
• In the statement "patient denies numbness," the disorder numbness has an associated negation attribute set to "yes."
• In the sentence "son has schizophrenia," the disorder schizophrenia has a subject attribute set to "family member."
• The sentence "Evaluation of MI." contains a disorder (MI) with the uncertainty attribute set to "yes."
• An example of a disorder with a non-default course attribute can be found in the sentence "The cough got worse over the next two weeks.", where its value is "worsened."
• The severity attribute is set to "slight" in "He has slight bleeding."
• In the sentence "Pt should come back if any rash occurs," the disorder rash has a conditional attribute with value "true."
• In the sentence "Patient has a facial rash," the body location associated with the disorder "facial rash" is "face" with CUI C0015450. Note that the body location does not have to be a substring of the disorder mention, even though in this example it is.
The ShARe corpus was annotated following a rigorous process. Annotators were professional coders who trained for the specific task of ShARe annotations. The annotation process consisted of a double annotation step followed by an adjudication phase. For all annotations, in addition to the values for the attributes, their corresponding character spans in the text were recorded and are available as part of the ShARe annotations. Table 3 shows the distribution of the different attributes in the training and development sets.
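As an illustration of the disorder template described above, the following minimal sketch represents one disorder mention and its nine slots as a Python data class. The class name, field names, and the `filled_slots` helper are hypothetical, and `None` is used here only as a placeholder for "not set in this sketch"; it does not reflect the corpus's own default values, which are defined in the ShARe guidelines.

```python
from dataclasses import dataclass, asdict
from typing import Dict, Optional


@dataclass
class DisorderTemplate:
    """One disorder mention plus its nine slots (hypothetical field names)."""
    mention: str                          # disorder text as it appears in the note
    cui: Optional[str] = None             # UMLS CUI of the disorder
    negation: Optional[str] = None        # e.g. "yes" in "patient denies numbness"
    subject: Optional[str] = None         # e.g. "family member"
    uncertainty: Optional[str] = None     # e.g. "yes" in "Evaluation of MI."
    course: Optional[str] = None          # e.g. "worsened"
    severity: Optional[str] = None        # e.g. "slight"
    conditional: Optional[str] = None     # e.g. "true"
    generic: Optional[str] = None
    body_location: Optional[str] = None   # CUI of the anatomical site


def filled_slots(template: DisorderTemplate) -> Dict[str, str]:
    """Return only the slots that were explicitly set in this sketch."""
    return {
        name: value
        for name, value in asdict(template).items()
        if name != "mention" and value is not None
    }


# Two of the examples given in the text.
numbness = DisorderTemplate(mention="numbness", negation="yes")
facial_rash = DisorderTemplate(mention="facial rash", body_location="C0015450")
```

Character spans for the attribute cues, which the corpus also records, are left out of this sketch.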

Tasks
The Analysis of Clinical Text Task is split into two tasks, one on named entity recognition, and one on template slot filling for the named entities. Participants were able to submit to either or both tasks.

Task 1: Disorder Identification
For Task 1, disorder identification, the goal is to recognize the span of a disorder mention in input clinical text and to normalize the disorder to a unique CUI in the UMLS/SNOMED-CT terminology. The UMLS/SNOMED-CT terminology is defined as the set of CUIs in the UMLS, restricted to concepts that are included in the SNOMED-CT terminology.
Participants were free to use any publicly available resources, such as UMLS, WordNet, and Wikipedia, as well as the large corpus of unannotated clinical notes.
The following are examples of input/output for Task 1.
1. In "The rhythm appears to be atrial fibrillation." the span "atrial fibrillation" is the gold-standard disorder, and its normalization is CUI C0004238 (preferred term atrial fibrillation).
2. In "The left atrium is moderately dilated." the disorder span is discontiguous: "left atrium...dilated" and its normalization is CUI C0344720 (preferred term left atrial dilatation).
3. In "53 year old man s/p fall from ladder." the disorder is "fall from ladder" and is normalized to C0337212 (preferred term accidental fall from ladder).
Example 1 represents the easiest cases. Example 2 represents instances of disorders as listed in the UMLS that are best mapped to discontiguous mentions. In Example 3, one has to infer that the description is a synonym of the UMLS preferred term. Finally, in some cases, a disorder mention is present, but there is no good equivalent CUI in UMLS/SNOMED-CT. The disorder is then normalized to "CUI-less".
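To make the notion of a discontiguous mention concrete, here is a minimal sketch of how a Task 1 output could be held in memory: a list of character-offset spans plus a CUI. The `DisorderMention` class and the `locate` helper are hypothetical names, and this is not the official submission format of the task.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DisorderMention:
    """A disorder mention made of one or more character spans (hypothetical)."""
    spans: List[Tuple[int, int]]  # (start, end) character offsets, end exclusive
    cui: str                      # UMLS CUI, or "CUI-less" when no concept fits

    def text(self, note: str) -> str:
        # Join the pieces with "..." to mirror how the example is written above.
        return "...".join(note[start:end] for start, end in self.spans)


def locate(note: str, fragment: str) -> Tuple[int, int]:
    """Return the offsets of the first occurrence of `fragment` in `note`."""
    start = note.index(fragment)
    return (start, start + len(fragment))


# Example 2 from the text: a discontiguous mention.
note = "The left atrium is moderately dilated."
dilation = DisorderMention(
    spans=[locate(note, "left atrium"), locate(note, "dilated")],
    cui="C0344720",
)
```

A contiguous mention such as "atrial fibrillation" is simply the special case with a single span.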

Task 2: Disorder Slot Filling
This task focuses on identifying the normalized value for the nine attributes described above: the CUI of the disorder (very much like in Task 1), negation indicator, subject, uncertainty indicator, course, severity, conditional, generic indicator, and body location. We describe Task 2 as a slot-filling task: given a disorder mention (either provided as gold standard or identified automatically) in a clinical note, identify the normalized value of the nine slots. Note that there are two aspects to slot filling: cues in the text and normalized value. In this task, we focus on the normalized value and ignore cue detection.
To understand the state of the art for this new task, we considered two subtasks. In both cases, given a disorder span, participants are asked to identify the nine attributes related to the disorder. In Task 2a, the gold-standard disorder span(s) are provided as input. In Task 2b, no gold-standard information is provided; systems must recognize spans for disorder mentions and fill in the value of the nine attributes.

Task 1 Evaluation Metrics
Evaluation for Task 1 is reported according to an F-score that captures both the disorder span recognition and the CUI normalization steps. We compute two versions of the F-score:
• Strict F-score: a predicted mention is considered a true positive if (i) the character span of the disorder is exactly the same as for the gold-standard mention; and (ii) the predicted CUI is correct. The predicted disorder is considered a false positive if the span is incorrect or the CUI is incorrect.
• Relaxed F-score: a predicted mention is a true positive if (i) there is any word overlap between the predicted mention span and the gold-standard span (both in the case of contiguous and discontiguous spans); and (ii) the predicted CUI is correct. The predicted mention is a false positive if the span shares no words with the gold-standard span or the CUI is incorrect.
Thus, given $D_{tp}$, the number of true-positive disorder mentions, $D_{fp}$, the number of false-positive disorder mentions, and $D_{fn}$, the number of false-negative disorder mentions, precision, recall, and F-score are computed as
$$P = \frac{D_{tp}}{D_{tp} + D_{fp}}, \quad R = \frac{D_{tp}}{D_{tp} + D_{fn}}, \quad F = \frac{2 \cdot P \cdot R}{P + R}.$$
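The strict and relaxed criteria can be sketched as follows. This is an illustrative approximation rather than the official scorer: it assumes mentions are given as character-offset spans plus a CUI, it approximates "word overlap" with character-offset overlap, and it pairs each predicted mention with at most one gold mention in a greedy fashion. All function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]                  # (start, end), end exclusive
Mention = Tuple[Tuple[Span, ...], str]  # (spans, CUI)


def spans_overlap(a: Tuple[Span, ...], b: Tuple[Span, ...]) -> bool:
    """True if any piece of `a` shares at least one character with `b`."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


def is_match(pred: Mention, gold: Mention, strict: bool) -> bool:
    pred_spans, pred_cui = pred
    gold_spans, gold_cui = gold
    if pred_cui != gold_cui:            # the CUI must be correct in both modes
        return False
    if strict:
        return sorted(pred_spans) == sorted(gold_spans)
    return spans_overlap(pred_spans, gold_spans)


def f_score(preds: List[Mention], golds: List[Mention], strict: bool = True):
    """Return (precision, recall, F) under the assumptions stated above."""
    unmatched = list(golds)
    tp = 0
    for pred in preds:
        hit = next((g for g in unmatched if is_match(pred, g, strict)), None)
        if hit is not None:             # assumption: one-to-one greedy pairing
            unmatched.remove(hit)
            tp += 1
    fp = len(preds) - tp
    fn = len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


golds = [(((4, 15), (30, 37)), "C0344720"), (((50, 60),), "C0004238")]
preds = [(((4, 15),), "C0344720"), (((50, 60),), "C0004238"), (((70, 75),), "C0000000")]
strict_prf = f_score(preds, golds, strict=True)
relaxed_prf = f_score(preds, golds, strict=False)
```

In this toy example the first prediction recovers only part of a discontiguous mention, so it counts as a true positive under the relaxed criterion but not under the strict one.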

Task 2 Evaluation Metrics
We introduce a variety of evaluation metrics, which capture different aspects of the task of disorder template slot filling. Overall, for Task 2a, we report average unweighted accuracy, weighted accuracy, and per-slot weighted accuracy for each of the nine slots. For Task 2b, we report the same metrics, and in addition report relaxed F for span identification.
We now describe per-disorder evaluation metrics, and then describe the overall evaluation metrics, which provide aggregated system assessment. Given the $K$ slots $(s_1, \ldots, s_K)$ to fill (in our task the nine different slots), each slot $s_k$ has $n_k$ possible normalized values $(s_k^i)_{i \in 1..n_k}$. For a given disorder, its gold-standard value for slot $s_k$ is denoted $gs_k$, and its predicted value is denoted $ps_k$.

Per-Disorder Evaluation Metrics
Per-disorder unweighted accuracy The unweighted accuracy represents the ability of a system to identify all the slot values for a given disorder. The per-disorder unweighted accuracy is simply defined as:
$$\frac{\sum_{k=1}^{K} I(gs_k, ps_k)}{K}$$
where $I$ is the identity function: $I(x, y) = 1$ if $x = y$ and 0 otherwise.
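As a small worked example of this definition, the sketch below computes the unweighted accuracy for one disorder whose nine slots are stored in dictionaries. The slot keys and the slot values used in the example are illustrative assumptions, not the corpus's official labels.

```python
from typing import Dict, Optional

# Hypothetical slot keys for the nine slots of the template.
SLOTS = ["cui", "negation", "subject", "uncertainty", "course",
         "severity", "conditional", "generic", "body_location"]


def unweighted_accuracy(gold: Dict[str, Optional[str]],
                        pred: Dict[str, Optional[str]]) -> float:
    """Fraction of the K slots whose predicted value equals the gold value."""
    correct = sum(1 for slot in gold if slot in pred and pred[slot] == gold[slot])
    return correct / len(gold)


# Illustrative values only; None stands for a NULL body location here.
gold = {"cui": "C0004238", "negation": "no", "subject": "patient",
        "uncertainty": "no", "course": "unmarked", "severity": "unmarked",
        "conditional": "false", "generic": "false", "body_location": None}
pred = dict(gold, negation="yes")  # one slot wrong, eight slots right
accuracy = unweighted_accuracy(gold, pred)
```

With one wrong slot out of nine, the per-disorder unweighted accuracy is 8/9.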
Per-disorder weighted accuracy The weighted per-disorder accuracy takes into account the prevalence of different values for each of the slots. This metric captures how good a system is at identifying rare values of different slots. The weights are thus defined as follows:
• The CUI slot's weight is set to 1, for all CUI values.
• The body location slot's weight is defined as weight(NULL) = 1 - prevalence(NULL), and the weight for any non-NULL value (including CUI-less) is set to weight(CUI) = 1 - prevalence(body location with a non-NULL value).
• For each other slot $s_k$, we define $n_k$ weights $\mathrm{weight}(s_k^i)$ (one for each of its possible normalized values) as follows:
$$\mathrm{weight}(s_k^i) = 1 - \mathrm{prevalence}(s_k^i)$$
where $\mathrm{prevalence}(s_k^i)$ is the prevalence of value $s_k^i$ in the overall corpus (training, development, and testing sets). The weights are such that highly prevalent values have smaller weights and rare values have bigger weights.
Thus, weighted per-disorder accuracy is defined as
$$\sum_{k=1}^{K} \mathrm{weight}(gs_k) \cdot I(gs_k, ps_k)$$
where, as above, $gs_k$ is the gold-standard value of slot $s_k$, $ps_k$ is the predicted value of slot $s_k$, and $I$ is the identity function: $I(x, y) = 1$ if $x = y$ and 0 otherwise.
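The following sketch illustrates the weighting idea on two slots. It assumes that each value's weight is one minus its relative frequency in a list of observed gold values, in line with the body location definition above, and it returns the weighted sum over slots as written above without any further normalization. The function names and the toy counts are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def value_weights(observed_values: List[str]) -> Dict[str, float]:
    """Assumed weighting: weight(value) = 1 - prevalence(value)."""
    counts = Counter(observed_values)
    total = sum(counts.values())
    return {value: 1.0 - count / total for value, count in counts.items()}


def weighted_score(gold: Dict[str, str], pred: Dict[str, str],
                   weights: Dict[str, Dict[str, float]]) -> float:
    """Sum of weight(gold value) over the slots that were predicted correctly."""
    return sum(
        weights[slot][gold[slot]]
        for slot in gold
        if slot in pred and pred[slot] == gold[slot]
    )


# Toy prevalence data: "yes" and "slight" are the rare values.
weights = {
    "negation": value_weights(["no"] * 8 + ["yes"] * 2),
    "severity": value_weights(["unmarked"] * 9 + ["slight"]),
}
gold = {"negation": "yes", "severity": "unmarked"}
pred = {"negation": "yes", "severity": "slight"}
score = weighted_score(gold, pred, weights)
```

Getting the rare value "yes" right contributes a weight of about 0.8, whereas a correct "unmarked" severity would have contributed only about 0.1, which is the behavior the metric is meant to reward.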

Overall Evaluation Metrics
Weighted and Unweighted Accuracy. Armed with the per-disorder unweighted and weighted accuracy scores, we can compute an average across all true-positive disorders. For Task 2a, the disorders are provided, so they are all true positives, but for Task 2b, it is important to note that we only consider the true-positive disorders to compute the overall accuracy.
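The averaging step can be sketched in a few lines. The input format, a list of (is_true_positive, per-disorder accuracy) pairs, and the function name are assumptions made for illustration; returning 0.0 when there is no true positive is also a choice of this sketch.

```python
from statistics import mean
from typing import List, Tuple


def overall_accuracy(per_disorder: List[Tuple[bool, float]]) -> float:
    """Average the per-disorder accuracy over true-positive disorders only."""
    scores = [accuracy for is_true_positive, accuracy in per_disorder
              if is_true_positive]
    return mean(scores) if scores else 0.0


# Task 2b style input: the third predicted disorder is a false positive,
# so its template does not enter the average.
task_2b = [(True, 1.0), (True, 0.5), (False, 0.0)]
overall = overall_accuracy(task_2b)
```

For Task 2a every entry would be flagged as a true positive, so the function reduces to a plain mean.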
Per-Slot Accuracy. Per-slot accuracies are useful in assessing the ability of a system to fill in a particular slot. For each slot, an average per-slot accuracy is defined as the accuracy in recognizing the value for that particular slot across the true-positive spans. Thus, for slot $s_k$, the per-slot accuracy is computed over the true-positive spans, where for each true-positive span $i$ there is a gold-standard value $gs_{i,k}$ and a predicted value $ps_{i,k}$ for slot $s_k$.
Disorder Span Identification. This overall metric is only meaningful for Task 2b, where the system has to identify disorders prior to filling in their templates. As in Task 1, we report an F-score metric to assess how good the system is at identifying disorder spans. Note that unlike in Task 1, this F-score does not consider CUI normalization, as this is captured through the accuracy in the template filling task. Thus, a true-positive disorder span is defined as any overlap with a gold-standard disorder span. When several predicted spans overlap with a gold-standard span, only one of them (the longest) is chosen as the true positive, and the other predicted spans are considered false positives.
(Pathak et al., 2015). For disorder span recognition, most teams used a CRF-based approach. Features explored included traditional NER features: lexical features (bag of words and bigrams, orthographic features), syntactic features derived from either part-of-speech and phrase chunking information or dependency parsing, and domain features (note type and section headers of the clinical note). Lookup in a dictionary (either UMLS or a customized lexicon of disorders) was an essential feature for performance. To further leverage these lexicons, Xu and colleagues (Xu et al., 2015), for instance, implemented a vector-space model similarity computation to known disorders as an additional feature in their approach.
The best-performing teams made use of the large unannotated corpus of clinical notes provided in the challenge (Pathak et al., 2015; Leal et al., 2015; Xu et al., 2015). Teams explored the use of Brown clusters (Brown et al., 1992) and word embeddings (Collobert et al., 2011). Pathak and colleagues (Pathak et al., 2015) note that word2vec (Mikolov et al., 2013) did not yield satisfactory results. Instead, they report better results from clustering sentences in the unannotated texts based on their sequence of part-of-speech tags, and using the clusters as a feature in the CRF.
Teams continued to explore approaches for recognizing discontiguous entities. Pathak and colleagues (Pathak et al., 2015), for instance, built a specialized SVM-based classifier for that purpose.
For CUI normalization, the best performing teams focused on augmenting existing dictionaries with lists of unambiguous abbreviations (Leal et al., 2015) or by pre-processing UMLS and breaking down existing lexical variants to account for high paraphrasing power of disorder terms (Pathak et al., 2015).

Task 2
Six teams participated in Task 2a. Evaluation metrics are reported in Figure 2. We relied on the Weighted Accuracy (WA) to rank the teams (highlighted in the Figure is F*WA, but since in Task 2a gold-standard disorders are provided, F is 1). The best system (team UTH-CCB) yielded a WA of 88.6 (Xu et al., 2015).
For Task 2b, nine teams participated. Evaluation metrics are reported in Figure 3. We relied on the combination of F-score for disorder span identification and Weighted Accuracy for template filling to rank the teams (F*WA in the figure). The best system (team UTH-CCB) yielded an F*WA of 80.8. Approaches to template filling focused on building classifiers for each attribute. Specialized lexicons of trigger terms for each attribute (e.g., a list of negation terms), along with distance to disorder spans, were helpful features. Overall, as in Task 1, a range of feature types from lexical to syntactic proved useful in the template filling task.
The per-slot accuracies (columns BL, CUI, CND, COU, GEN, NEG, SEV, SUB, and UNC in Figures 2 and 3) indicate that overall some attributes are easier to recognize than others. Body Location, perhaps not surprisingly, was the most difficult after CUI normalization, in part because it also requires a normalization to an anatomical site.

Conclusion
In this task, we introduced a new version of the ShARe corpus, with annotations of disorders and a wide set of disorder attributes. The biggest improvements in the task of disorder recognition (both span identification and CUI normalization) come from leveraging large amounts of unannotated texts and using word embeddings as additional features in the task. The detection of discontiguous disorders, however, still seems to be an open challenge for the community.
The new task of template filling (identifying nine attributes for a given disorder) was met with enthusiasm by the participating teams. We introduced a variety of evaluation metrics to capture the different aspects of the task. Different approaches show that while some attributes are harder to identify than others, overall the best-performing teams achieved excellent results.