ThinkMiners: Disorder Recognition using Conditional Random Fields and Distributional Semantics

In 2014, SemEval organized multiple challenges on natural language processing and information retrieval. One of the tasks was the analysis of clinical text. This challenge was further divided into two tasks: Task A was to extract disorder mention spans from the clinical text, and Task B was to map each disorder mention to a unique Unified Medical Language System Concept Unique Identifier. We participated in Task A and developed a clinical disorder recognition system. The proposed system uses a Conditional Random Fields based approach to recognize disorder entities. The SemEval challenge organizers manually annotated disorder entities in 298 clinical notes, of which 199 notes were used for training and 99 for development. On the test data, our system achieved an F-measure of 0.844 for entity recognition in relaxed evaluation and 0.689 in strict evaluation.


Introduction
Mining concepts from electronic medical records, such as clinical reports, discharge summaries and large numbers of doctors' notes, has become an important task for automatic analysis in the medical domain. Identification and mapping of concepts like symptoms, disorders, surgical procedures and body sites to normalized standards are usually the first steps towards understanding natural language text in the medical records.
In this paper, we describe a machine learning based disorder recognition system for Task 7A of the SemEval 2014 challenge. In Section 2 we give a background of the existing solutions to the problem. Section 3 covers our approach in detail, followed by evaluation and conclusion in Section 4 and Section 5 respectively.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/

Background
In recent times, many systems have been developed to extract clinical concepts from various types of clinical notes. Earlier natural language processing (NLP) systems relied heavily on domain knowledge, i.e. medical dictionaries. These systems include MetaMap (Aronson and Lang, 2010), HiTEX (Zeng et al., 2006), KnowledgeMap (Denny et al., 2003), MedLEE (Friedman et al., 1994), SymText (Koehler, 1994) and Mplus (Christensen et al., 2002). In the past couple of years, researchers have been exploring the use of machine learning algorithms for clinical concept detection. To promote research in this field, organizations such as ShARe/CLEF and SemEval have organized clinical NLP challenges. In CLEF 2013 (Pradhan et al., 2013), the challenge was to recognize medication-related concepts. Rule-based methods (Fan et al., 2013; Ramanan et al., 2013; Wang and Akella, 2013), machine learning based methods and hybrid methods (Xia et al., 2013; Osborne et al., 2013; Hervas et al., 2013) were developed. In this shared task, sequential labeling algorithms (e.g., Conditional Random Fields (CRF)) (Gung, 2013; Patrick et al., 2013; Bodnari et al., 2013; Zuccon et al., 2013) and machine learning methods (e.g., Support Vector Machine (SVM)) (Cogley et al., 2013) have been demonstrated to achieve promising performance when provided with a large annotated corpus for

Approach
Entity recognition has been applied in various domains such as news articles, Wikipedia, sports articles, financial reports and clinical texts. In clinical text, entities range over medical procedures, disorders, body site indicators, etc. Clinical text also presents the peculiar phenomenon of disjoint disorders/entities, which is more common in the clinical domain than in others and further complicates entity extraction from clinical notes.

Data
The data consisted of 298 notes of different clinical types, including radiology reports, discharge summaries, ECG and ECHO reports. For each note, disorder entities were annotated based on pre-defined guidelines. The data set was further divided into two, with 199 notes in the training set and 99 notes in the development set. The training set contained 5811 disorders, whereas the development set contained 5340 disorders. Figure 1 shows the distribution of the training and development sets.

Data Preprocessing
In the pre-processing step we tokenized, lemmatized and part-of-speech tagged the text using Apache cTAKES (Savova et al., 2010). Further, section and source metadata were extracted for the text in the documents.
In Named Entity Recognition (NER), when solved using machine learning, the text is typically converted to the BIO format (Beginning, Inside and Outside the entity). In the BIO representation, each word in the text is assigned one of the tags B (begin), I (inside) or O (outside) of the entity, in this case a disorder. NER thus becomes a sequence labeling problem of assigning labels to tokens. In the medical domain, the task is further complicated by the presence of disjoint disorders (<10%), which cannot be represented with the traditional BIO notation, since BIO works well only for entities whose words are consecutive. We therefore took an enhanced approach (Tang et al., 2013a) in which consecutive disorders are assigned traditional BIO tags, and for disjoint disorders we create two tag sets: a) D{B,I}, for disjoint entity words that are not shared by multiple concepts; and b) H{B,I}, for disjoint entity words that belong to more than one disjoint concept.
The following examples show the annotations of consecutive as well as disjoint disorders.
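As a rough illustration of the tagging scheme described above, the following Python sketch assigns B/I/O, DB/DI and HB/HI tags to tokens. It assumes each disorder is given as a list of token spans (one span for a consecutive disorder, several for a disjoint one). The function name `encode_tags`, the span representation and the example sentences are hypothetical and are not taken from the task data or from the system described here.

```python
from collections import Counter

def encode_tags(tokens, entities):
    """Assign extended BIO tags to tokens.

    entities: list of disorders; each disorder is a list of (start, end)
    token spans, end exclusive. One span = consecutive disorder,
    several spans = disjoint disorder. (Representation is an assumption.)
    """
    tags = ["O"] * len(tokens)

    # Count in how many disjoint disorders each token takes part.
    disjoint_count = Counter()
    for spans in entities:
        if len(spans) > 1:
            for start, end in spans:
                for i in range(start, end):
                    disjoint_count[i] += 1

    for spans in entities:
        is_disjoint = len(spans) > 1
        for start, end in spans:
            for i in range(start, end):
                if not is_disjoint:
                    prefix = ""          # plain B / I
                elif disjoint_count[i] > 1:
                    prefix = "H"         # word shared by several disjoint disorders
                else:
                    prefix = "D"         # word belonging to one disjoint disorder
                tags[i] = prefix + ("B" if i == start else "I")
    return tags

# Hypothetical consecutive disorder: "chest pain"
print(encode_tags(["severe", "chest", "pain"], [[(1, 3)]]))
# ['O', 'B', 'I']

# Hypothetical disjoint disorders sharing the word "dilated"
tokens = "the aortic root and ascending aorta are moderately dilated".split()
entities = [[(1, 3), (8, 9)], [(4, 6), (8, 9)]]
print(encode_tags(tokens, entities))
# ['O', 'DB', 'DI', 'O', 'DB', 'DI', 'O', 'O', 'HB']
```

In the second example "dilated" belongs to both disjoint disorders and therefore receives an H tag, while the remaining disorder words receive D tags.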

Sequence Labeling
We used Conditional Random Fields (CRF), a popular approach to sequence labeling tasks, with CRF++ as the implementation.
Feature set used for the learning algorithm:
• Word level features: words [-2,2], suffix and prefix.
• Discourse features: source & section. Sentences containing disorder mentions usually have similar syntactic patterns depending on the section (e.g. 'Past Medical History') and the source type (e.g. discharge summary, radiology report). To capture this, source and section metadata were provided as features.
• Distributional semantics: We used a contextual similarity based approach derived from the popular concept called NC-value (Frantzi et al., 2000).
We followed these steps to encapsulate the distributional semantics into the learning model:
- For all the disorders in the training data we created two sets of contextual words, namely context before ($CB_a^{train}$) and context after ($CA_a^{train}$). These words belong to the open classes (noun, verb, adjective, adverb) and are collected for each section ($S_j$).
- Weights are calculated for the contextual words.
- For each word in the test data we created similar sets of contextual words ($CB_a$, $CA_a$) as above.
- Two scores are calculated for each token, based on the product of the frequency of the contextual word per section $S_j$ with the weight calculated for that word in the training set.
For each section ($S_j$):
$$\text{NC-value}_B(a) = \sum_{b \in CB_a} f_a(b_{test}) \cdot weight(b_{train})$$
$$\text{NC-value}_A(a) = \sum_{b \in CA_a} f_a(b_{test}) \cdot weight(b_{train})$$
where $a$ is the candidate term, $CB_a$ is the set of context words of $a$ in a window of [-2,0], $CA_a$ is the set of context words of $a$ in a window of [0,2], $S_j$ is a section such as "Past Medical History" or "Lab Reports", $b$ is a word from $CB_a$ or $CA_a$, $f_a(b_{test})$ is the frequency of $b$ as a context word of $a$ in the test set, $weight(b_{train})$ is the weight of $b$ as a context word of a disorder in the training set, $\text{NC-value}_B(a)$ is the distributional semantic score of the contextual words before the candidate term, and $\text{NC-value}_A(a)$ is the distributional semantic score of the contextual words after the candidate term.
- Further, a similarity class (High-Sim, Med-Sim or Low-Sim) is calculated from a set of thresholds on the NC-value and assigned to the tokens.
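The steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not specify: the weight of a context word is taken to be the fraction of disorder tokens it co-occurs with (in the spirit of the NC-value weight of Frantzi et al., 2000), disorder tokens rather than whole mentions are counted, the coarse POS labels and the two thresholds are placeholders, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

OPEN_CLASS = {"NOUN", "VERB", "ADJ", "ADV"}  # assumed coarse POS labels

def context(sent, i, lo, hi):
    """Open-class words at offsets lo..hi around position i (token i excluded)."""
    words = []
    for j in range(i + lo, i + hi + 1):
        if j != i and 0 <= j < len(sent) and sent[j][1] in OPEN_CLASS:
            words.append(sent[j][0].lower())
    return words

def train_weights(train, lo, hi):
    """train: list of (section, sentence); sentence: list of (word, pos, tag).

    Assumed weight: share of disorder tokens in a section that have the
    word in their context window.
    """
    seen = defaultdict(Counter)   # section -> context word -> disorder tokens seen with it
    total = Counter()             # section -> number of disorder tokens
    for section, sent in train:
        for i, (_, _, tag) in enumerate(sent):
            if tag != "O":
                total[section] += 1
                for b in set(context(sent, i, lo, hi)):
                    seen[section][b] += 1
    return {s: {b: c / total[s] for b, c in words.items()}
            for s, words in seen.items()}

def nc_value(test, weights, lo, hi):
    """test: list of (section, sentence); sentence: list of (word, pos).

    Returns {(section, word): sum over b of f_a(b_test) * weight(b_train)}.
    """
    freq = defaultdict(Counter)   # (section, a) -> context word b -> f_a(b)
    for section, sent in test:
        for i, (word, _) in enumerate(sent):
            freq[(section, word.lower())].update(context(sent, i, lo, hi))
    return {key: sum(f * weights.get(key[0], {}).get(b, 0.0)
                     for b, f in ctx.items())
            for key, ctx in freq.items()}

def similarity_class(score, high=1.0, medium=0.5):
    """Map a score to a class; both thresholds are placeholder assumptions."""
    if score >= high:
        return "High-Sim"
    if score >= medium:
        return "Med-Sim"
    return "Low-Sim"

# Hypothetical toy data, window [-2,0] for the "before" score
train = [("Past Medical History",
          [("patient", "NOUN", "O"), ("denies", "VERB", "O"),
           ("chest", "NOUN", "B"), ("pain", "NOUN", "I"), ("today", "ADV", "O")])]
test = [("Past Medical History",
         [("she", "PRON"), ("denies", "VERB"), ("nausea", "NOUN")])]

w_before = train_weights(train, -2, 0)
scores_before = nc_value(test, w_before, -2, 0)
print(similarity_class(scores_before[("Past Medical History", "nausea")]))
```

The "after" score would be obtained the same way with the window `(0, 2)`.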
Most of the features were similar to those of previous approaches (Tang et al., 2013a; Tang et al., 2012; Jiang et al., 2011), with the addition of novel distributional semantics based features ($\text{NC-value}_B$, $\text{NC-value}_A$), which we have tried and tested for concept mining in clinical text.
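To make the feature set concrete, the sketch below writes tokens in the column format that CRF++ reads: one token per line, whitespace-separated columns with the label in the last column, and an empty line between sentences. The column order, the affix length of 3 and the function names are assumptions made for illustration. The [-2,2] word window is not materialised in the file, since in CRF++ it would be expressed in the feature template.

```python
def feature_row(word, source, section, sim_before, sim_after, tag, affix_len=3):
    """Build one CRF++ input line for a token.

    Columns (assumed order): word, prefix, suffix, source, section,
    similarity class before, similarity class after, label.
    """
    cols = [word, word[:affix_len], word[-affix_len:],
            source, section, sim_before, sim_after, tag]
    # Column values must not contain whitespace, so spaces become underscores.
    return "\t".join(c.replace(" ", "_") for c in cols)

def to_crfpp(sentences):
    """sentences: list of sentences, each a list of keyword dicts for feature_row."""
    blocks = ["\n".join(feature_row(**tok) for tok in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"

# Hypothetical example
sentence = [
    dict(word="denies", source="discharge summary", section="Past Medical History",
         sim_before="Low-Sim", sim_after="High-Sim", tag="O"),
    dict(word="pneumonia", source="discharge summary", section="Past Medical History",
         sim_before="High-Sim", sim_after="Low-Sim", tag="B"),
]
print(to_crfpp([sentence]))
```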

Evaluation
The evaluation was done in two categories: a) strict evaluation, an exact match that requires the start and end of the concept to be the same as in the gold standard data; and b) relaxed evaluation, where the start and end of the concept need not match the gold standard exactly but the two spans must overlap.
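The difference between the two categories can be sketched as follows. This is a simplified illustration, not the official scorer: spans are assumed to be contiguous `(start, end)` character offsets with an exclusive end, disjoint mentions are not handled, and the function names are hypothetical.

```python
def strict_match(pred, gold):
    """Exact match: identical start and end offsets."""
    return pred == gold

def relaxed_match(pred, gold):
    """Overlap match: the two spans share at least one position."""
    return pred[0] < gold[1] and gold[0] < pred[1]

def evaluate(pred_spans, gold_spans, match):
    """Return (precision, recall, F-measure) under the given match function."""
    matched_pred = sum(any(match(p, g) for g in gold_spans) for p in pred_spans)
    matched_gold = sum(any(match(p, g) for p in pred_spans) for g in gold_spans)
    precision = matched_pred / len(pred_spans) if pred_spans else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical spans: one exact hit, one partial overlap, one spurious prediction
gold = [(0, 10), (20, 30)]
pred = [(0, 10), (22, 35), (40, 45)]
print(evaluate(pred, gold, strict_match))
print(evaluate(pred, gold, relaxed_match))
```

With these toy spans the partially overlapping prediction counts only under relaxed matching, so the relaxed scores are at least as high as the strict ones.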
In the strict and relaxed evaluations, the best F-measure of our system was 0.689 and 0.844 respectively, obtained without the distributional semantics feature, whereas the best precision values (0.907 and 0.749) were obtained with the distributional semantics as a feature. Table 1 shows the detailed results.

Conclusion
Extraction of concepts from medical text is a fundamental task in the process of analysing patient data. In this paper we have tried a CRF based approach to mine disorder terms from clinical free text. We have tried various word-level, discourse and distributional semantic features. We observed an increase (+1.5%) in precision but a drastic fall (-4.4%) in recall when using the distributional semantic feature. Ideally this feature should improve the results, because it takes contextual information into consideration. In our opinion, inappropriate scaling of the feature values might have caused the drop. In future work we would like to investigate the use of large unlabeled data and dependency tree based context, and to carry out further experiments, such as threshold setting and feature value scaling, to obtain better results. Also, due to license issues we could not use the UMLS dictionary. Our survey of prior work suggests that an improvement of 2-3% has been observed when concepts from the dictionary are used.