SAP-RI: Twitter Sentiment Analysis in Two Days


Introduction
Microblogging platforms and social networks have become increasingly popular for expressing opinions on a wide range of topics, hence making them valuable and ever-growing logs of public sentiment. This has motivated the development of automatic natural language processing (NLP) methods to analyse the sentiment expressed in these short, informal messages (Liu, 2012; Pang and Lee, 2008).
In particular, the study of sentiment and opinions in messages from the Twitter microblogging platform has attracted a lot of interest (Jansen et al., 2009; Pak and Paroubek, 2010; Barbosa and Feng, 2010; O'Connor et al., 2010; Bifet et al., 2011). However, comparative evaluations of sentiment analysis of Twitter messages have previously been hindered by the lack of a large benchmark data set. The goal of the SemEval 2013 task 2: Sentiment Analysis in Twitter (Nakov et al., 2013) and this year's continuation in the SemEval 2014 task 9: Sentiment Analysis in Twitter (Rosenthal et al., 2014) is to close this gap by hosting a shared task competition which provides a large corpus of Twitter messages annotated with sentiment polarity labels. The task consists of two subtasks: in subtask A (contextual polarity disambiguation), participants need to predict the polarity of a given word or phrase in the context of a tweet message; in subtask B (message polarity classification), participants need to predict the dominating sentiment of the complete message. Both subtasks consider sentiment analysis to be a three-way classification problem between positive, negative, and neutral sentiment. (* The work was done during an internship at SAP.)
In this paper, we describe the submission of the SAP-RI team to the SemEval 2014 task 9. We challenged ourselves to develop a competitive sentiment analysis system within a very limited time frame: the complete system was implemented within only two days. Our system is based on supervised classification with support vector machines using lexical and dictionary-based features. It achieved an F1 score of 77.26% for contextual polarity disambiguation and 55.47% for message polarity classification. Although our scores are about 10-20% behind the top-scoring systems, we show that it is possible to develop a sentiment analysis system with reasonable accuracy via rapid prototyping in a very short amount of time.

Methods
Our system is based on supervised classification with support vector machines and a variety of lexical and dictionary-based features. From the beginning, we decided to restrict ourselves to supervised classification and to focus on the constrained system setting. Experiments with more data or semi-supervised learning would have required additional time, and the results of last year's task did not show any convincing improvements from additional unconstrained data (Nakov et al., 2013). We cast sentiment analysis as a multi-class classification problem with three classes: positive, negative, and neutral. For the features, we tried to re-implement most of the features from the NRC-Canada system (Mohammad et al., 2013), which was the best performing system in last year's task. We describe the features in the following sections.

Task A: Features
For the contextual polarity disambiguation task, we extract features from the target phrase itself and from a surrounding word window of four words before and after the target phrase. To handle negation, we append the suffix -neg to all words in a negated context. A negated context includes any word in the target phrase or context that follows a negation word, up to the next punctuation symbol.
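The negation marking described above can be sketched as follows. This is an illustrative reconstruction: the cue list and punctuation set shown here are assumptions, not the exact lists used by the system.

```python
# Hypothetical sketch of the -neg suffix scheme: every token after a negation
# cue and before the next punctuation symbol is marked as negated.
NEGATION_CUES = {"not", "no", "never", "cannot", "don't", "won't", "isn't"}
PUNCTUATION = {".", ",", ";", ":", "!", "?"}

def mark_negation(tokens):
    """Append '-neg' to tokens inside a negated context."""
    marked, negated = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negated = False              # punctuation ends the negated context
            marked.append(tok)
        elif negated:
            marked.append(tok + "-neg")  # inside a negated context
        else:
            marked.append(tok)
            if tok.lower() in NEGATION_CUES:
                negated = True           # context starts after the cue word
    return marked
```

For example, `mark_negation("i do not like cold coffee .".split())` marks `like`, `cold`, and `coffee` with the -neg suffix and stops at the period.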
• Word N-grams: all lowercased unigrams and bigrams from the target phrase and the context. We extract the lowercased full string of the target phrase as an additional feature.
• Character N-grams: lowercased character bigram and trigram prefixes and suffixes from all words in the target phrase and the context.
• Elongations: binary feature that indicates the presence of one or more words in the target phrase or context that have a letter repeated three or more times, e.g., coool.
• Emoticons: two binary features that indicate the presence of positive or negative emoticons in the target phrase or the context, respectively. Two additional binary features indicate the presence of positive or negative emoticons at the end of the target phrase or context.
• Punctuation: three count features for the number of tokens that consist only of exclamation marks, only of question marks, or a mix of exclamation and question marks, in the target phrase and context, respectively.
• Casing: two binary features that indicate the presence of at least one all upper-case word and at least one title-cased word in the target phrase or context, respectively.
• Stop words: a binary feature that indicates if all the words in the target phrase or context are stop words. If so, an additional feature indicates the number of stop words: 1, 2, 3, or more stop words.
• Length: the number of tokens in the target phrase and the context, plus a binary feature that indicates the presence of any word with more than three characters.
• Position: three binary features that indicate whether a target phrase is at the beginning, in the middle, or at the end of the tweet.
• Hashtags: all hashtags in the target phrase or the context. To handle hashtags which are formed by concatenating words, e.g., #ihatemondays, we additionally split hashtags using a simple dictionary-based approach and add each token of the segmented hashtag as an additional feature.
• Twitter user: binary feature that indicates whether the context or the target phrase contain a mention of a Twitter user.
• URL: binary feature that indicates whether the context or the target phrase contains a URL.
• Brown cluster: the word cluster index for each word in the context. Cluster indexes are obtained from the Brown word clusters of the ARK Twitter tagger (Owoputi et al., 2013).
• Sentiment lexicons: we add the following sentiment dictionary features for the target phrase and the context for four different sentiment lexicons (NRC sentiment lexicon, NRC Hashtag lexicon (Mohammad et al., 2013), MPQA sentiment lexicon (Wilson et al., 2005), and Bing Liu lexicon (Hu and Liu, 2004)):
- the count of words with a positive sentiment score.
- the sum of the sentiment scores for all words.
- the maximum non-negative sentiment score for any word.
- the sentiment score of the last word with a positive sentiment score.
We extract these features for both the target phrase and the context. For words that are marked as negated, the sign of the sentiment score is flipped. The MPQA lexicon requires part-of-speech information; we use the ARK Twitter part-of-speech tagger (Owoputi et al., 2013) to tag the input with part-of-speech tags.
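A minimal sketch of how the four per-lexicon scores above could be computed, assuming a lexicon represented as a word-to-score dictionary and tokens already carrying the -neg suffix where negated. The function name and data representation are ours, not the system's.

```python
# Illustrative computation of the four sentiment lexicon features listed above.
# Scores of tokens marked with "-neg" are sign-flipped, as described in the text.
def lexicon_features(tokens, lexicon):
    scores = []
    for tok in tokens:
        negated = tok.endswith("-neg")
        word = tok[:-4] if negated else tok   # strip the "-neg" suffix
        if word in lexicon:
            score = -lexicon[word] if negated else lexicon[word]
            scores.append(score)
    positive = [s for s in scores if s > 0]
    return {
        "count_positive": len(positive),                          # feature 1
        "sum_scores": sum(scores),                                # feature 2
        "max_score": max((s for s in scores if s >= 0), default=0.0),  # feature 3
        "last_positive": positive[-1] if positive else 0.0,       # feature 4
    }
```

For example, with a toy lexicon `{"good": 1.5, "bad": -1.0}`, the token list `["good", "bad-neg", "movie"]` yields a positive count of 2 because the negated "bad" contributes a flipped, positive score.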

Task B: Features
For the message polarity task, we extract features from the entire tweet message. The features are similar to the features for phrase polarity disambiguation. As before, we handle negation by appending the suffix -neg to all words that appear in a negated context.
• Word N-grams: all lowercased N-grams for N=1, . . . , 4 from the message. We also include "skip-grams" for each N-gram by replacing each token in the N-gram with an asterisk placeholder, e.g., the cat → * cat, the *.
• Character N-grams: lowercased character level N-grams for N=3, . . . , 5 for all the words in the message. Character N-grams do not cross word boundaries.
• Elongations: count of words in the message which have a letter repeated three or more times.
• Emoticons: similar to the contextual polarity disambiguation task: two binary features for the presence of positive or negative emoticons in the message, and two binary features that indicate the presence of positive or negative emoticons at the end of the message.
• Punctuation: similar to the contextual polarity disambiguation task: three count features for the number of tokens that consist only of exclamation marks, only of question marks, or a mix of exclamation and question marks.
• Hashtags: all hashtags in the message. We do not split concatenated hashtags.
• Casing: the count of all upper-case words in the message.
• Brown cluster: similar to the contextual polarity disambiguation task: the cluster index for each word in the message.
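The skip-gram expansion described under word N-grams above can be sketched as follows; the function name is ours, and the exact enumeration order is an assumption.

```python
# Generate the "skip-grams" of the word N-gram features: for each N-gram,
# replace each token in turn with an asterisk placeholder.
def skipgrams(tokens, n):
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        for j in range(n):
            # Replace position j of the N-gram with the "*" placeholder.
            grams.append(" ".join(gram[:j] + ["*"] + gram[j + 1:]))
    return grams
```

For the bigram example from the text, `skipgrams(["the", "cat"], 2)` produces `["* cat", "the *"]`.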

Experiment and Results
In this section, we report experimental results for our method. We used the scikit-learn Python machine learning library (Pedregosa et al., 2011) to implement the feature extraction pipeline and the support vector machine classifier. We used a linear kernel for the support vector machine and fixed the SVM hyper-parameter C to 1.0. We found that scikit-learn allowed us to implement the system faster and resulted in much more compact code than other machine learning tools we have worked with in the past. We used the official training set provided for the SemEval 2014 task to train our system and evaluated on the test set of the SemEval 2013 task, which served as development data for this year's task. Tweets in the training data that were no longer available through the Twitter API were removed from the training set. An overview of the data sets is shown in Table 1. For the evaluation, we compute precision, recall, and F1 measure for the positive, negative, and neutral sentiment tweets. Following the official evaluation metric, the overall precision, recall, and F1 measure of the system is the average of the respective measures for the positive and negative classes.
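The classification setup described above can be sketched in a few lines of scikit-learn. The toy feature dictionaries and labels below are illustrative stand-ins for the paper's much richer feature extraction; only the classifier configuration (linear kernel, C=1.0) follows the text.

```python
# Minimal sketch of the paper's setup: a linear SVM with C=1.0 over sparse
# bag-of-features vectors, built with scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy feature dictionaries (illustrative only): one binary word feature plus
# a lexicon score feature, one example per sentiment class.
train_features = [
    {"w=good": 1, "lex_sum": 2.0},
    {"w=bad": 1, "lex_sum": -1.5},
    {"w=okay": 1, "lex_sum": 0.0},
]
train_labels = ["positive", "negative", "neutral"]

# DictVectorizer maps feature dicts to a sparse matrix; LinearSVC is the
# linear-kernel SVM with the hyper-parameter C fixed to 1.0.
clf = make_pipeline(DictVectorizer(), LinearSVC(C=1.0))
clf.fit(train_features, train_labels)
pred = clf.predict([{"w=good": 1, "lex_sum": 1.8}])
```

A feature vector nearly identical to the positive training example is classified as positive; in the real system the dictionaries come from the feature extractors described in the previous sections.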
Here, we report a feature ablation study: we omitted each individual feature category from the complete feature set to determine its influence on overall performance. Table 2 summarizes the results for subtasks A and B. Surprisingly, removing many of the features does not reduce the F1 score, or even increases it, although not significantly. The most effective features are the word N-grams and the sentiment lexicons. It is interesting that the performance for the neutral class is very low for subtask A and high for subtask B. We can also see that for subtask B, our system clearly struggles with recall for the positive and negative classes.

Table 2: Experimental results for the feature ablation study. Each row shows the precision, recall, and F1 score for the positive, negative, and neutral classes, and the overall precision, recall, and F1 score, after removing the particular feature from the feature set.
For the performance of our system in the SemEval 2014 shared task, we report the official overall F1 scores of our system as released by the organizers on the official test sets in Table 3, which include out-of-domain data (Chen and Kan, 2012) and a new test set of sarcastic tweets. We also include the F1 score of the best participating system for each test set and the rank of our system among all participating systems. The results of our system were fairly robust across different domains, with the exception of messages containing sarcasm, which shows that understanding sarcasm requires a deeper and more subtle understanding of the text than is captured by a simple linear model.

Conclusion
In this paper, we have described the submission of the SAP-RI team to the SemEval 2014 task 9. We showed that it is possible to develop sentiment analysis systems via rapid prototyping with reasonable accuracy within a couple of days.