Senti.ue: Tweet Overall Sentiment Classification Approach for SemEval-2014 Task 9

This document describes the senti.ue system and its participation in the SemEval-2014 Task 9 challenge. Our system is an evolution of our prior work, which was also used in last year's edition of Sentiment Analysis in Twitter. The system maintains a supervised machine learning approach to classify the overall sentiment of a tweet, but with changes to the features and the algorithm. We use a restricted set of 47 features in subtask B and 31 features in subtask A. In constrained mode, across the five data sources, senti.ue achieved a score between 78.72 and 84.05 in subtask A, and between 55.31 and 71.39 in subtask B. In unconstrained mode, our score was slightly lower, except for one case in subtask A.


Introduction
This paper describes the approach taken by a team from the Computer Science Department of Universidade de Évora in SemEval-2014 Task 9: Sentiment Analysis in Twitter (Rosenthal et al., 2014). SemEval-2014 Task 9 comprises an expression-level (subtask A) and a message-level (subtask B) polarity classification challenge. The first subtask aims to determine whether a word (or phrase) is positive, negative or neutral within the textual context in which it appears. The second subtask concerns the classification of the overall text polarity, which corresponds to automatically detecting the sentiment expressed in a Twitter message. In both subtasks, systems can operate in constrained or unconstrained mode. Constrained means that learning is based only on the provided training texts, with the possible aid of static resources such as lexicons. Extra tweets or additional annotated documents for training are permitted only in unconstrained mode. The system we used to respond to this challenge is called senti.ue, and follows on from our previous work on Natural Language Processing (NLP) and Sentiment Analysis (SA). We developed work on automatic reputation assessment, using a Machine Learning (ML) based classifier for comments with impact on a particular target entity (Saias, 2013). We also participated in the previous edition of the SemEval SA task, where we implemented the basis for the current system. In last year's solution (Saias and Fernandes, 2013), we treated both subtasks with the same method (except for the training set). We have updated the method for subtask A, which now also considers the text around the area to classify, by dedicating new features to the tweet parts that precede and follow it.
This work is licenced under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings footer are added by the organizers. License details: http://creativecommons.org/licenses/by/4.0/
Text overall sentiment classification is the core objective of our system, and is performed, as before, with a supervised machine learning technique. For subtask B, we fixed some implementation issues in the previous version, and we went from 22 to 53 features, explained in Section 3.

Related Work
The popularity of social networks and microblogging has facilitated the sharing of opinions. Knowing whether people are satisfied or not with a particular brand or product is of great interest to marketing companies. Much work has appeared in SA, trying to capture valuable information in expressions of contentment or discontentment. Important international NLP-related scientific events include SA challenges and workshops. This was the case in SemEval-2013, whose Task 2 (Wilson et al., 2013) required sentiment analysis of Twitter and SMS text messages. As the predecessor of the challenge for which this work was developed, it is similar to this year's Task 9. The participating systems achieved better results in the contextual polarity subtask (A) than in the overall message polarity subtask (B). In that edition, the best results were obtained by systems in constrained mode. The most common method was supervised ML with features related to text words, syntactic function, discourse element relations, internet slang and symbols, or clues from sentiment lexicons. In that task, the NRC-Canada system (Mohammad et al., 2013) obtained the best performance, achieving an F1 of 88.9% in subtask A and 69% in subtask B. That system used one SVM classifier for each subtask, together with text surface based features, features associated with manually created and automatically generated sentiment lexicons, and n-gram features. Other systems with good results in that task were GU-MLT-LT (Günther and Furrer, 2013) and AVAYA (Becker et al., 2013). The first was implemented in the Python language. It includes features for text tokens after normalization, stems, word clusters, and two values for the accumulated positive and accumulated negative SentiWordNet (Baccianella et al., 2010) scores, considering negation. Its machine learning classifier is based on linear models with stochastic gradient descent.
The approach taken in the AVAYA system centers on training high-dimensional, linear classifiers with a combination of lexical and syntactic features. This system uses Bag-of-Words features, with negation represented in a word suffix, and including not only the raw word forms but also combinations with lemmas and PoS tags. Then, word polarity features are added, using the MPQA lexicon (Wiebe et al., 2005), as well as syntactic dependency and PoS tag features. Other features consider emoticons, capitalization, character repetition, and emphasis characters, such as asterisks and dashes. The resulting model was trained with the LIBLINEAR (Fan et al., 2008) classification library. Another NLP task very close to SA is polarity classification on the reputation of an entity. Here, instead of the sentiment from the perspective of the opinion holder, the goal is to detect the impact of a particular opinion on some entity's reputation.
The diue system (Saias, 2013) uses a supervised ML approach for reputation polarity classification, including Bag-of-Words and a limited set of features based on sentiment lexicons and superficial text analysis.

Method
This work follows on from our previous participation in the SemEval-2013 SA task, where we devoted greater effort to subtask B. We start by explaining our current approach for this subtask, and then describe how the same classifier is also used in subtask A.

Message Polarity Classification
The senti.ue system maintains a supervised machine learning approach to perform the overall sentiment classification. As before, Python and the Natural Language Toolkit (NLTK) are used for text processing and ML feature extraction. The first step was to obtain the tweet content and form the instances of the training set. During the download phase, several tweets were not found, so in constrained mode only 7352 instances were available for training. Tweet preprocessing includes tokenization, based on punctuation and white space, negation detection, and lemmatization through the NLTK class WordNetLemmatizer. After that, the system runs the ML component. Instead of the solution we used in 2013, with two differently configured classifiers in a pipeline, we chose to use a single classifier, which this year is based on SciKit-Learn, and to increase the number of features extracted to represent each instance. The classification algorithm was Support Vector Machines (SVM), using the SVC class with a linear kernel and a tolerance of 10⁻⁵ for the stopping criterion. The SVC implementation is based on libsvm (Chang and Lin, 2011) and uses a one-against-one approach for multi-class classification. From each instance, the system extracts the 47 features in Figure 1. The first two features represent the index of the first polarized token. The next ones represent the repeated occurrence of a question mark and the existence of a negation token (not, never). Then there are two features that indicate whether there is negation before positive or negative words. The following 8 features indicate whether there are positive or negative terms just after, or near, a question mark or an exclamation mark. We built a table with the words or phrases marked as positive or negative in the subtask A data. Using this resource, 4 features test the presence and the count of word n-grams marked as positive or negative.
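The n-gram table lookup just described can be illustrated with a minimal sketch. It is not the senti.ue implementation: the function names, the feature names (has_pos, count_pos, and so on), the maximum n-gram length of 3 and the toy table are all assumptions made for illustration, since the actual names are only given in Figure 1.

```python
def ngrams(tokens, max_n=3):
    """Yield every word n-gram (as a tuple) of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def ngram_table_features(tokens, table, max_n=3):
    """Return presence and count features for positive and negative n-grams.

    `table` maps an n-gram tuple to "positive" or "negative"
    (hypothetical layout; the paper does not describe the table format).
    """
    counts = {"positive": 0, "negative": 0}
    for gram in ngrams(tokens, max_n):
        label = table.get(gram)
        if label in counts:
            counts[label] += 1
    return {
        "has_pos": int(counts["positive"] > 0),
        "has_neg": int(counts["negative"] > 0),
        "count_pos": counts["positive"],
        "count_neg": counts["negative"],
    }


# Toy table for illustration; the real one is built from subtask A annotations.
toy_table = {
    ("good",): "positive",
    ("love",): "positive",
    ("not", "good"): "negative",
}
features = ngram_table_features("i love it but not good".split(), toy_table)
print(features)
```

Note that in this sketch overlapping n-grams are counted independently, so "good" still counts as positive inside "not good"; whether senti.ue resolves such overlaps is not stated in the text.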
The TA.alike features then represent the same presence and count information, but after lemmatization and synonym verification. To find the synonyms of a term, we used the WordNet (Princeton University, 2010) resource. The probability of each word belonging to each class was also calculated. There are 3 avgProbWordOn features, one per class, that represent the average of this probability over the words of each instance. The next 3 features represent the same, but focus only on the last 5 words of each text. Then we have 6 ProbLog2Prob features, representing the average of P × log₂(P), over all words or only the last 5 words, for each class, where P is the probability of the word belonging to that class. One feature accumulates the token polarity values according to SentiWordNet. The final 12 features are based on sentiment lexicons: AFINN (Nielsen, 2011), Bing Liu (Liu et al., 2005), MPQA, and a custom polarity table with some manually entered entries. For each resource, we count the instance tokens with negative and positive polarity, and create a direction feature with the value 1 if countTokens.pos > countTokens.neg, -1 if countTokens.pos < countTokens.neg, and 0 otherwise. For the unconstrained mode, the only difference is the use of more instances in the training set, with 3296 short texts about laptops and restaurants obtained from the SemEval-2014 Task 4 data.
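As an illustration of the word probability and lexicon direction features, the following sketch computes an average class probability, an average of P × log₂(P), and a direction value. The text does not say how the word probabilities were estimated, so the relative-frequency estimate used here is an assumption, as are all function names and the toy data.

```python
import math
from collections import Counter, defaultdict

CLASSES = ("positive", "negative", "neutral")


def word_class_probabilities(training):
    """Estimate P(class | word) by relative frequency (an assumed estimator).

    `training` is a list of (tokens, label) pairs; each word is counted
    once per instance.
    """
    counts = defaultdict(Counter)
    for tokens, label in training:
        for tok in set(tokens):
            counts[tok][label] += 1
    return {
        tok: {c: per_class[c] / sum(per_class.values()) for c in CLASSES}
        for tok, per_class in counts.items()
    }


def _known_probs(tokens, probs, cls, last_n=None):
    """Probabilities of `cls` for the known words, optionally only the last_n tokens."""
    window = tokens[-last_n:] if last_n else tokens
    return [probs[t][cls] for t in window if t in probs]


def avg_prob(tokens, probs, cls, last_n=None):
    """Average P(cls | word) over the instance words (avgProbWordOn-style)."""
    values = _known_probs(tokens, probs, cls, last_n)
    return sum(values) / len(values) if values else 0.0


def avg_p_log2_p(tokens, probs, cls, last_n=None):
    """Average of P * log2(P) over the instance words (ProbLog2Prob-style)."""
    values = _known_probs(tokens, probs, cls, last_n)
    # A zero probability contributes 0, following the usual 0 * log(0) = 0 convention.
    terms = [p * math.log2(p) if p > 0 else 0.0 for p in values]
    return sum(terms) / len(terms) if terms else 0.0


def direction(tokens, positive_words, negative_words):
    """Return 1, -1 or 0 depending on which polarity has more tokens."""
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    return (pos > neg) - (pos < neg)


toy_training = [(["good", "day"], "positive"), (["bad", "day"], "negative")]
probs = word_class_probabilities(toy_training)
print(avg_prob(["good", "day"], probs, "positive"))
print(avg_p_log2_p(["good", "day"], probs, "positive"))
print(direction(["good", "bad", "bad"], {"good"}, {"bad"}))
```

Passing last_n=5 to avg_prob or avg_p_log2_p restricts the average to the last 5 tokens, mirroring the variants that focus on the end of the text.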

Contextual Polarity Disambiguation
In this subtask, the download phase fetched only 6506 tweets. These instances have boundaries marking the substring to classify. Our system starts by splitting the document into text segments: fullText, leftText, rightText, sentenceText, chosenText. The first corresponds to the entire tweet. The next two represent the text before and the text after the chosen text. Then we have the sentence where the chosen text is, and finally the text segment that systems must classify. The preprocessing described before is then applied to each of these text segments. For each instance, the system generates the 31 features listed in Figure 2. The first of these are computed over a set of instance segments that includes sentenceText and leftText. These values represent the count of polarized tokens and the direction (1, 0, or -1, as before), according to 3 sentiment lexicons. The final 4 features hold the overall sentiment classification, using the subtask B classifier, for each of the text segments leftText, rightText, sentenceText, and chosenText. In unconstrained mode the instances used for subtask A training are the same. The difference with respect to the constrained mode is the overall sentiment classifier used for the last 4 features, which corresponds to the unconstrained classifier of subtask B. This subtask has specific features, different from those used in the previous subtask, and after some tests with SciKit-Learn classifiers we found that, in this case, our best results were not obtained with SVM. For subtask A, we chose the Gradient Boosting classifier, an ensemble method that combines the predictions of several models, configured with the deviance loss function, a learning rate of 0.1, and 100 regression estimators with an individual maximum depth of 4.
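The segment splitting step can be sketched as follows. This is only an illustration under stated assumptions: the boundaries are taken to be inclusive, zero-based indices over whitespace-separated tokens, and sentences are delimited by tokens ending in '.', '!' or '?'. Neither detail is specified in the text, and the function name split_segments is hypothetical.

```python
import re

# A token that ends with sentence-final punctuation closes a sentence (assumption).
_SENT_END = re.compile(r"[.!?]+$")


def split_segments(tweet, start, end):
    """Split a tweet into the five segments, given inclusive token indices."""
    tokens = tweet.split()

    # The sentence starts right after the last sentence-final token before the span.
    sent_start = 0
    for i in range(start - 1, -1, -1):
        if _SENT_END.search(tokens[i]):
            sent_start = i + 1
            break

    # The sentence ends at the first sentence-final token at or after the span end.
    sent_end = len(tokens) - 1
    for i in range(end, len(tokens)):
        if _SENT_END.search(tokens[i]):
            sent_end = i
            break

    return {
        "fullText": tweet,
        "leftText": " ".join(tokens[:start]),
        "rightText": " ".join(tokens[end + 1:]),
        "sentenceText": " ".join(tokens[sent_start:sent_end + 1]),
        "chosenText": " ".join(tokens[start:end + 1]),
    }


example = "Monday again. The new phone is really great but pricey! See you"
segments = split_segments(example, 6, 7)
for name, text in segments.items():
    print(f"{name}: {text}")
```

Each resulting segment would then go through the same preprocessing as a full tweet before feature extraction.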

Results
We submitted four runs, with the system output for each subtask in both constrained and unconstrained modes. Test set documents come from five sources: LiveJournal blogs (LJ'14), SMS test (SMS'13) and Twitter test (T'13) data from last year, a new Twitter collection (T'14), and 100 tweets whose text includes sarcasm (T'14s). The primary metric to evaluate the results is the average F-measure for the positive and negative classes. Table 1 shows the score obtained by our system. In constrained mode, across the five data sources, senti.ue achieved a score between 78.72 and 84.05 in subtask A, and between 55.31 and 71.39 in subtask B. Comparing constrained and unconstrained modes, the latter was always slightly lower, except for one case, in subtask A on the SMS'13 data, where the extra training data led to a 4% score improvement. In this SA challenge there were a total of 27 submissions in subtask A and 50 submissions in subtask B. Among these, the best score and the average score for each subtask are shown in Table 2. In both subtasks, our system's result is above the average score of the participating systems. In subtask A on the Twitter Sarcasm 2014 collection (T'14s), senti.ue achieved the highest score, with 82.75% in constrained mode. For each data set, Tables 3 and 4 show the per-class precision and recall of our system's result in the highest scoring mode. In subtask A, precision is between 64 and 99% for the positive and negative classes, taking the value of zero in the neutral class. For the overall sentiment subtask, precision is similar among the 3 classes, having its minimum value in the negative class of the sarcasm tweets. The best recall value was obtained in the positive
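For reference, the primary metric mentioned in this section, the average F-measure over the positive and negative classes, can be sketched as below. This is a plain reimplementation for illustration, not the official task scorer; the function names and the toy labels are assumptions.

```python
def f1_for_class(gold, predicted, cls):
    """F-measure of one class, from gold and predicted label lists."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
    fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
    fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f_pos_neg(gold, predicted):
    """Average of the positive and negative F-measures.

    The neutral class has no F-measure term of its own, but neutral
    instances still affect the precision and recall of the other classes.
    """
    return (f1_for_class(gold, predicted, "positive")
            + f1_for_class(gold, predicted, "negative")) / 2


gold = ["positive", "positive", "negative", "neutral", "negative", "neutral"]
pred = ["positive", "neutral", "negative", "negative", "negative", "positive"]
score = average_f_pos_neg(gold, pred)
print(round(score, 4))
```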

Conclusions
Building on last year's experience, we participated in SemEval-2014 Task 9 to test our approach for a real-time SA system for the English used nowadays in social media content. We changed the method for subtask A, which now also considers the text around the area to classify by dedicating new features to it, and this led to good results. Our method for overall sentiment is ML based, using a restricted set of features dedicated to superficial text properties, negation presence, and sentiment lexicons. Without a deep linguistic analysis, our system achieved a reasonable result in subtask B. The evaluation of our solution, in both subtasks, shows an appreciable improvement, of 10% or more, when compared to our results in 2013. We believe that the additional training instances used in unconstrained mode for subtask B, about laptops and restaurants, have a writing style different from most of the test set documents. This is perhaps the cause of the lower score in unconstrained mode, something that also happened with many systems in the past edition (Wilson et al., 2013). This time, we implemented the contextual polarity solution based on the subtask B classifier. Given the results, we intend to produce, in the near future, a new iteration of our system where the overall classifier will depend on (or receive features from) the current subtask A classifier. It seems to us that the senti.ue feature engineering can be improved while maintaining this line of development. Once the system is stabilized, the introduction of named entity recognition and a richer linguistic analysis will help to identify the sentiment target entities, which is the ultimate goal for this system.