CISUC-KIS: Tackling Message Polarity Classification with a Large and Diverse Set of Features



Introduction
Every day, people share their opinions on social networks and microblogging services. Identifying the sentiment conveyed in all those shared messages is of great utility for recognizing trends and supporting decision making, which is key in areas such as social marketing. Sentiment Analysis deals with the computational treatment of sentiments in natural language text, often normalized to positive or negative polarities. It is a very challenging task, not only for machines, but also for humans.
SemEval 2014 is a semantic evaluation of Natural Language Processing (NLP) that comprises several tasks. This paper describes our approach to the Sentiment Analysis in Twitter task, which comprises two subtasks: (A) Contextual Polarity Disambiguation; and (B) Message Polarity Classification. We ended up addressing only task B, which is more sentence oriented, as it targets the polarity of full messages and not of individual words in those messages.
This work is licenced under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings footer are added by the organizers. License details: http://creativecommons.org/licenses/by/4.0/
We tackled this task with a machine learning-based approach, in which we first collect several features from the analysis of the given text at several levels. The collected features are then used to learn a sentiment classification model, which can be done with different algorithms. Features were collected from several different resources, including sentiment lexicons, dictionaries and available APIs for this task. Moreover, since microblogging text has particular characteristics that increase the difficulty of NLP, we gave special attention to text pre-processing. The tested features ranged from low-level ones, such as punctuation and emoticons, to higher-level ones, including topics extracted using topic modelling techniques, as well as features from sentiment lexicons, some structured on plain words and others based on WordNet, and thus structured on word senses. Using the latter, we even explored word sense disambiguation. We tested several learning algorithms with all these features, but Support Vector Machines (SVM) led to the best results, so they were used for the final evaluation.
In all our runs, a model was learned from tweets, and no SMS messages were used for training. The model's performance was assessed with the F-Score of the positive and negative classes, with 10-fold cross-validation. In the official evaluation, we achieved the following scores: 74.46% for the LiveJournal2014 (2nd place), 65.9% for the SMS2013 (7th), 67.56% for the Twitter2013 (7th), 67.95% for the Twitter2014 (4th) and 55.49% for the Twitter2014Sarcasm (4th) datasets, which placed us among the top seven participants on every dataset.
The next section describes the external resources exploited. Section 3 presents our approach in more detail, and is followed by section 4, where the experimental results are described. Section 5 concludes with a brief assessment and the main lessons learned from our participation.

External resources
We used several external resources, including not only several sentiment lexicons, but also dictionaries that helped normalize the text of the tweets, as well as available APIs that already classify the sentiment conveyed by a piece of text.

Dictionaries
These included handcrafted dictionaries with the most common abbreviations, acronyms, emoticons and web slang used on the Internet, together with their meanings. We also used a list of regular expressions that match elongated words like 'loool' and 'loloolll', which can be normalized to 'lol', and a set of idiomatic expressions with their corresponding polarity.
All of the APIs classify a given text snippet as positive or negative. Sentiment140 returns a value which can be 0 (negative polarity), 2 (neutral), or 4 (positive). SentimentAnalyzer returns -1 (negative) or 1 (positive), and SentiStrength returns a strength value between 1 and 5 (positive) or between -1 and -5 (negative).
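The normalization of elongated words can be illustrated with a minimal sketch. The two patterns below are assumptions made for illustration, not the actual list of regular expressions used in the system: one maps 'lol' variants to 'lol', the other collapses any character repeated three or more times down to two.

```python
import re

# Hypothetical patterns, for illustration only.
# A 'lol' variant: one or more 'l', then one or more groups of 'o's followed by 'l's.
_LOL_VARIANT = re.compile(r"^l+(o+l+)+$", re.IGNORECASE)
# Any character repeated three or more times in a row.
_REPEATED = re.compile(r"(.)\1{2,}")


def normalize_elongated(token: str) -> str:
    """Map 'lol' variants to 'lol'; otherwise collapse long character runs to two."""
    if _LOL_VARIANT.match(token):
        return "lol"
    return _REPEATED.sub(r"\1\1", token)
```

Collapsing to two characters rather than one keeps legitimate double letters (as in 'good') intact, at the cost of leaving forms such as 'soo' for a later dictionary lookup.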
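Since the three services report polarity on different scales, one plausible way to use their outputs as features is to map them to a common range. The sketch below is an assumption about such a mapping, not a description of how the system consumed the API results; the function names are hypothetical.

```python
def sentiment140_to_polarity(value: int) -> int:
    """Map Sentiment140 output (0, 2, 4) to -1, 0, 1."""
    mapping = {0: -1, 2: 0, 4: 1}
    if value not in mapping:
        raise ValueError(f"unexpected Sentiment140 value: {value}")
    return mapping[value]


def sentistrength_to_polarity(value: int) -> float:
    """Scale a SentiStrength value in [-5, -1] or [1, 5] to [-1, 1]."""
    if value == 0 or abs(value) > 5:
        raise ValueError(f"unexpected SentiStrength value: {value}")
    return value / 5.0
```

SentimentAnalyzer already returns -1 or 1, so under this assumed scheme it needs no conversion.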

Approach
Our approach consisted of extracting lexical, syntactic, semantic and sentiment information from the tweets and using it in the form of features, for learning a sentiment classifier that would detect polarity in messages. This is a popular approach for these types of tasks, followed by other systems, including the winner of SemEval 2013 (Mohammad et al., 2013), where a variety of surface-form, semantic, and sentiment features was used. Our set of features for the base classifier is similar, except that we included additional features that take advantage of word sense disambiguation to get the polarity of target word senses.

Features
Among the collected features, some were related to the content of the tweets and others were obtained from the sentiment lexicons.

Content Features
The tweets were tokenized and part-of-speech (POS) tagged with the CMU ARK Twitter NLP Tool (Gimpel et al., 2011) and Stanford CoreNLP (Toutanova and Manning, 2000). Each tweet was represented as a feature vector containing the following groups of features: (i) emoticons (presence or absence, sum of all positive and negative polarities associated with each, polarity of the last emoticon of each tweet); (ii) length (total length of the tweet, average length per word, number of words per tweet); (iii) elongated words (number of words containing a character repeated more than two times); (iv) hashtags (total number of hashtags); (v) topic modelling (id of the corresponding topic); (vi) capital letters (number of words in which all letters are capitalized); (vii) negation (number of words that reverse polarity to a negative context, such as 'no' or 'never'); (viii) punctuation (number of punctuation sequences with only exclamation points, question marks or both, ASCII code of the most common punctuation and of the last punctuation in every tweet); (ix) dashes and asterisks (number of words surrounded by dashes or asterisks, such as '*yay*' or '-me-'); (x) POS (number of nouns, adjectives, adverbs, verbs and interjections).
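A few of the surface-level groups above can be sketched in a handful of lines. This is a simplified illustration that assumes whitespace tokenization instead of the CMU ARK tokenizer; the negation list and the feature names are hypothetical, and emoticon, topic and POS features are left out.

```python
import re
import string

# Illustrative negation list (an assumption, not the list used in the system).
NEGATIONS = {"no", "not", "never", "none", "cannot"}


def content_features(tweet: str) -> dict:
    """Compute a subset of the content features from a raw tweet."""
    tokens = tweet.split()  # simplification: whitespace tokenization
    # Tokens with surrounding punctuation removed, for word-level tests.
    words = [w for w in (t.strip(string.punctuation) for t in tokens) if w]
    return {
        "length": len(tweet),
        "n_words": len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        # Words with a letter repeated three or more times in a row.
        "n_elongated": sum(1 for w in words if re.search(r"([A-Za-z])\1{2,}", w)),
        "n_hashtags": sum(1 for t in tokens if t.startswith("#") and len(t) > 1),
        # Fully capitalized words of at least two letters.
        "n_all_caps": sum(1 for w in words if len(w) > 1 and w.isalpha() and w.isupper()),
        "n_negations": sum(1 for w in words if w.lower() in NEGATIONS),
        # Runs of one or more '!' and '?' characters.
        "n_punct_sequences": len(re.findall(r"[!?]+", tweet)),
        # Tokens wrapped in asterisks or dashes, such as '*yay*' or '-me-'.
        "n_emphasized": sum(1 for t in tokens if re.fullmatch(r"\*[^*]+\*|-[^-]+-", t)),
    }
```

Each value becomes one dimension of the tweet's feature vector, alongside the lexicon-based features.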

Lexicon Features
A wide range of features was created using the lexicons. For each tweet and for each lexicon, the following set of features was generated: (i) total number of positive and negative opinion words; (ii) sum of all positive/negative polarity values in the tweet; (iii) the highest positive/negative polarity value in the tweet; and (iv) the polarity value of the last polarity word. Those features were collected for: unigrams, bigrams and pairs (only on the NRC Hashtag Lexicon and Sentiment140), nouns, adjectives, verbs, interjections, hashtags, all-caps tokens (e.g., 'GO AWAY'), elongated words, and tokens surrounded by asterisks or dashes.
Different approaches were followed to get the polarity of each word from the wordnets. From SentiWordNet, we computed combined scores of all senses, with decreasing weights for lower ranked senses, as well as the scores of the first sense only, both considering: (i) positive and negative; (ii) just positive; (iii) just negative scores. Moreover, we performed word sense disambiguation using the full WordNet 3.0 to get the previous scores for the selected sense. For this purpose, we applied the Lesk Algorithm adapted to wordnets (Banerjee and Pedersen, 2002), using all the tweet's content words as the word context, and the synset words, gloss words and words in related synsets as the synset's context. Given that SentiWordNet is aligned to WordNet 3.0, after selecting the most adequate sense of the word, we could get its polarity scores. From Q-WordNet, similar scores were computed but, since it does not use a graded strength and only classifies word senses as positive or negative, there were just positive or just negative scores.
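The four per-lexicon statistics can be sketched as follows. This is a minimal sketch assuming a lexicon represented as a plain dictionary from token to signed polarity value; the function name, feature names and toy lexicon in the usage note are hypothetical.

```python
def lexicon_features(tokens, lexicon):
    """Count, sum, extreme and last polarity values for one tweet and one lexicon."""
    # Keep the non-zero polarity values in the order they occur in the tweet.
    scores = [lexicon[t] for t in tokens if lexicon.get(t, 0) != 0]
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    return {
        "n_pos": len(positive),
        "n_neg": len(negative),
        "sum_pos": sum(positive),
        "sum_neg": sum(negative),
        "max_pos": max(positive, default=0.0),
        "max_neg": min(negative, default=0.0),  # most negative value
        "last": scores[-1] if scores else 0.0,
    }
```

For instance, with a toy lexicon `{"good": 1.0, "great": 2.0, "bad": -1.5}` and the tokens of 'good but bad then great', the function finds two positive words and one negative word, and the last polarity value is that of 'great'. Restricting `tokens` to nouns, hashtags or elongated words would yield the per-category variants described above.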
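The two wordnet-based ideas can be illustrated with a short sketch under stated assumptions. The text does not give the weighting scheme for combining senses, so the harmonic weights (1/rank) below are an assumption. The sense selection is a simplified bag-of-words Lesk, not the adapted scoring of Banerjee and Pedersen. Sense identifiers and signatures in the usage note are toy data.

```python
def combined_sense_score(sense_scores):
    """Weighted average of (positive - negative) over senses ordered by rank.

    sense_scores: list of (pos, neg) tuples, most frequent sense first.
    Weights of 1/rank are an assumed scheme for "decreasing weights".
    """
    if not sense_scores:
        return 0.0
    weights = [1.0 / rank for rank in range(1, len(sense_scores) + 1)]
    weighted = sum(w * (pos - neg) for w, (pos, neg) in zip(weights, sense_scores))
    return weighted / sum(weights)


def simplified_lesk(context_words, senses):
    """Pick the sense whose signature shares the most words with the context.

    senses: list of (sense_id, signature_words) in sense-rank order, where the
    signature stands for synset words, gloss words and related-synset words.
    Ties are resolved in favour of the higher-ranked sense.
    """
    context = {w.lower() for w in context_words}
    best_id, best_overlap = None, -1
    for sense_id, signature in senses:
        overlap = len(context & {w.lower() for w in signature})
        if overlap > best_overlap:
            best_id, best_overlap = sense_id, overlap
    return best_id
```

With the toy senses `[("bank_1", ["river", "slope", "water"]), ("bank_2", ["money", "deposit", "institution"])]` and the context 'I deposit money at the bank', the second sense is selected, and its polarity scores would then be looked up in the sentiment wordnet.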

Classifier
In our final approach we used an SVM (Fan et al., 2008), which is an effective solution in high-dimensional spaces and proved to be the best learning algorithm for this task. We tested various kernels (e.g. PolyKernel, RBF) and their parameters with cross-validation on the training data. Given the results, we confirmed that the RBF kernel, computed according to equation 1, is the most effective, with C = 4 and γ = 0.0003.
\[ K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \qquad (1) \]
Since we are working on a multi-class classification problem, we implemented the "one-against-one" approach (Knerr et al., 1990), where #classes * (#classes − 1)/2 classifiers are constructed and each one is trained on data from two classes. Because SVM algorithms are not scale invariant, we scaled each attribute of our data to have µ = 0 and σ = 1, and took precautions against class imbalance.
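The building blocks named in this section (RBF kernel, per-attribute standardization and one-against-one pairing) can be sketched with the standard library alone. This is an illustration, not the SVM implementation used in the system; the function names are hypothetical, and the three class labels in the usage note are an assumption about the label set.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def rbf_kernel(x, y, gamma=0.0003):
    """RBF kernel exp(-gamma * ||x - y||^2); the default gamma is the value reported above."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def fit_scaler(rows):
    """Per-attribute mean and standard deviation, estimated on the training rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant attribute has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Scale every attribute to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def one_against_one_pairs(classes):
    """All class pairs; one binary classifier would be trained per pair."""
    return list(combinations(classes, 2))
```

For example, `one_against_one_pairs(["positive", "negative", "neutral"])` yields three pairs, matching #classes * (#classes − 1)/2 for three classes. The scaler would be fitted on the training data only and then applied unchanged to development and test data.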

Experiments
For training the SVM classifier, we used a set of 9,634 tweets with a known polarity, plus 1,281 tweets as a development set to grid search the best parameters. No SMS messages were used for training or development. For the scorer function, we used a macro-averaged F-Score of the positive and negative classes, the one made available and used by the task organizers.
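The scoring measure can be written down compactly. The sketch below is a re-implementation for illustration, not the organizers' scorer script; the label strings and function names are assumptions.

```python
def f1_for_class(gold, predicted, label):
    """F1 of a single class, computed from gold and predicted label lists."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def macro_f1_pos_neg(gold, predicted):
    """Average of the F1 scores of the positive and negative classes only."""
    return (f1_for_class(gold, predicted, "positive")
            + f1_for_class(gold, predicted, "negative")) / 2
```

Note that messages of any other class still affect the score indirectly: predicting a neutral message as positive or negative lowers the precision of that class.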

Some Results
The system scored 70.41% on the training set (using 10-fold cross-validation) and 71.03% on the development set, after training on the training set. When tested against the training set after training on that same set, we get a score of 84.32%, which could indicate a case of underfitting. Still, our classifier generalized well, given that we got a 74.46% official score on LiveJournal2014, second in that category. On the other hand, our experiments with decision trees showed that they could not generalize as well, although they achieved scores above 99% on the training set. In the SMS category, our system would benefit from a specific data set in the training phase. Yet, it still managed to reach 7th place in that category. In the sarcasm category our submission ranked 4th, with a score of 55.49%, 2.69 percentage points below the best-ranked system. On the Twitter2014 dataset, we scored 67.95% (4th), which is slightly below our prediction based on development tests. A possible explanation is that we might have over-fitted the classifier parameters when grid searching.

Features Relevance
To get some insight into the most relevant groups of features, we ran a series of experiments in which each group of features was removed before classification, and the resulting score was compared against the original one. We concluded that the lexicon-related features contribute highly to the performance of our system, including the set of features with n-grams and POS. Clusters, sport score, asterisks and elongated words provide little gain but, on the other hand, emoticons and hashtags showed some importance and provided enough new information for the system to learn. The API information is largely captured by some of our features, which makes it much less discriminating than it would be on its own, but it is still worth using for the small gain. We also observed that it is best to create a diversified set of lexicon features with extra, very specific targeted features, such as punctuation, instead of focusing on a specific lexicon alone. Even though the lexicons usually overlap in information and may perform worse individually than a hand-refined single-dictionary approach, they complement each other, and that results in larger gains.
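The leave-one-group-out procedure described above can be sketched as a small loop. The `evaluate` callable is a hypothetical stand-in for training the classifier on the given feature groups and returning its score; the group names and scores in the usage note are toy values, not results from our experiments.

```python
def ablation_study(feature_groups, evaluate):
    """Score drop observed when each feature group is removed in turn.

    feature_groups: iterable of group names.
    evaluate: callable taking a set of group names and returning a score
              (hypothetical; in practice it would train and test a classifier).
    """
    all_groups = set(feature_groups)
    baseline = evaluate(all_groups)
    return {group: baseline - evaluate(all_groups - {group}) for group in all_groups}
```

With a toy scorer such as `lambda groups: 0.5 + sum({"lexicon": 0.10, "emoticons": 0.02, "asterisks": 0.0}[g] for g in groups)`, the largest drop is reported for the 'lexicon' group and no drop for 'asterisks'. A positive drop means the group helps; a value near zero suggests its information is already captured by other groups.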

Selected Parameters
For the parameter values, we did a grid search using the development set as the test set. We found that large values of C (25) with small γ values (0.0001) performed worse than smaller values of C (4) with a slightly higher γ (0.0003) on the development set, but not on the training set under K-fold cross-validation. For the official evaluation, we opted for the values that performed best on the development set. Intermediate values gave worse results in either case.

Concluding Remarks
We have described the work developed for subtask B of the SemEval 2014 Sentiment Analysis in Twitter task. We followed a machine learning approach, with a diversified set of features, which tend to complement each other. One of the main takeaways is that the most important features are the lexicon-related ones, including the n-grams and POS tags. Due to time constraints, we could not draw strong conclusions on the impact of the word sense disambiguation features alone. As those are probably the most differentiating features of our classifier, this is something we wish to target in the future.
To conclude, we have achieved very interesting results in terms of overall classification. Considering that this was our first participation in such an evaluation, our overall assessment is very positive. We are looking forward to upcoming editions of this task.