CMUQ-Hybrid: Sentiment Classification By Feature Engineering and Parameter Tuning



Introduction
With the proliferation of Web 2.0, people increasingly express and share their opinions through social media. For instance, microblogging websites such as Twitter are becoming a very popular communication tool. An analysis of this platform reveals a large number of messages in which users express opinions and sentiments on different topics and aspects of life. This makes Twitter a valuable source of subjective and opinionated text that can be used in NLP research on sentiment analysis. Many approaches for detecting subjectivity and determining the polarity of opinions in Twitter have been proposed (Pang and Lee, 2008; Davidov et al., 2010; Pak and Paroubek, 2010; Tang et al., 2014). For instance, the Twitter sentiment analysis shared task (Nakov et al., 2013) is an interesting testbed for developing and evaluating sentiment analysis systems on social media text. Participants are asked to implement a system capable of determining whether a given tweet expresses positive, negative or neutral sentiment. In this paper, we describe the CMUQ-Hybrid system we developed to participate in the two subtasks of SemEval 2014 Task 9 (Rosenthal et al., 2014). Our system uses an SVM classifier with a rich set of features and a parameter optimization framework.

Data Preprocessing
Working with tweets presents several challenges for NLP that differ from those encountered with more traditional texts, such as newswire data. Tweets usually contain many kinds of orthographic and typographical irregularities: special and decorative characters, letter duplication (generally used for emphasis), word duplication, creative spelling and punctuation, URLs, #hashtags, slang and special abbreviations. Hence, before building our classifier, we start with a preprocessing step that normalizes the data. All letters are converted to lower case and all words are reduced to their root form using the WordNet Lemmatizer in NLTK (Bird et al., 2009). We kept only some punctuation marks: periods, commas, semi-colons, and question and exclamation marks. Excluding the remaining characters was found to improve performance, as determined with the best-first branch and bound technique described in Section 3.
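The normalization steps above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `preprocess`, the constant `KEPT_PUNCTUATION`, and the choice to split on whitespace and keep alphanumeric characters are assumptions. The lemmatizer is passed in as a function so the sketch runs without external data.

```python
import re

# Punctuation marks the text reports keeping; all other symbols are dropped.
KEPT_PUNCTUATION = ".,;?!"

def preprocess(text, lemmatize=lambda word: word):
    """Lower-case the text, drop excluded characters, and lemmatize each token.

    `lemmatize` defaults to the identity function (an assumption made so the
    sketch is self-contained); a real lemmatizer can be supplied instead.
    """
    text = text.lower()
    # Keep word characters, whitespace, and the retained punctuation marks.
    pattern = r"[^\w\s" + re.escape(KEPT_PUNCTUATION) + r"]"
    text = re.sub(pattern, "", text)
    return [lemmatize(token) for token in text.split()]
```

To approximate the setup described in the text, `nltk.stem.WordNetLemmatizer().lemmatize` can be passed as the `lemmatize` argument, which requires the WordNet corpus to be installed for NLTK.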

Feature Extraction
Out of a wide variety of features, we selected the most effective ones using the best-first branch and bound method (Neapolitan, 2014), a search tree technique for solving optimization problems. We used this technique both to determine which punctuation marks to keep in the preprocessing step and to select features. In the feature selection step, the root node is represented by a bag of words feature, referred to as textual tokens.
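One possible reading of this search is a greedy forward selection that starts from the textual tokens feature and repeatedly expands only the best-scoring successor. The sketch below follows that reading. The function name, the `score` callback (standing in for training a classifier and measuring the F-score), and the stopping rule are assumptions, not details taken from the system.

```python
def select_features(candidates, score, start=("textual_tokens",)):
    """Greedy best-first search over feature sets.

    `score` maps a list of feature names to a number (higher is better).
    The search stops when no successor improves on the current node,
    which is an assumed termination rule.
    """
    selected = list(start)
    best = score(selected)
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        # Expand the current node: one successor per unselected feature.
        scored = [(score(selected + [c]), c) for c in remaining]
        top_score, top_feature = max(scored)
        if top_score <= best:
            break  # no successor beats the current node
        selected.append(top_feature)
        best = top_score
        remaining.remove(top_feature)
    return selected, best
```

The features that are not chosen at one level stay in `remaining`, so they are reconsidered as successors at the next level.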
At each level of the tree, we consider a set of candidate features and iterate the following steps. We process the current feature by generating its successors, which are all the other features. We then rank the features by F-score, process only the best one and prune the rest. All pruned features are passed as successors to the next level of the tree. The process iterates until all partial solutions in the tree are processed or terminated. The selected features are the following:
Sentiment lexicons: we used the Bing Liu Lexicon (Hu and Liu, 2004), the MPQA Subjectivity Lexicon (Wilson et al., 2005), and the NRC Hashtag Sentiment Lexicon (Mohammad et al., 2013). We count the number of words in each class, resulting in three features: (a) positive words count, (b) negative words count and (c) neutral words count.
Negative presence: presence of negative words in a term/tweet using a list of negative words. The list used is built from the Bing Liu Lexicon (Hu and Liu, 2004).
Textual tokens: the target term/tweet is segmented into tokens on whitespace. A token identity feature is created for each token and assigned the value 1.
Overall polarity score: we determine the polarity scores of words in a target term/tweet using the Sentiment140 Lexicon (Mohammad et al., 2013) and the SentiWordNet lexicon (Baccianella et al., 2010). The overall score is computed by adding up all word scores.
Level of association: indicates whether the overall polarity score of a term is greater than 0.2 or not. The threshold value was optimized on the development set.
Sentiment frequency: indicates the most frequent word sentiment in the tweet. We determine the sentiment of words using an automatically generated lexicon. The lexicon comprises 3,247 words and their sentiments. Words were obtained from the provided training set for task-A and sentiments were generated using our expression-level classifier.
We used slightly different features for Task-A and Task-B. The features extracted for each task are summarized in Table 1.
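The lexicon-based features listed above can be illustrated with a small sketch. The lexicon entries and scores below are invented toy values, not entries from the Bing Liu, MPQA, NRC, Sentiment140 or SentiWordNet lexicons. The feature names are hypothetical, and treating every word outside the positive and negative sets as neutral is an assumption. Only the 0.2 threshold comes from the text.

```python
# Toy stand-ins for the real lexicons (invented for illustration only).
POSITIVE = {"good", "fun", "awesome"}
NEGATIVE = {"bad", "never", "tired"}
POLARITY = {"good": 0.6, "fun": 0.5, "awesome": 0.8,
            "bad": -0.7, "never": -0.3, "tired": -0.4}
ASSOCIATION_THRESHOLD = 0.2  # threshold reported in the text

def extract_features(tokens):
    """Build a feature dictionary for one tokenized term or tweet."""
    # Textual tokens: one identity feature per token, with value 1.
    features = {"token=" + t: 1 for t in tokens}
    positive = sum(t in POSITIVE for t in tokens)
    negative = sum(t in NEGATIVE for t in tokens)
    features["positive_count"] = positive
    features["negative_count"] = negative
    # Assumption: words in neither set are counted as neutral.
    features["neutral_count"] = len(tokens) - positive - negative
    features["negative_presence"] = int(negative > 0)
    # Overall polarity score: sum of the per-word scores.
    score = sum(POLARITY.get(t, 0.0) for t in tokens)
    features["polarity_score"] = score
    features["level_of_association"] = int(score > ASSOCIATION_THRESHOLD)
    return features
```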

Modeling Kernel Functions
Initially, we experimented with both logistic regression and the Support Vector Machine (SVM) (Fan et al., 2008), using the Stochastic Gradient Descent (SGD) algorithm for parameter optimization. In our development experiments, the SVM outperformed logistic regression and became our sole classifier. We used the LIBSVM package (Chang and Lin, 2011) to train and test our classifier.
The SVM kernel function and its associated parameters were optimized for the best F-score on the development set. To avoid overfitting, we accept the optimal parameter value only if the F-scores of its nearest neighboring values change smoothly, that is, without large gaps. Otherwise, the search continues with the second-best value.
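The neighbor check described above can be sketched as follows, under stated assumptions: the parameter grid is ordered, "smooth" is interpreted as the F-scores of adjacent grid points lying within a tolerance, and the tolerance `max_gap` is a hypothetical knob that does not appear in the text.

```python
def pick_stable_parameter(values, f_scores, max_gap=1.0):
    """Return the best-scoring value whose grid neighbors score similarly.

    `values` is an ordered parameter grid and `f_scores` holds the
    development-set F-score for each grid point. `max_gap` is an assumed
    tolerance on the F-score difference to adjacent grid points.
    """
    # Visit candidates from the highest F-score to the lowest.
    order = sorted(range(len(values)), key=lambda i: f_scores[i], reverse=True)
    for i in order:
        neighbors = [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
        if all(abs(f_scores[i] - f_scores[j]) <= max_gap for j in neighbors):
            return values[i]
    # No candidate passed the check: fall back to the raw optimum.
    return values[order[0]]
```

In this sketch an isolated spike in the F-score is skipped in favor of a slightly lower value that sits on a plateau.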
In machine learning, the difference between the number of training samples, m, and the number of features, n, is crucial when selecting an SVM kernel function. The Gaussian kernel is suggested when m is slightly larger than n. Otherwise, the linear kernel is recommended. In Task-B, the n : m ratio was 1 : 3, indicating a large difference between the two numbers, whereas in Task-A a ratio of 5 : 2 indicated a small difference. We adopted the kernel types suggested by this heuristic after verifying experimentally that they gave the best F-score.
We used a radial basis function (RBF) kernel for the expression-level task, with its gamma parameter set to 0.319. For the message-level task, we used a linear kernel, with the cost parameter set to 0.053.
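For reference, the two kernel functions can be written out directly. This is a plain-Python sketch of the standard definitions, with the gamma value from the text as the default. The function names are illustrative. The cost parameter is not part of either kernel: it weights the misclassification penalty in the SVM training objective.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: the dot product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.319):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```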

Experiments and Results
In this section, we describe the data and the several experiments we conducted for both tasks. We train and evaluate our classifier with the training, development and testing datasets provided for the SemEval 2014 shared task. A short summary of the data distribution is shown in Table 2.

Dataset
Table 2: Data distribution. Columns: Positive, Negative, Neutral.
The test dataset is composed of five different sets: Twitter2013, a set of tweets collected for the SemEval 2013 test set; Twitter2014, tweets collected for this year's edition; LiveJournal2014, consisting of formal tweets; SMS2013, a collection of SMS messages; and TwitterSarcasm, a collection of sarcastic tweets.

Task-A
For this task, we trained our classifier on 10,586 terms (9,451 terms in the training set and 1,135 in the development set), tuned it on 4,435 terms, and evaluated it on 10,681 terms. The average F-score of the positive and negative classes for each dataset is given in the first part of Table 3. The best F-score, 88.94, is achieved on the Twitter2013 dataset.
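The evaluation measure mentioned above, the average of the F-scores of the positive and negative classes, can be computed as in the following sketch. The function names and the label strings are assumptions made for illustration.

```python
def f1(gold, predicted, label):
    """F-score of one class from parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def average_f_score(gold, predicted):
    """Average of the positive-class and negative-class F-scores."""
    return (f1(gold, predicted, "positive") + f1(gold, predicted, "negative")) / 2
```

The neutral class has no F-score term of its own here. It affects the result only when neutral instances are predicted as positive or negative, or the reverse.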
The ablation study in the second part of Table 3 shows that all the selected features contribute to our system's performance. Other than the textual tokens feature, which refers to a bag of preprocessed tokens, the study highlights the role of the term polarity score feature: removing it changes the F-score by −4.20 on the TwitterSarcasm dataset.
We also conducted a feature correlation analysis, in which we grouped features with similar intuitions. Namely, the features negative presence and negative words count are grouped as "negative features", and the features positive words count and negative words count are grouped as "words count". Table 4 shows the effect on the F-score of removing each group from the feature set, as well as the F-score after removing each individual feature within the group. This helps us see whether features within a group are redundant. For the Twitter2014 dataset, we notice that excluding one of the features in either group leads to a significant drop compared to the total drop for its group. The uncorrelated contributions of features within the same group indicate that the features are not redundant and that they indeed capture different information. However, in the case of the TwitterSarcasm dataset, we observe that the negative presence feature not only fails to contribute to the system's performance but also adds noise to the feature space, specifically to the negative words count feature.
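The ablation and group-level analyses follow the same pattern: remove a feature or a group, re-evaluate, and record the change from the full system. A minimal sketch is given below. The `evaluate` callback stands in for retraining and scoring the classifier, and all names are hypothetical.

```python
def ablation(features, groups, evaluate):
    """Return the baseline score and the score change for each removal.

    `features` is a list of feature names, `groups` maps a group name to
    the set of features it contains, and `evaluate` maps a feature list to
    a score (a stand-in for training and testing the classifier).
    """
    baseline = evaluate(features)
    deltas = {}
    # Remove one feature at a time.
    for feature in features:
        kept = [f for f in features if f != feature]
        deltas[feature] = evaluate(kept) - baseline
    # Remove each group of related features as a whole.
    for name, members in groups.items():
        kept = [f for f in features if f not in members]
        deltas[name] = evaluate(kept) - baseline
    return baseline, deltas
```

Comparing a group's delta with the deltas of its members gives the kind of redundancy check described in the text.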

Task-B
For this task, we trained our classifier on 11,338 tweets (9,684 tweets in the training set and 1,654 in the development set), tuned it on 3,813 tweets, and evaluated it on 8,987 tweets. Results for different feature configurations are reported in Table 5.
Apart from the textual tokens feature, all datasets benefit the most from the polarity score feature. Interestingly, the bag of words, referred to as textual tokens, does not help on one of the datasets, the TwitterSarcasm set. For every dataset, performance could be improved by removing some feature, although the feature differs from one dataset to another.
In Table 5, we observe that the negative presence feature decreases the F-score on the TwitterSarcasm dataset. This could be explained by the fact that, in sarcastic messages, negative sentiment is usually not conveyed through negative words. For example, the tweet "Such a fun Saturday catching up on hw.", which has a negative sentiment, is classified as positive because it contains no negative words. Table 5 also shows that the textual tokens feature increases the classifier's performance by up to +21.07 for some datasets. However, using a large number of features relative to the number of training samples could increase data sparseness and lower the classifier's performance.
Table 4: Task-A feature correlation analysis. We grouped features with similar intuitions and calculated F-scores on each set, along with the effect of removing one feature at a time.
We conducted a post-competition experiment to examine the relationship between the number of features and the number of training samples. We fixed the size of our training dataset. Then, we compared the performance of our classifier using only the bag of tokens feature, in two different sizes. In the first experiment, we included all tokens collected from all tweets. In the second, we only considered the top 20 ranked tokens from each tweet. Tokens were ranked according to the difference between their highest level of association with one of the sentiments and the sum of the rest. The levels of association for tokens were determined using the Sentiment140 and SentiWordNet lexicons. The threshold number of tokens was identified empirically for best performance. We found that the classifier's performance improved by 2 F-score points when the bag of tokens is smaller. The experiment indicates that the contribution of the bag of words feature can be increased by reducing the size of the vocabulary list.
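The token ranking used in this experiment can be sketched as follows. The lexicon format (a dictionary from token to per-sentiment association levels), the function names, and the decisions to skip tokens missing from the lexicon and to break ties alphabetically are assumptions. The text specifies only the ranking criterion and the cut-off of 20 tokens.

```python
def rank_score(associations):
    """Highest association level minus the sum of the remaining levels."""
    levels = sorted(associations.values(), reverse=True)
    return levels[0] - sum(levels[1:])

def top_tokens(tokens, lexicon, k=20):
    """Keep the k highest-ranked distinct tokens of one tweet.

    `lexicon` maps a token to a dict of per-sentiment association levels.
    Tokens absent from the lexicon are skipped (an assumption).
    """
    scored = [(rank_score(lexicon[t]), t) for t in set(tokens) if t in lexicon]
    # Sort by descending score, then alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [token for _, token in scored[:k]]
```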

Error Analysis
Our efforts were mostly directed towards Task-A, hence our inspection and analysis focus on Task-A. The error rates for the positive, negative and neutral classes are 6.8%, 14.9% and 93.8%, respectively. The highest error rate, 93.8% for the neutral class, is mainly due to the small number of neutral examples in the training data (only 5% of the data). The system could not learn from such a small set of neutral examples.
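A per-class error rate of this kind is the share of instances of a gold class that receive any other label. The following sketch computes it from parallel label lists. The function name and the percentage scaling are illustrative choices.

```python
from collections import Counter

def error_rates(gold, predicted):
    """Percentage of misclassified instances for each gold class."""
    totals = Counter(gold)
    errors = Counter(g for g, p in zip(gold, predicted) if g != p)
    return {label: 100.0 * errors[label] / totals[label] for label in totals}
```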
For the negative class, the error rate is 14.9%, and most of the misclassified instances were labeled positive. An example of such a misclassification is: "I knew it was too good to be true OTL." Because our system relies heavily on lexicons, the phrase "too good to be true" is assigned a positive polarity: the positive words "good" and "true" dominate.
Lastly, the positive class has a relatively lower error rate, 6.8%, and most of its errors were classified as negative. An example of such a misclassification is: "Looks like we're getting the heaviest snowfall in five years tomorrow. Awesome. I'll never get tired of winter." Although the phrase carries a positive sentiment, the individual negative words "never" and "tired" again dominate the phrase.

Conclusion
We described our systems for the Twitter Sentiment Analysis shared task. We participated in both tasks, but were mostly focused on Task-A. Our hybrid system was assembled by integrating a rich set of lexical features into a framework of feature selection and parameter tuning. The polarity score feature was the most important feature for our model in both tasks. The F-score results were consistent across all datasets, except the TwitterSarcasm dataset. This indicates that the feature selection and parameter tuning steps were effective in generalizing the model to unseen data.
Table 5: Task-B feature ablation study. F-scores calculated on each set, along with the effect of removing one feature at a time.