SentiKLUE: Updating a Polarity Classifier in 48 Hours

SentiKLUE is an update of the KLUE polarity classifier – which achieved good and robust results in SemEval-2013 with a simple feature set – implemented in 48 hours.

This work is licensed under a Creative Commons Attribution 4.0 International License. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/
system (KLUE) based on a maximum entropy classifier and a small set of features (Proisl et al., 2013). Despite its simplicity, KLUE performed very well in subtask B, ranking 5th out of 36 constrained systems on the Twitter data and 3rd out of 28 on the SMS data. Results for contextual polarity disambiguation (subtask A) were less encouraging, with rank 14 out of 21 constrained systems on the Twitter data and rank 12 out of 19 on the SMS data.
This paper describes our efforts to bring the KLUE system up to date within a period of 48 hours. The results obtained by the new SentiKLUE system are summarised in Table 1, showing that the update was successful. The ranking of the system has improved substantially in subtask A, making it one of the best-performing systems in the shared task. Rankings in subtask B are similar to those of the previous year, showing that SentiKLUE has kept up with recent developments. Moreover, differences to the best-performing systems are much smaller than in SemEval-2013.

Updating the KLUE polarity classifier
The KLUE polarity classifier is described in detail by Proisl et al. (2013). It used the following features as input for a maximum entropy classifier:
• The AFINN sentiment lexicon (Nielsen, 2011), which provides numeric polarity scores ranging from −5 to +5 for 2,476 English word forms, extended with distributionally similar words. For each input message, the number of positive and negative words as well as their average polarity score were computed.
• Emoticons and Internet slang expressions that were manually classified as positive, negative or neutral. Features were generated in the same way as for the sentiment lexicon.
• A bag-of-words representation that generates a separate feature for each word form that occurs in at least 5 different messages (f ≥ 5). Only single words (unigrams) were used, since experiments with additional bigram features did not lead to a clear improvement.
• A negation heuristic, which inverts the polarity score of the first sentiment word within 4 tokens after a negation marker. In the bag-of-words representation, the next 3 tokens after a negation marker are prefixed with not_.
• For subtask A, these features were computed both for the marked word or phrase and for the rest of the message.
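The lexicon features and the negation heuristic listed above can be illustrated with a minimal Python sketch. It is not the original KLUE code: the toy lexicon, the short negation marker list and the function names are hypothetical, and it assumes that counts are taken after polarity inversion and that the negation marker itself stays in the bag of words unprefixed.

```python
from typing import Dict, List

# Hypothetical toy resources for illustration only; KLUE uses the extended
# AFINN lexicon and its own list of negation markers.
TOY_LEXICON: Dict[str, float] = {"good": 3.0, "great": 3.0, "bad": -3.0, "awful": -3.0}
TOY_NEGATION_MARKERS = {"not", "don't", "isn't"}


def lexicon_features(tokens: List[str], window: int = 4) -> Dict[str, float]:
    """Count positive and negative words and average their polarity scores.

    The score of the first sentiment word within `window` tokens after a
    negation marker is inverted before counting (an assumption of this sketch).
    """
    scores: List[float] = []
    remaining = 0  # tokens left in which an inversion may still be triggered
    for token in tokens:
        word = token.lower()
        if word in TOY_NEGATION_MARKERS:
            remaining = window
            continue
        score = TOY_LEXICON.get(word)
        if score is not None:
            if remaining > 0:
                score = -score
                remaining = 0  # only the first sentiment word is inverted
            scores.append(score)
        if remaining > 0:
            remaining -= 1
    return {
        "n_pos": sum(1 for s in scores if s > 0),
        "n_neg": sum(1 for s in scores if s < 0),
        "avg_polarity": sum(scores) / len(scores) if scores else 0.0,
    }


def negated_bag_of_words(tokens: List[str], span: int = 3) -> List[str]:
    """Prefix the next `span` tokens after a negation marker with not_."""
    result: List[str] = []
    remaining = 0
    for token in tokens:
        word = token.lower()
        if word in TOY_NEGATION_MARKERS:
            result.append(word)  # assumption: the marker itself is kept as is
            remaining = span
        elif remaining > 0:
            result.append("not_" + word)
            remaining -= 1
        else:
            result.append(word)
    return result
```

For the message "this is not a very good movie", the sketch inverts the score of "good" (third token after the marker), giving no positive word, one negative word and an average polarity of −3.0; the bag of words becomes this, is, not, not_a, not_very, not_good, movie.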
In order to improve the KLUE classifier, we drew inspiration from two other systems participating in the SemEval-2013 task: NRC-Canada (Mohammad et al., 2013), which won the task by a large margin over competing systems, and GU-MLT-LT (Günther and Furrer, 2013), which used similar features to our classifier, but obtained better results due to careful selection and tuning of the machine learning algorithm.
Mohammad et al. (2013) used a huge set of features, including several sentiment lexica (both manually and automatically created), word n-grams (up to 4-grams with low frequency threshold), character n-grams (3-grams to 5-grams), Twitter-derived word clusters and a negation heuristic similar to our approach. Features with the largest impact in subtask B were sentiment lexica (esp. large automatically generated word lists), word n-grams, character n-grams and the negation heuristic, in this order. NRC-Canada achieved F-scores of 68.46 (SMS) and 69.02 (Twitter) in task B, as well as 88.00 (SMS) and 88.93 (Twitter) in task A.
Günther and Furrer (2013) claim that state-of-the-art results can be obtained with a small feature set if a suitable machine learning algorithm is chosen. They used stochastic gradient descent (SGD) and tuned its parameters by grid search. GU-MLT-LT achieved scores of 62.15 (SMS) and 65.27 (Twitter) in task B, as well as 88.37 (SMS) and 85.19 (Twitter) in task A.
We therefore decided to make use of a wider range of sentiment lexica, extend the bag-of-words representation to bigrams, implement character n-gram features, and experiment with different machine learning algorithms, resulting in the SentiKLUE system described in the following section.

The SentiKLUE system
SentiKLUE is an improved version of the KLUE system and uses the same tokenisation, preprocessing and negation heuristics; see Proisl et al. (2013) for details. The features described below are used as input for a machine learning classifier that predicts the polarity categories positive (pos), negative (neg) or neutral (ntr). As in KLUE and GU-MLT-LT, the implementations of the Python library scikit-learn (Pedregosa et al., 2011) are used. We tested four different learning algorithms: logistic regression (MaxEnt), stochastic gradient descent (SGD), linear SVM (LinSVM) and SVM with an RBF kernel (SVM). Parameters were tuned by grid search and the best-performing algorithm was chosen for each subtask. SentiKLUE makes use of the following features:
• Several sentiment lexica, which are treated as lists of positive and negative polarity words. Numerical scores are converted by setting appropriate cutoff thresholds. For each lexicon, we compute the number of positive and negative words occurring in a message as features, with separate counts for negated and non-negated contexts.
  - AFINN (Nielsen, 2011)
  - Bing Liu lexicon (Hu and Liu, 2004)
• Word form unigrams and bigrams. After some experimentation, the document frequency threshold was set to f ≥ 5 for subtask B and f ≥ 2 for subtask A.
• In order to include information from character n-grams, we used a Perl implementation of n-gram language models (Evert, 2008) that has already been applied successfully to text categorization tasks (boilerplate detection in the CLEANEVAL 2007 competition). We trained three separate models on positive, negative and neutral messages. We selected a 5-gram model (n = 5) with strong smoothing (q = 0.7), which minimized cross-entropy on the training data (measured by cross-validation). For each message in the training and test data, three features were generated, specifying per-character cross-entropy for each of the three n-gram models (footnote 8).
• Counts of positive and negative emoticons using the same lists as in the KLUE system.
• The same negation heuristic as in KLUE (footnote 9).
6 http://www.umiacs.umd.edu/~saif/WebPages/Abstracts/NRC-SentimentAnalysis.htm
7 ibid.
8 Note that these features had to be generated by cross-validation on the training data to avoid catastrophic overfitting.
9 The full list of negation markers is not, don't, doesn't, won't, can't, mustn't, isn't, aren't, wasn't, weren't, couldn't, shouldn't, wouldn't. To our surprise, including further negation markers such as none, ain't or hasn't led to a decrease in classification quality.
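The character n-gram features can be sketched in Python as follows. This is not the Perl implementation used for SentiKLUE: the sketch substitutes simple add-k smoothing for the smoothing scheme of Evert (2008), so the parameter k has no relation to q = 0.7, and the class name, function names and the nominal alphabet size of 256 are assumptions made for illustration. The second function reflects the point of footnote 8, namely that features for training messages are computed by models that have not seen the message.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List

CLASSES = ("pos", "neg", "ntr")


class CharNgramModel:
    """Character n-gram model with add-k smoothing (a simple stand-in)."""

    def __init__(self, n: int = 5, k: float = 0.5, vocab_size: int = 256):
        self.n, self.k, self.vocab_size = n, k, vocab_size
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.alphabet: set = set()

    def _pad(self, text: str) -> str:
        # Start and end markers so that every character has a full context.
        return "\x02" * (self.n - 1) + text + "\x03"

    def fit(self, texts: Iterable[str]) -> "CharNgramModel":
        for text in texts:
            padded = self._pad(text)
            self.alphabet.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1
        return self

    def cross_entropy(self, text: str) -> float:
        """Average negative log2 probability per predicted character."""
        padded = self._pad(text)
        vocab = max(self.vocab_size, len(self.alphabet) + 1)
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            numerator = self.ngrams[context + padded[i]] + self.k
            denominator = self.contexts[context] + self.k * vocab
            total -= math.log2(numerator / denominator)
        return total / (len(padded) - (self.n - 1))


def cross_entropy_features(message: str, models: Dict[str, CharNgramModel]) -> List[float]:
    """Three features: per-character cross-entropy under each class model."""
    return [models[label].cross_entropy(message) for label in CLASSES]


def crossval_features(messages: List[str], labels: List[str],
                      n_folds: int = 5, n: int = 5) -> List[List[float]]:
    """Features for training messages, each computed by held-out models."""
    features: List[List[float]] = [[] for _ in messages]
    for fold in range(n_folds):
        train_idx = [i for i in range(len(messages)) if i % n_folds != fold]
        models = {
            label: CharNgramModel(n).fit(
                messages[i] for i in train_idx if labels[i] == label)
            for label in CLASSES
        }
        for i in range(fold, len(messages), n_folds):
            features[i] = cross_entropy_features(messages[i], models)
    return features
```

A message that resembles the positive training data receives a lower cross-entropy under the positive model than under the other two, which is the signal the classifier can exploit. Test messages would be scored with models trained on the full training set.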
For subtask A, we chose a simplistic strategy and computed the same set of features for the marked word or phrase instead of the entire message. In order to take context into account, the three class probabilities assigned to the complete message by a MaxEnt classifier were included as additional features. No other features describing the context of the marked expression were used.
Optionally, features were standardized and prior class weights (2× for positive, 4× for negative) were used in order to balance the predicted labels. The best-performing machine learning algorithms on the development set were MaxEnt for subtask B (L1 penalty, C = 0.3) and linear SVM for subtask A (L1 penalty, L2 loss, C = 0.5), as shown in Table 2.
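The reported classifier settings can be approximated in scikit-learn as in the sketch below. Only the penalties, the loss, the values of C and the class weights are taken from the text; the solver, the iteration limit, the weight of 1 for the neutral class, the grid of C values and all function names are assumptions of this sketch, and the grid search scores candidates by the mean of the positive and negative F1 scores.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# 2x for positive and 4x for negative as reported; weight 1 for neutral is assumed.
CLASS_WEIGHTS = {"pos": 2, "neg": 4, "ntr": 1}

# Mean of the F1 scores of the positive and negative classes.
f_pn_scorer = make_scorer(f1_score, labels=["pos", "neg"], average="macro")


def make_subtask_b_model(use_class_weights: bool = True):
    """MaxEnt with L1 penalty and C = 0.3; the saga solver is an assumption."""
    clf = LogisticRegression(
        penalty="l1", C=0.3, solver="saga", max_iter=5000,
        class_weight=CLASS_WEIGHTS if use_class_weights else None)
    # with_mean=False keeps sparse feature matrices sparse.
    return make_pipeline(StandardScaler(with_mean=False), clf)


def make_subtask_a_model():
    """Linear SVM with L1 penalty, L2 (squared hinge) loss and C = 0.5."""
    clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                    C=0.5, max_iter=5000)
    return make_pipeline(StandardScaler(with_mean=False), clf)


def tune_c(model, X, y, c_values=(0.1, 0.3, 0.5, 1.0), cv=3):
    """Grid search over C for the final step of the pipeline."""
    param_name = model.steps[-1][0] + "__C"
    search = GridSearchCV(model, {param_name: list(c_values)},
                          scoring=f_pn_scorer, cv=cv)
    return search.fit(X, y)
```

Calling `tune_c(make_subtask_b_model(), X, y)` on a feature matrix `X` with labels `y` from {pos, neg, ntr} returns a fitted search object whose `best_params_` attribute holds the selected value of C.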

Experiments and conclusion
In order to determine the importance of individual features, ablation experiments were carried out for both subtasks by deactivating one group of features at a time. Tables 3 and 4 show the resulting changes in the official criterion F_p/n separately for each subset of the development and test sets, as well as micro-averaged across the full development set (DEV) and test set (GOLD). Rows are ordered by feature impact on the full gold standard. Positive values indicate that a feature group has a negative impact on classification quality: results are improved by omitting the features (which is often the case for the Sarcasm subset).
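The sign convention of the ablation tables can be made concrete with a short Python sketch. It assumes that the official criterion is the mean of the F1 scores of the positive and negative classes, as defined for the SemEval task; the function names are hypothetical and the predictions would come from classifiers trained with and without the respective feature group.

```python
from typing import Dict, List


def f1_for_class(gold: List[str], pred: List[str], label: str) -> float:
    """F1 score of a single class, 0.0 if the class is never predicted correctly."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f_pn(gold: List[str], pred: List[str]) -> float:
    """Mean of the positive and negative F1 scores; neutral only enters via errors."""
    return (f1_for_class(gold, pred, "pos") + f1_for_class(gold, pred, "neg")) / 2


def ablation_deltas(gold: List[str], full_pred: List[str],
                    ablated_preds: Dict[str, List[str]]) -> Dict[str, float]:
    """Change in F_p/n when a feature group is deactivated.

    A positive value means the score improves without the feature group,
    i.e. the group has a negative impact on classification quality.
    """
    baseline = f_pn(gold, full_pred)
    return {name: f_pn(gold, pred) - baseline
            for name, pred in ablated_preds.items()}
```

Computing the score on the pooled messages of all subsets yields the values for the full development set and the full gold standard.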
The most important features are bag-of-words unigrams and bigrams, closely followed by sentiment lexica. Training class weights had a strong positive impact in subtask B, but decreased performance in subtask A. In our official submission, they were only used for subtask B. Full-message polarity is the third most important feature in subtask A. Other features contributed relatively small individual effects, but were necessary to achieve state-of-the-art performance in combination. They are often specific to one of the subtasks or to a particular subset of the gold standard.
The bottom half of each table shows ablation results for individual sentiment lexica, with all other features active. Key resources are the standard lexica (AFINN, Liu, MPQA) as well as Twitter-specific lexica (Sentiment140, NRC Hashtag). Noisy word lists (DSM extension, SNAP, SentiWords) have a small or even a negative effect. Surprisingly, the standard lexica seem to give misleading cues on the Twitter 2014 subset (Table 3).