Coooolll: A Deep Learning System for Twitter Sentiment Classification

In this paper, we develop a deep learning system for message-level Twitter sentiment classification. Among the 45 submitted systems including the SemEval 2013 participants, our system (Coooolll) is ranked 2nd on the Twitter2014 test set of SemEval 2014 Task 9. Coooolll is built in a supervised learning framework by concatenating the sentiment-specific word embedding (SSWE) features with the state-of-the-art hand-crafted features. We develop a neural network with a hybrid loss function to learn SSWE, which encodes the sentiment information of tweets in the continuous representation of words. To obtain large-scale training corpora, we train SSWE from 10M tweets collected by positive and negative emoticons, without any manual annotation. Our system can be easily re-implemented with the publicly available sentiment-specific word embedding.


Introduction
Twitter sentiment classification aims to classify the sentiment polarity of a tweet as positive, negative or neutral (Jiang et al., 2011; Hu et al., 2013; Dong et al., 2014). The majority of existing approaches follow Pang et al. (2002) and employ machine learning algorithms to build classifiers from tweets with manually annotated sentiment polarity. Under this direction, most studies focus on designing effective features to obtain better classification performance (Pang and Lee, 2008; Liu, 2012; Feldman, 2013). For example, diverse sentiment lexicons and a variety of hand-crafted features have been implemented. To leverage massive tweets containing positive and negative emoticons for automatic feature learning, learning sentiment-specific word embedding has been proposed, and Kalchbrenner et al. (2014) model sentence representation with a Dynamic Convolutional Neural Network.
In this paper, we develop a deep learning system for Twitter sentiment classification. Firstly, we learn sentiment-specific word embedding (SSWE), which encodes the sentiment information of text into the continuous representation of words (Mikolov et al., 2013; Sun et al., 2014). Afterwards, we concatenate the SSWE features with the state-of-the-art hand-crafted features, and build the sentiment classifier with the benchmark dataset from SemEval 2013 (Nakov et al., 2013). To learn SSWE, we develop a tailored neural network, which incorporates the supervision from the sentiment polarity of tweets in the hybrid loss function. We learn SSWE from tweets, leveraging massive tweets with emoticons as distant-supervised corpora without any manual annotations.
We evaluate the deep learning system on the test set of the Twitter Sentiment Analysis Track in SemEval 2014. Our system (Coooolll) is ranked 2nd on the Twitter2014 test set, in a ranking that also includes the SemEval 2013 participants, who had larger training data than we did. The performance of using only SSWE as features is comparable to that of the state-of-the-art hand-crafted features (detailed in Table 3), which verifies the effectiveness of the sentiment-specific word embedding. We release the sentiment-specific word embedding learned from 10 million tweets, which can be easily used to re-implement our system and adopted off-the-shelf in other sentiment analysis tasks.

A Deep Learning System
In this section, we present the details of our deep learning system for Twitter sentiment classification. As illustrated in Figure 1, Coooolll is a supervised learning method that builds the sentiment classifier from tweets with manually annotated sentiment polarity. In our system, the feature representation of a tweet is composed of two parts: the sentiment-specific word embedding features (SSWE features) and the state-of-the-art hand-crafted features (STATE features). In the following parts, we introduce the SSWE features and the STATE features, respectively.

SSWE Features
In this part, we first describe the neural network for learning sentiment-specific word embedding. Then, we generate the SSWE features of a tweet from the embeddings of the words it contains. Our neural network is an extension of the traditional C&W model (Collobert et al., 2011), as illustrated in Figure 2. Unlike the C&W model, which learns word embedding by only modeling the syntactic contexts of words, we develop SSWE$_u$, which captures the sentiment information of sentences as well as the syntactic contexts of words. Given an original (or corrupted) ngram and the sentiment polarity of a sentence as the input, SSWE$_u$ predicts a two-dimensional vector for each input ngram. The two scalars $(f^u_0, f^u_1)$ stand for the language model score and the sentiment score of the input ngram, respectively. The training objectives of SSWE$_u$ are that (1) the original ngram should obtain a higher language model score $f^u_0(t)$ than the corrupted ngram $f^u_0(t^r)$, and (2) the sentiment score of the original ngram $f^u_1(t)$ should be more consistent with the gold polarity annotation of the sentence than that of the corrupted ngram $f^u_1(t^r)$. The loss function of SSWE$_u$ is the linear combination of two hinge losses,
$$loss_u(t, t^r) = \alpha \cdot loss_{cw}(t, t^r) + (1 - \alpha) \cdot loss_{us}(t, t^r) \quad (1)$$
where $t$ is the original ngram, $t^r$ is the corrupted ngram, which is generated from $t$ with the middle word replaced by a randomly selected one, $loss_{cw}(t, t^r)$ is the syntactic loss as given in Equation 2, and $loss_{us}(t, t^r)$ is the sentiment loss as described in Equation 3. The hyper-parameter $\alpha$ weighs the two parts.
$$loss_{cw}(t, t^r) = \max(0, 1 - f^u_0(t) + f^u_0(t^r)) \quad (2)$$
$$loss_{us}(t, t^r) = \max(0, 1 - \delta_s(t) f^u_1(t) + \delta_s(t) f^u_1(t^r)) \quad (3)$$
where $\delta_s(t)$ is an indicator function reflecting the sentiment polarity of a sentence, whose value is 1 if the sentiment polarity of tweet $t$ is positive and -1 if $t$'s polarity is negative. We train sentiment-specific word embedding from 10M tweets collected with positive and negative emoticons (Hu et al., 2013). The details of the training phase are described elsewhere. After learning SSWE, we explore min, average and max convolutional layers (Collobert et al., 2011; Socher et al., 2011; Mitchell and Lapata, 2010) to obtain the tweet representation. The result is the concatenation of the vectors derived from the different convolutional layers.
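The loss and the tweet-level features described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hinge form with a margin of 1 follows the usual C&W ranking loss, the function names `sswe_u_loss` and `pool_tweet` are hypothetical, the default `alpha=0.5` is a placeholder rather than the value used in the system, and the scores and word vectors are toy numbers instead of outputs of a trained network.

```python
# Illustrative sketch only; names, alpha, and all numbers are assumptions.

def sswe_u_loss(f_orig, f_corr, polarity, alpha=0.5):
    """Hybrid loss for one (original, corrupted) ngram pair.

    f_orig, f_corr: (language model score, sentiment score) of each ngram.
    polarity: +1 if the tweet is positive, -1 if it is negative.
    alpha: weight of the syntactic part (placeholder value).
    """
    # Syntactic hinge loss: the original ngram should outscore the corrupted one.
    loss_cw = max(0.0, 1.0 - f_orig[0] + f_corr[0])
    # Sentiment hinge loss: the original ngram's sentiment score should agree
    # with the tweet polarity more strongly than the corrupted ngram's score.
    loss_us = max(0.0, 1.0 - polarity * f_orig[1] + polarity * f_corr[1])
    return alpha * loss_cw + (1.0 - alpha) * loss_us


def pool_tweet(word_vectors):
    """Concatenate the element-wise min, average and max of the word vectors."""
    columns = list(zip(*word_vectors))  # one tuple per embedding dimension
    mins = [min(c) for c in columns]
    avgs = [sum(c) / len(c) for c in columns]
    maxs = [max(c) for c in columns]
    return mins + avgs + maxs
```

With scores `(2.0, 1.5)` for the original ngram and `(0.5, -0.5)` for the corrupted one, the loss is 0.0 for a positive tweet and 1.5 for a negative tweet, since only the sentiment part is violated in the second case. Pooling two 2-dimensional word vectors yields a 6-dimensional tweet vector.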

STATE Features
We re-implement the state-of-the-art hand-crafted features for Twitter sentiment classification. The STATE features are described below.
• All-Caps. The number of words with all characters in upper case.
• Emoticons. We use the presence of positive (or negative) emoticons and whether the last unit of a segmentation is an emoticon.
• Elongated Units. The number of elongated words (with one character repeated more than two times), such as gooood.
• Sentiment Lexicon. We utilize several sentiment lexicons to generate features. We explore the number of sentiment words, the score of the last sentiment word, the total sentiment score and the maximal sentiment score for each lexicon.
• Negation. The number of individual negations within a tweet.
• Punctuation. The number of contiguous sequences of dot, question mark and exclamation mark.
• Cluster. The presence of words from each of the 1,000 clusters from the Twitter NLP tool (Gimpel et al., 2011).
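A few of the surface-level STATE features listed above (All-Caps, Elongated Units, Negation, Punctuation) could be computed roughly as follows. This is a minimal sketch, not the re-implementation used in the system: the function name `surface_features`, the small negation list, and the choice to count punctuation runs of length two or more are all assumptions made for illustration. The lexicon, emoticon and cluster features are omitted because they depend on external resources.

```python
import re

# Hypothetical negation list for illustration; the system's own list is not shown here.
NEGATIONS = {"not", "no", "never", "cannot"}
# A letter repeated more than two times in a row, e.g. "gooood".
ELONGATED = re.compile(r"([A-Za-z])\1{2,}")
# Assumption: a "contiguous sequence" is a run of at least two of . ? !
PUNCT_RUN = re.compile(r"[.?!]{2,}")


def surface_features(tokens):
    """Count a few surface cues in a tokenized tweet."""
    text = " ".join(tokens)
    return {
        # Words whose characters are all upper-case letters.
        "all_caps": sum(1 for t in tokens if t.isalpha() and t.isupper()),
        # Words containing an elongated character run.
        "elongated": sum(1 for t in tokens if ELONGATED.search(t)),
        # Negation words, including contractions ending in "n't".
        "negation": sum(
            1 for t in tokens
            if t.lower() in NEGATIONS or t.lower().endswith("n't")
        ),
        # Runs of dots, question marks and exclamation marks.
        "punctuation": len(PUNCT_RUN.findall(text)),
    }
```

For the tokens `This is SO gooood !!! not bad ...`, the sketch counts one all-caps word, one elongated word, one negation and two punctuation runs.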

Experiments
We evaluate our deep learning system by applying it to Twitter sentiment classification within a supervised learning framework. We conduct experiments on both positive/negative/neutral and positive/negative classification of tweets.

Dataset and Setting
We train the Twitter sentiment classifier on the benchmark dataset in SemEval 2013 (Nakov et al., 2013). The training and development sets were released in full to the task participants of SemEval 2013. However, we were unable to download all of the training and development sets because some tweets had been deleted or were not available due to modified authorization status. The distribution of our dataset is given in Table 1. We train sentiment classifiers with LibLinear (Fan et al., 2008) on the training set and dev set, and tune the parameters -c and -wi of the SVM on the test set of SemEval 2013. In both experiment settings, the evaluation metric is the macro-F1 of the positive and negative classes (Nakov et al., 2013).
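The evaluation metric can be illustrated with a short Python sketch: F1 is computed separately for the positive and the negative class and then averaged, so the neutral class affects the score only through misclassifications. The function names and the label strings are assumptions chosen for this example, not part of the official scorer.

```python
def f1_for_class(gold, pred, label):
    """F1 of a single class from parallel lists of gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # avoids division by zero when the class is never hit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_pos_neg(gold, pred):
    """Average of the F1 scores of the positive and negative classes."""
    return (f1_for_class(gold, pred, "positive")
            + f1_for_class(gold, pred, "negative")) / 2
```

In the three-class setting, a neutral tweet predicted as negative lowers the precision of the negative class even though neutral has no F1 term of its own.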

Results and Analysis
The experiment results of different methods on positive/negative/neutral and positive/negative Twitter sentiment classification are listed in Table 3. The meanings of T1∼T5 in each column are described in Table 2.
Table 3: Macro-F1 of positive and negative classes in positive/negative/neutral and positive/negative Twitter sentiment classification on the test sets (T1-T5, detailed in Table 2).
From Table 3 (left table), we find that the learned sentiment-specific word embedding features (SSWE) perform comparably with the state-of-the-art hand-crafted features (STATE), especially on the Twitter-relevant test sets (T3 and T4). (The result of STATE on T3 differs from previously reported results, because our training data differs in one case and our -wi setting of the SVM differs in the other.) After feature combination, Coooolll yields 4.22% and 3.07% improvement in macro-F1 on T3 and T4, which verifies the effectiveness of SSWE in learning discriminative features from massive data for Twitter sentiment classification. Among the 45 teams, Coooolll ranks 5th, 2nd, 3rd and 2nd on T1-T4 respectively, in a ranking that also includes the SemEval 2013 participants, who had larger training data. We also compare SSWE with the context-based word embedding (W2V), which does not capture the sentiment supervision of tweets. (For W2V we utilize the Skip-gram model; the embedding is trained on the same 10M tweets collected by positive and negative emoticons as SSWE.) We find that W2V is not effective enough for Twitter sentiment classification, as there is a big gap between W2V and SSWE on T1-T4. The reason is that W2V does not capture the sentiment information of text, which is crucial for sentiment analysis tasks and is effectively leveraged in learning the sentiment-specific word embedding.
We also conduct experiments on the positive/negative classification of tweets. The reason is that the sentiment-specific word embedding is learned from the positive/negative supervision of tweets through emoticons, which is tailored for positive/negative classification of tweets. From Table 3 (right table), we find that the performance of positive/negative Twitter classification is consistent with the performance of 3-class classification. SSWE performs comparably to STATE on T3 and T4, and yields better performance (1.62% and 1.45% improvements on T3 and T4, respectively) through feature combination. SSWE outperforms W2V by large margins (more than 10%) on T3 and T4, which further verifies the effectiveness of the sentiment-specific word embedding.

Conclusion
We develop a deep learning system (Coooolll) for message-level Twitter sentiment classification in this paper. The feature representation of Coooolll is composed of two parts: the state-of-the-art hand-crafted features and the sentiment-specific word embedding (SSWE) features. The SSWE is learned from 10M tweets collected by positive and negative emoticons, without any manual annotation. The effectiveness of Coooolll has been verified in both positive/negative/neutral and positive/negative classification of tweets. Among the 45 systems of SemEval 2014 Task 9 subtask (b), Coooolll ranks 2nd on the Twitter2014 test set, in a ranking that also includes the SemEval 2013 participants, who had larger training data.