Blinov: Distributed Representations of Words for Aspect-Based Sentiment Analysis at SemEval 2014


Introduction
Sentiment analysis has become an important Natural Language Processing (NLP) task in recent years. Like many NLP tasks, it is a challenging one. Sentiment analysis can be very helpful in practical applications; for example, it makes it possible to study users' opinions about a product automatically.
Much research has been devoted to general sentiment analysis (Pang et al., 2002), (Amine et al., 2013), (Blinov et al., 2013) or to the analysis of individual sentences (Yu and Hatzivassiloglou, 2003), (Kim and Hovy, 2004), (Wiebe and Riloff, 2005). It soon became clear that sentiment analysis at the level of a whole text, or even of a sentence, is too coarse. General sentiment analysis is by design incapable of analysing an expressed opinion in detail. For example, it cannot correctly detect the opinion in the sentence "Great food but the service was dreadful!", which carries opposite opinions on two facets of a restaurant. A more detailed version of sentiment analysis is therefore needed. This version is called aspect-based sentiment analysis, and it works at the level of the significant aspects of the target entity (Liu, 2012).
Aspect-based sentiment analysis includes two main subtasks: aspect term extraction and detection of the term's polarity (Liu, 2012). In this article we describe methods that address both subtasks. The methods are based on distributed representations of words. Such word representations (or word embeddings) are useful in many NLP tasks, e.g. (Turian et al., 2009), (Al-Rfou' et al., 2013), (Turney, 2013).
The remainder of the article is organised as follows: section two gives an overview of the data; the third section briefly describes distributed representations of words. The methods for aspect term extraction and polarity detection are presented in the fourth and fifth sections respectively. The conclusions are given in the sixth section.

The Data
The organisers provided training data for the restaurant and laptop domains. However, as will become clear below, our methods depend heavily on unlabelled text data, so we additionally collected user reviews of restaurants from tripadviser.com and of laptops from amazon.com. General statistics of the data are shown in Table 1. For all the data we performed tokenization, stemming and morphological analysis using the FreeLing library (Padró and Stanilovsky, 2012).

Distributed Representations of Words
In this section we give a high-level idea of distributed representations of words. More technical details can be found in (Mikolov et al., 2013).
Distributed representations are closely related to a promising new direction in machine learning called deep learning. The core idea of unsupervised deep learning algorithms is to automatically find a "good" set of features to represent the target object (text, image, audio signal, etc.). The representation of an object as a vector of real numbers is called a distributed representation (Rumelhart et al., 1986). We used the skip-gram model (Mikolov et al., 2013) as implemented in the Gensim toolkit (Řehůřek and Sojka, 2010).
In general the learning procedure is as follows. All the texts of the corpus are concatenated into a single sequence of sentences, and a lexicon is built from the corpus. Next, the dimensionality of the vectors is chosen (we used 300 in our experiments). A larger number of dimensions captures more language regularities but makes learning computationally more expensive. Each word in the lexicon is associated with a real-valued vector of the selected dimensionality. Initially all the vectors are set to random values. During learning the algorithm "slides" a fixed-size window (an algorithm parameter that we left at its default of 5 words) along the words of the sequence and calculates the probability (1) of the context words appearing within the window, given the central word under consideration (or more precisely, its vector representation) (Mikolov et al., 2013). The ultimate goal of the process is to obtain "good" vectors for each word, i.e. vectors that allow its probable context to be predicted. Together these vectors form a vector space in which semantically similar words are grouped.
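The sliding-window step can be illustrated with a minimal sketch. It assumes that the window size counts words on each side of the central word and that the context probability is the plain softmax over dot products from the basic skip-gram formulation; the function names and the toy vectors are hypothetical, and the training itself (done with Gensim in our experiments) is not shown.

```python
import math

def skipgram_pairs(tokens, window=5):
    """Yield (centre, context) pairs for every position of the sliding window."""
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield centre, tokens[j]

def context_probability(centre_vec, output_vecs, context_word):
    """Softmax probability of `context_word` given the centre word's vector.

    output_vecs maps every word of the (toy) lexicon to its output vector.
    """
    scores = {w: sum(a * b for a, b in zip(centre_vec, v))
              for w, v in output_vecs.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    return exps[context_word] / sum(exps.values())

# Toy sentence; window=1 keeps the example small.
tokens = ["great", "food", "but", "slow", "service"]
pairs = list(skipgram_pairs(tokens, window=1))
```

Training adjusts the vectors so that pairs observed in the corpus receive a high probability; practical implementations replace the full softmax with cheaper approximations.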

Aspect Term Extraction Method
We apply the same method to the aspect term extraction task (Pontiki et al., 2014) in both domains. The method consists of two steps: candidate selection and term extraction.

Candidate Selection
First of all we collected some statistics about the terms in the training collection. We analysed two facets of the aspect terms: the number of words and their morphological structure. The number of words per term is shown in Table 2. On that basis we decided to process only one-word and two-word aspect terms. Among one-word terms we treat only singular nouns (NN, e.g. staff, rice, texture, processor, ram, insult) and plural nouns (NNS, e.g. perks, bagels, times, dvds, buttons, pictures) as possible candidates, because they largely predominate among the one-word terms. Likewise, all combinations of the form NN_NN (e.g. sea_bass, lotus_leaf, chicken_dish, battery_life, virus_protection, customer_disservice) and NN_NNS (e.g. sushi_places, menu_choices, seafood_lovers, usb_devices, recovery_discs, software_works) were candidates for two-word terms, because they are the most common patterns among two-word aspect terms.
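The candidate selection can be sketched as a simple filter over part-of-speech tags. The sketch assumes that each sentence is already available as a list of (word, tag) pairs; the function name is hypothetical and the tagging itself is not shown.

```python
def select_candidates(tagged):
    """Return one-word (NN, NNS) and two-word (NN_NN, NN_NNS) candidates.

    tagged: list of (word, pos_tag) pairs for one sentence.
    """
    candidates = []
    for i, (word, pos) in enumerate(tagged):
        if pos in ("NN", "NNS"):
            candidates.append(word)
        # Two-word candidate: a singular noun followed by any noun.
        if pos == "NN" and i + 1 < len(tagged) and tagged[i + 1][1] in ("NN", "NNS"):
            candidates.append(word + "_" + tagged[i + 1][0])
    return candidates

sentence = [("the", "DT"), ("battery", "NN"), ("life", "NN"),
            ("is", "VBZ"), ("short", "JJ")]
print(select_candidates(sentence))  # ['battery', 'battery_life', 'life']
```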

Term Extraction
The second step of aspect term identification is the term extraction. As already mentioned, the vector space (see Section 3) defines groups of words, so a measure of similarity between words (vectors) can be defined. For NLP tasks this is often the cosine similarity measure. The similarity between two vectors $a = (a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)$ is

$$\cos(\theta) = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}},$$

where $\theta$ is the angle between the vectors and $n$ is the dimensionality of the space.
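The cosine similarity measure mentioned above can be computed directly from two word vectors. The following is a minimal sketch; returning 0.0 for a zero vector is our assumption for the degenerate case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal dimensionality."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: similarity with a zero vector is treated as 0
    return dot / (norm_a * norm_b)
```

The value is 1 for vectors pointing in the same direction, 0 for orthogonal vectors and -1 for opposite ones.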
In the restaurant domain both the categories and the aspect terms are specified. For each category a seed set of aspect terms can be selected automatically: if only one category is assigned to a training sentence, then all of its terms belong to that category. Within each seed set the average similarity between the terms (the category threshold) can be computed.
For a new candidate the average similarity with each category's seeds is calculated. If it is greater than the threshold of any category, then the candidate is marked as an aspect term.
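The threshold test can be sketched as follows. The sketch assumes that the seed terms are given as vectors grouped by category, that every category has at least two seeds, and that the threshold is the mean over all seed pairs; the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def category_threshold(seed_vecs):
    """Average pairwise similarity between the seed terms of one category."""
    pairs = list(combinations(seed_vecs, 2))  # needs at least two seeds
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def is_aspect_term(candidate_vec, seeds_by_category):
    """True if the candidate is closer to some category's seeds than its threshold."""
    for seed_vecs in seeds_by_category.values():
        average = sum(cosine(candidate_vec, s) for s in seed_vecs) / len(seed_vecs)
        if average > category_threshold(seed_vecs):
            return True
    return False

# Toy two-dimensional seeds for a single category.
seeds = {"food": [[1.0, 0.0], [0.8, 0.6]]}
print(is_aspect_term([0.95, 0.3], seeds))  # True
print(is_aspect_term([0.0, 1.0], seeds))   # False
```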
We also applied a few additional rules: join consecutive terms into a single term; join a neutral adjective that immediately precedes the term (see Section 5.2 for clarification of the neutral adjective); and join fragments matching the pattern <an aspect term> of <an aspect term>.
In the laptop domain there are no specified categories, so we treated all terms as belonging to one general category and applied the same procedure to the candidates.
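The three joining rules listed above can be sketched as a post-processing pass over one sentence. The sketch assumes that every token already carries two flags, one marking extracted aspect terms and one marking neutral adjectives; the function name, the flag lists and the order in which the rules are applied are our assumptions.

```python
def join_terms(tokens, is_term, is_neutral_adj):
    """Merge flagged tokens into multi-word aspect terms.

    tokens: list of words; is_term and is_neutral_adj: parallel lists of booleans.
    Returns the joined terms as strings.
    """
    spans = []
    i, n = 0, len(tokens)
    while i < n:
        if not is_term[i]:
            i += 1
            continue
        start = i
        while i < n and is_term[i]:  # rule 1: join consecutive terms
            i += 1
        if start > 0 and is_neutral_adj[start - 1]:  # rule 2: neutral adjective ahead
            start -= 1
        spans.append((start, i))
    merged = []
    for start, end in spans:
        # rule 3: <an aspect term> of <an aspect term>
        if merged and start - merged[-1][1] == 1 and tokens[merged[-1][1]] == "of":
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return [" ".join(tokens[s:e]) for s, e in merged]

tokens = ["the", "glass", "of", "house", "wine", "was", "fine"]
is_term = [False, True, False, True, True, False, False]
no_adj = [False] * len(tokens)
print(join_terms(tokens, is_term, no_adj))  # ['glass of house wine']
```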

Category Detection
For the restaurant domain there was also the aspect category detection task (Pontiki et al., 2014).
Since each word is represented by a vector, each sentence can be mapped to a single point, the average of its word vectors. An average point for each category can then be found from the points of its sentences. For an unseen sentence the average point of its word vectors is calculated, and the category is selected by computing the distances between the new point and all category points and choosing the category at minimum distance.
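This nearest-point procedure can be sketched as follows. The sketch assumes Euclidean distance (the distance measure is not fixed in the text), one category label per training sentence, and at least one known word in every sentence; the function names and toy embeddings are hypothetical.

```python
import math

def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sentence_point(tokens, embeddings):
    """Average of the vectors of all tokens that are in the lexicon."""
    return mean_vector([embeddings[t] for t in tokens if t in embeddings])

def category_points(train, embeddings):
    """train: list of (tokens, category) pairs. Returns one average point per category."""
    by_category = {}
    for tokens, category in train:
        by_category.setdefault(category, []).append(sentence_point(tokens, embeddings))
    return {c: mean_vector(points) for c, points in by_category.items()}

def detect_category(tokens, points, embeddings):
    """Category whose average point is closest to the sentence point."""
    target = sentence_point(tokens, embeddings)
    return min(points, key=lambda c: math.dist(target, points[c]))

embeddings = {"pizza": [1.0, 0.0], "pasta": [0.9, 0.1],
              "waiter": [0.0, 1.0], "staff": [0.1, 0.9]}
train = [(["pizza", "pasta"], "food"), (["waiter", "staff"], "service")]
points = category_points(train, embeddings)
print(detect_category(["the", "pasta"], points, embeddings))  # food
```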

Results
The aspect term extraction and the aspect category detection tasks were evaluated with Precision, Recall and F-measure (Pontiki et al., 2014). The F-measure was the primary metric for these tasks, so we report only it.
Our method ranked 19th out of 28 submissions (constrained and unconstrained) in the aspect term extraction task for the laptop domain and 17th out of 29 for the restaurant domain. In the category detection task (restaurant domain) the method ranked 9th out of 21. Table 3 shows the results of our method (in bold) for the aspect term extraction task in comparison with the baseline (Pontiki et al., 2014) and the best result. The results for the aspect category detection task are presented analogously in Table 4.

Polarity Detection Method
Our polarity detection method also exploits the vector space (from Section 3), because the emotional similarity between words can be traced in it. As with the aspect term extraction method, we follow a two-stage approach: candidate selection and polarity detection.

Candidate Selection
All adjectives and verbs are considered polarity term candidates. Amplifiers and negations play an important role in forming the resulting polarity. In our method we took only negations into account, because they strongly affect word polarity. We joined into one unit all text fragments that match the following pattern: not + <JJ | VB>.
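Joining a negation with the following adjective or verb can be sketched as a single pass over a tagged sentence. The sketch assumes (word, tag) pairs as input and treats every tag starting with JJ or VB as a match, which is broader than the two tags named in the pattern; the function name is hypothetical.

```python
def join_negations(tagged):
    """Merge 'not' with a following adjective or verb into one token, e.g. not_work."""
    joined = []
    i = 0
    while i < len(tagged):
        word, pos = tagged[i]
        has_next = i + 1 < len(tagged)
        if word.lower() == "not" and has_next and tagged[i + 1][1].startswith(("JJ", "VB")):
            next_word, next_pos = tagged[i + 1]
            joined.append(("not_" + next_word, next_pos))  # keep the tag of the joined word
            i += 2
        else:
            joined.append((word, pos))
            i += 1
    return joined

sentence = [("does", "VBZ"), ("not", "RB"), ("work", "VB")]
print(join_negations(sentence))  # [('does', 'VBZ'), ('not_work', 'VB')]
```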

Term Polarity Detection
First we manually collected small reference (seed) sets of positive and negative words for each domain. Each set contained 15 words that clearly identify the sentiment. For example, the positive set included words such as great, fast, attentive and yummy, and the negative set words such as terrible, ugly, not_work and offensive. By measuring the average similarity of a candidate to the positive and the negative seed words we decided whether it is positive (+1) or negative (-1). We also set a neutral threshold, and a candidate's polarity was treated as neutral (0) if its similarity did not exceed the threshold.
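The seed-based decision can be sketched as follows. How exactly the neutral threshold is compared with the two average similarities is not specified above, so the sketch assumes that a candidate is neutral when neither average exceeds the threshold; the function names, the threshold value and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def term_polarity(vec, positive_seeds, negative_seeds, neutral_threshold):
    """Return +1, -1 or 0 from the average similarity to each seed set."""
    pos = sum(cosine(vec, s) for s in positive_seeds) / len(positive_seeds)
    neg = sum(cosine(vec, s) for s in negative_seeds) / len(negative_seeds)
    if max(pos, neg) <= neutral_threshold:  # assumption: neither side is similar enough
        return 0
    if pos > neg:
        return 1
    if neg > pos:
        return -1
    return 0  # assumption: an exact tie is treated as neutral

# Toy two-dimensional seed vectors.
positive_seeds = [[1.0, 0.0], [0.9, 0.1]]
negative_seeds = [[-1.0, 0.0], [-0.9, 0.1]]
print(term_polarity([1.0, 0.05], positive_seeds, negative_seeds, 0.2))  # 1
```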
For each aspect term we looked for the closest polarity term candidates (within a window of 6 words) and summed their polarities. The final decision about the term's polarity followed these conditions: if sum > 0, then positive; if sum < 0, then negative; if sum == 0 and all polarity terms are neutral, then neutral; otherwise conflict.
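The final decision rule can be written down directly. The sketch takes the polarities (+1, -1 or 0) of the polarity terms found near an aspect term as given; how they are gathered within the window is not shown, and the function name is hypothetical.

```python
def aspect_polarity(polarities):
    """Map the polarities of nearby polarity terms to one of four labels."""
    total = sum(polarities)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    if all(p == 0 for p in polarities):
        return "neutral"
    return "conflict"  # the sum is zero but opposite polarities cancel out

print(aspect_polarity([1, 0]))   # positive
print(aspect_polarity([1, -1]))  # conflict
```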

Category Polarity Detection
By analogy with the category detection method, we used the training collection to calculate the average polarity points for each category, i.e. there were 5×4 such points (5 categories and 4 polarity values). A sentence was then mapped to a point as the average of all its word vectors, and the closest polarity point for each specified category determined the polarity.
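The lookup among the polarity points of one category can be sketched as follows. The sketch assumes Euclidean distance and a dictionary keyed by (category, polarity) pairs; the function name and the toy points are hypothetical, and the sentence point is taken as already computed.

```python
import math

POLARITIES = ("positive", "negative", "neutral", "conflict")

def category_polarity(sentence_point, category, polarity_points):
    """Polarity whose average point for `category` is closest to the sentence point.

    polarity_points maps (category, polarity) to an average vector.
    """
    available = [p for p in POLARITIES if (category, p) in polarity_points]
    return min(available,
               key=lambda p: math.dist(sentence_point, polarity_points[(category, p)]))

# Toy points for one category; the full setting has up to 5 x 4 of them.
toy_points = {("food", "positive"): [1.0, 0.0], ("food", "negative"): [-1.0, 0.0]}
print(category_polarity([0.8, 0.1], "food", toy_points))  # positive
```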

Results
The results of our method (in bold) for the polarity detection tasks are close to the baseline results on the Accuracy measure (Tables 5, 6). However, the test data is skewed towards the positive class, and in that case Accuracy is a poor indicator. We therefore also report macro F-measure results for our method and the baseline (Tables 7, 8). From these we conclude that our polarity detection method handles the under-represented classes better than the baseline method.
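Why Accuracy is a poor indicator on skewed data can be seen in a small sketch. The labels below are invented toy data, not the task's test set, and the macro F-measure is assumed to be the unweighted mean of the per-class F1 scores; the function names are hypothetical.

```python
def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)

# Invented skewed sample: 8 positive and 2 negative items.
gold = ["positive"] * 8 + ["negative"] * 2
always_positive = ["positive"] * 10
print(accuracy(gold, always_positive))            # 0.8
print(round(macro_f1(gold, always_positive), 3))  # 0.444
```

A classifier that always answers "positive" looks good on Accuracy but is penalised by the macro F-measure, because it never finds the minority class.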

Conclusion
In this article we presented methods for the two main subtasks of aspect-based sentiment analysis: aspect term extraction and polarity detection. The methods are based on distributed representations of words and the notion of similarity between words.
For the aspect term extraction and category detection tasks we obtained satisfactory results that are consistent with our cross-validation metrics. Unfortunately, for the polarity detection tasks the results of our method on the official metrics are low. However, we showed that the proposed method copes with the skewed data better than the baseline method.