Subtask E Task Description

SemEval-2015 Task #10 (Sentiment Analysis on Twitter) Subtask E (Determining strength of association of Twitter terms with positive sentiment)


Version 1: May 29, 2014
Updated on: September 23, 2014


Task Description

Given a word or a phrase, provide a score between 0 and 1 that is indicative of its strength of association with positive sentiment. A score of 1 indicates maximum association with positive sentiment (or least association with negative sentiment) and a score of 0 indicates least association with positive sentiment (or maximum association with negative sentiment). If a word is more positive than another, then it should have a higher score than the other.



Many of the top performing sentiment analysis systems in recent SemEval competitions (2013 Task 2, 2014 Task 4, and 2014 Task 9) rely on automatically generated sentiment lexicons. Sentiment lexicons are lists of words (and phrases) with prior associations with positive and negative sentiment. Some lexicons additionally provide a sentiment score for each term to indicate its evaluative intensity, with higher scores indicating greater intensity. Existing manually created sentiment lexicons tend to have only discrete labels for terms (positive, negative, neutral) but no real-valued scores indicating the intensity of sentiment. Here, for the first time, we manually create a dataset of words with real-valued intensity scores. The goal of this task is to evaluate automatic methods of generating sentiment lexicons, especially those that also produce real-valued scores of sentiment intensity or association.


Target Terms

The target terms were chosen from English tweets posted in 2011, 2012, and 2013. They may be single words or phrases. Some terms are hashtagged words such as '#loveumom'. Some terms may be misspelled or have creative spellings of the kind commonly seen in tweets (for example, 'parlament' or 'happeeee'). Some terms may be abbreviations, shortenings, or slang. Some terms are negated expressions such as 'no fun'. All of these terms were manually annotated to obtain their strength of association scores.

The trial dataset includes 200 instances. No training data will be provided. There will be about 1300 instances in the test set (to be released later). The trial data is large enough to be used as a development set; it can even be used for training. (Note: the test data and the trial data have no terms in common.) You are free to use any additional manually or automatically generated resources; however, all resources must be clearly identified in the submission files and the system description paper.


Trial Data Format

The trial dataset has the following format:

  • Each line corresponds to a unique term (single word or phrase)
  • Each line has the format: term<tab>score
    where 'score' is the strength of association with positive sentiment---a number between 0 and 1.
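As a concrete illustration, parsing this format in Python might look like the following sketch (the function name `parse_lexicon` and the example terms and scores are made up for illustration):

```python
def parse_lexicon(lines):
    """Parse lines in term<tab>score format into a dict mapping term -> float."""
    lexicon = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        term, score = line.split("\t")
        lexicon[term] = float(score)
    return lexicon

# Example with two made-up entries (not taken from the actual trial data):
sample = ["great\t0.94\n", "awful\t0.07\n"]
lex = parse_lexicon(sample)
```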


Training Data 

No training data will be released. You are free to use the trial data for training.


Test Data and System Submission Format

The test set will have one term per line in random order. Your submission should have the same format as the trial dataset:

  • Each line should correspond to a unique term
  • Each line should have the format: term<tab>score
    where 'score' is the strength of association with positive sentiment---a number between 0 and 1.

The terms in your submission file may appear in any order: keeping the order of the test file, or sorting terms in ascending or descending order of sentiment score, are both reasonable options, but neither is required.
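A minimal sketch of producing a submission file in this format (the function name `format_submission` and the 4-decimal rounding are illustrative assumptions, not requirements of the task):

```python
def format_submission(scores):
    """Render a {term: score} dict as term<tab>score lines, one term per line."""
    return "".join(f"{term}\t{score:.4f}\n" for term, score in scores.items())

# Writing the result to disk (path is a placeholder):
# with open("subtaskE_submission.txt", "w", encoding="utf-8") as f:
#     f.write(format_submission(predicted_scores))
```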



Evaluation

System ratings for terms are evaluated by first ranking the terms according to sentiment score and then comparing this ranked list to a ranked list obtained from human annotations. Kendall's Tau will be used as the metric to compare the ranked lists. (We will provide scores for Spearman's Rank Correlation as well, but participating teams will be ranked by Kendall's Tau.)
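To illustrate the metric, here is a simplified pure-Python version of Kendall's Tau (the tau-a variant, which ignores tie corrections; the official evaluation script may handle ties differently):

```python
from itertools import combinations

def kendall_tau(gold, pred):
    """Kendall's tau-a between two equal-length score lists:
    (concordant pairs - discordant pairs) / total pairs.
    O(n^2) over all pairs, written for clarity rather than speed."""
    n = len(gold)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        g = gold[i] - gold[j]
        p = pred[i] - pred[j]
        if g * p > 0:
            concordant += 1   # pair ordered the same way in both lists
        elif g * p < 0:
            discordant += 1   # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A perfect ranking scores 1.0, a fully reversed ranking scores -1.0, and unrelated rankings score near 0.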

We have released an evaluation script so that participants can:

  • make sure the output is in the right format
  • track progress of their system's performance on the trial data


Creation of the gold data annotations: MaxDiff method

Assigning a numerical score to indicate the degree of sentiment is not a natural task for people. Different people may assign different scores to the same target item, and it is hard even for the same annotator to remain consistent when annotating a large number of items. In contrast, it is much easier for annotators to determine whether one word is more positive (or more negative) than another. However, pairwise comparison requires a much larger number of annotations than direct scoring. MaxDiff is an annotation scheme that retains the comparative aspect of annotation while still requiring only a small number of annotations (Louviere, 1991).


The annotator is presented with four terms and asked which is the most positive and which is the least positive. Answering just these two questions reveals five of the six pairwise inequalities. Consider a set in which a respondent evaluates four terms: A, B, C, and D. If the respondent says that A is most positive and D is least positive, these two responses inform us that:
                           A > B, A > C, A > D, B > D, C > D
The responses to the MaxDiff questions can then be easily translated into a ranking of all the terms, as well as a real-valued score for each term. The MaxDiff method is widely used in market survey questionnaires (Almquist & Lee, 2009). It was also used by Jurgens, Mohammad, Turney, and Holyoak (2012) to determine the relational similarity of pairs of items in a SemEval-2012 shared task.
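One common way to turn MaxDiff responses into scores is simple best/worst counting; the sketch below illustrates the idea, though the exact scoring procedure used by the task organizers may differ (the function name `maxdiff_scores` and the linear rescaling to [0, 1] are assumptions for illustration):

```python
from collections import defaultdict

def maxdiff_scores(responses):
    """Score MaxDiff responses by counting.

    Each response is a tuple (items, most_positive, least_positive).
    A term's raw score is (#times chosen most - #times chosen least)
    divided by #times it appeared, which lies in [-1, 1]; we rescale
    it linearly to [0, 1] to match the task's score range.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for items, most, least in responses:
        for term in items:
            seen[term] += 1
        best[most] += 1
        worst[least] += 1
    return {t: ((best[t] - worst[t]) / seen[t] + 1) / 2 for t in seen}
```

For the example above, a single response picking A as most positive and D as least positive yields A = 1.0, D = 0.0, and B = C = 0.5; with many overlapping 4-term sets, the counts converge to a full ranking.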


We used MaxDiff questions to generate the gold dataset of words and associated sentiment scores. Further details are available in:

           Sentiment Analysis of Short Informal Texts. Svetlana Kiritchenko, Xiaodan Zhu and
           Saif Mohammad. Journal of Artificial Intelligence Research, vol. 50, pages 723-762.



References

  • Kendall, M. (1938). "A New Measure of Rank Correlation". Biometrika 30 (1–2): 81–89. doi:10.1093/biomet/30.1-2.81. JSTOR 2332226.
  • Nelsen, R.B. (2001), "Kendall tau metric", in Hazewinkel, Michiel, Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
  • Svetlana Kiritchenko, Xiaodan Zhu and Saif Mohammad. Sentiment Analysis of Short Informal Texts. Journal of Artificial Intelligence Research, vol. 50, pages 723-762.
  • Jordan J. Louviere. 1991. Best-worst scaling: A model for the largest difference judgments. Working Paper.
  • The MaxDiff System Technical Paper - Sawtooth Software.
  • Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the seventh international workshop on Semantic Evaluation Exercises (SemEval-2013), June 2013, Atlanta, USA.
  • Preslav Nakov, Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. In Proceedings of the 7th International Workshop on Semantic Evaluation. Association for Computational Linguistics.

Contact Info

  • Sara Rosenthal, Columbia University
  • Alan Ritter, The Ohio State University
  • Veselin Stoyanov, Facebook
  • Svetlana Kiritchenko, NRC Canada
  • Saif Mohammad, NRC Canada
  • Preslav Nakov, Qatar Computing Research Institute


Other Info