SemEval-2015 Task 11: Sentiment Analysis of Figurative Language in Twitter



One of the most difficult problems when assigning positive or negative polarity in sentiment analysis tasks is to determine accurately the truth value of a given statement. For literal language, where there is no secondary meaning, existing techniques already achieve good results. However, for figurative language such as irony and affective metaphor, where secondary or extended meanings are intentionally profiled, the affective polarity of the literal meaning may contrast sharply with the affect created by the figurative meaning. Nowhere is this effect more pronounced than in ironic or sarcastic language, which delights in using affirmative language to convey negative meanings. Metaphor, irony and figurative language more generally demonstrate the limits of conventional techniques for the sentiment analysis of literal texts.


In this respect, figurative language poses a significant challenge for a sentiment analysis system, as direct approaches based on words and their lexical semantics are often shown to be inadequate in the face of indirect meanings. It would be convenient, then, if such language were rare and confined to specific genres of text, such as poetry and literature. Yet the reality is that figurative language is pervasive in almost any genre of text, and is especially commonplace in the texts of the Web and in social media communications. Figurative language often draws attention to itself as a creative artifact, but is just as likely to be viewed as part of the general fabric of human communication. In any case, Web users employ figures of speech (both old and new) to project their personality through a text, especially when limited to the 140 characters of a tweet.


Natural language researchers have attacked the problems associated with figurative interpretations at multiple levels of linguistic representation. Some have focused on the conceptual level, of which the text is a surface instantiation, to identify the schemas and mappings that are implied by a figure of speech (e.g. Veale and Keane (1992); Barnden (2010); Veale (2012)). These approaches yield deep insights but lack robustness in the face of diverse texts. More robust approaches focus instead on the surface level of a text, and consider the choice of words, syntactic order, lexical properties and affective profiles of the elements that make up the text (e.g. Reyes and Rosso (2012, 2013)). Surface analysis yields a range of features that can be efficiently extracted and fed into one or more machine-learning algorithms.


When it comes to analyzing the texts of the Web, the Web can also be used as a convenient source of ancillary knowledge and features. Veale and Hao (2007) describe a semi-automatic means of harvesting common-sense knowledge of stereotypes from the Web, by directly targeting simile constructions of the form “as X as Y” (e.g. “as hot as an oven”, “as humid as a jungle”, “as big as a mountain”, etc.). Though largely successful in their efforts, Veale and Hao were surprised to discover that up to 20% of their web-harvested similes were ironic (examples include “as subtle as a freight train”, “as tanned as an Irishman”, “as sober as a Kennedy”, “as private as a park bench”, etc.). Ironic similes were initially filtered manually, since irony is the worst kind of noise one can have when acquiring knowledge from text. Hao and Veale (2010) subsequently report good results for an automatic, Web-based approach to distinguishing ironic from non-ironic similes. However, this approach exploits certain properties of similes and is not directly transferable to the detection of irony in general language. Reyes, Rosso and Veale (2013) and Reyes, Rosso and Buscaldi (2012) report a more general approach that uses machine learning over a variety of structural and lexical features of a micro-text to detect humorousness and irony.
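To make the “as X as Y” pattern concrete, the following is a minimal sketch of how such simile constructions might be pulled out of raw text with a regular expression. It is not the harvesting procedure of Veale and Hao (2007); the pattern, the function name `extract_similes` and the sample sentence are all illustrative assumptions.

```python
import re

# Hypothetical pattern (an assumption, not the authors' method):
# "as <property> as a/an <vehicle>", where the vehicle is the run of
# words that follows the article, up to the next punctuation mark.
SIMILE_PATTERN = re.compile(
    r"\bas\s+([a-z]+)\s+as\s+an?\s+([a-z]+(?:\s+[a-z]+)*)",
    re.IGNORECASE,
)


def extract_similes(text):
    """Return a list of (property, vehicle) pairs found in the text."""
    return [
        (match.group(1).lower(), match.group(2).lower())
        for match in SIMILE_PATTERN.finditer(text)
    ]


# Illustrative input built from similes quoted in the text above.
sample = "The room was as hot as an oven. He is as subtle as a freight train, honestly."
similes = extract_similes(sample)
print(similes)
```

A surface pattern of this kind retrieves ironic and non-ironic similes alike (“as subtle as a freight train” matches just as readily as “as hot as an oven”), which is why a separate filtering step is needed before the harvested pairs can be treated as common-sense knowledge.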



The task concerns the classification of tweets containing irony and metaphor. Our trial, training and test data will contain a high concentration of these phenomena, in order to evaluate the degree to which conventional sentiment analysis can handle creative language, and to determine whether systems that explicitly model these phenomena demonstrate a marked increase in competence.

Given a set of tweets that are rich in metaphor and irony, the goal is to determine whether the user has expressed a positive, negative or neutral sentiment in each, and the degree to which this sentiment has been communicated. We will use a fine-grained sentiment scale to capture the effect of irony and figurativity on the perceived sentiment of a tweet, and participating systems must assign sentiment scores from the same fine-grained scale to each of the tweets they are given.



Participating systems will be required to provide a fine-grained sentiment score (between -5 and +5) for each tweet in the test set. These predicted scores will be compared to the weighted average of scores provided by human annotators (elicited via the crowd-sourcing platform CrowdFlower).
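As an illustration of the arithmetic behind a weighted average of annotator scores, here is a minimal sketch. The task description does not say how the weights are derived, so the per-annotator weights below are purely hypothetical, as are the scores and the function name `weighted_average`.

```python
def weighted_average(scores, weights):
    """Weighted mean of annotator scores; the weights need not sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical annotations for one tweet on the -5...+5 scale.
annotator_scores = [-4, -5, -3]
# Hypothetical weights (an assumption; the real weighting is not specified).
annotator_weights = [0.9, 0.8, 0.5]

gold_score = weighted_average(annotator_scores, annotator_weights)
print(round(gold_score, 2))
```

Because a weighted average of values in the range -5...+5 also lies in that range, the resulting gold-standard score is directly comparable to a system's prediction, though it will usually not be a whole number.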

A vector-space measure will be used to evaluate the similarity of each participating system's predictions to the human-annotated gold standard. The list of expected gold-standard sentiment scores will be used to construct a normalized gold-standard vector, while a comparable vector will be constructed from the predictions of a participating system. The cosine similarity between the two vectors will then be used as a measure of how well the participating system estimates the gold-standard sentiment scores for the whole of the test set.

Importantly, this means that a system need not replicate the exact scores of the gold standard to score well overall. Evaluation is continuous, not discrete, and is forgiving of minor differences. If two systems consistently predict the wrong scores, but one is consistently closer to the gold standard than the other (e.g. predicting -3.1 for a tweet to which annotators have assigned a weighted score of -4.2), then the system that is consistently closer will obtain the higher evaluation score.
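The vector-based comparison can be sketched as follows. This is a minimal illustration of a cosine measure over score vectors, not the official scoring script; the gold scores and both sets of system predictions are invented for the example, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine_similarity(gold, predicted):
    """Cosine of the angle between two equal-length score vectors."""
    if len(gold) != len(predicted):
        raise ValueError("score lists must have the same length")
    dot = sum(g * p for g, p in zip(gold, predicted))
    norm_gold = math.sqrt(sum(g * g for g in gold))
    norm_pred = math.sqrt(sum(p * p for p in predicted))
    if norm_gold == 0 or norm_pred == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_gold * norm_pred)


# Hypothetical gold-standard scores for five tweets (not real task data).
gold = [-4.2, -2.0, 3.5, -1.0, 0.5]
# System A misses every score, but only by a small margin.
system_a = [-3.1, -1.5, 2.5, -0.5, 1.0]
# System B misses every score by a wider margin.
system_b = [-1.0, 1.0, 0.5, -3.0, 2.5]

score_a = cosine_similarity(gold, system_a)
score_b = cosine_similarity(gold, system_b)
print(f"System A: {score_a:.3f}  System B: {score_b:.3f}")
```

In this invented example neither system reproduces any gold score exactly, yet System A obtains the higher value because its predictions track the gold standard more closely across the whole set. Dividing by the vector lengths is what normalizes the comparison, so the measure reflects the overall pattern of scores rather than exact matches on individual tweets.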


Barnden, J. (2010). Metaphor and metonymy: Making the connections more slippery. Cognitive Linguistics 21(1): 1-34.

Hao, Y., Veale, T. (2010). An Ironic Fist in a Velvet Glove: Creative Mis-Representation in the Construction of Ironic Similes. Minds and Machines 20(4): 635-650.

Reyes, A., Rosso, P. (2013). On the Difficulty of Automatically Detecting Irony: Beyond a Simple Case of Negation. Knowledge and Information Systems. DOI: 10.1007/s10115-013-0652-8.

Reyes, A., Rosso, P., Veale, T. (2013). A Multidimensional Approach for Detecting Irony in Twitter. Language Resources and Evaluation 47(1): 239-268.

Reyes, A., Rosso, P. (2012). Making Objective Decisions from Subjective Data: Detecting Irony in Customers Reviews. Journal on Decision Support Systems 53(4): 754-760.

Reyes, A., Rosso, P., Buscaldi, D. (2012). From Humor Recognition to Irony Detection: The Figurative Language of Social Media. Data & Knowledge Engineering 74: 1-12.

Shutova, E., Sun, L., Korhonen, A. (2010). Metaphor identification using verb and noun clustering. Proceedings of the 23rd International Conference on Computational Linguistics.

Veale, T., Keane, M. T. (1992). Conceptual Scaffolding: A spatially founded meaning representation for metaphor comprehension. Computational Intelligence 8(3): 494-519.

Veale, T. (2012). Detecting and Generating Ironic Comparisons: An Application of Creative Information Retrieval. AAAI Fall Symposium Series 2012, Artificial Intelligence of Humor. Arlington, Virginia.

Veale, T., Hao, Y. (2007). Comprehending and Generating Apt Metaphors: A Web-driven, Case-based Approach to Figurative Language. In Proceedings of AAAI 2007, the 22nd AAAI Conference on Artificial Intelligence. Vancouver, Canada.

Contact Info


  • John Barnden, University of Birmingham, UK
  • Antonio Reyes, Superior Institute of Interpreters and Translators
  • Ekaterina Shutova, ICSI, UC Berkeley
  • Paolo Rosso, Technical University of Valencia
  • Tony Veale, University College Dublin

Other Info


  • Note: the dates for the evaluation period for SemEval-2015 have changed! (Dec. 5 -- 20, 2014)
  • Training data for this task (8000 figurative tweets annotated with sentiment scores in the range -5...+5) is now available.
  • Trial data for this task (1000 figurative tweets annotated with sentiment scores in the range -5...+5) is now available.
  • Follow @MetaphorMagnet -- a Twitterbot that uses metaphor theory to automatically generate novel metaphors