Task 6: Detecting Stance in Tweets

Stance detection can be formulated in different ways. In the context of this task, we define stance detection to mean automatically determining from text whether the author is in favor of the given target, against the given target, or whether neither inference is likely. Consider the target--tweet pair:

Target: legalization of abortion
Tweet:  A foetus has rights too! Make your voice heard.

Humans can deduce from the tweet that the speaker is likely against the target. The aim of the task is to test whether automatic systems can deduce the stance of the tweeter. To successfully detect stance, automatic systems often have to identify relevant bits of information that may not be present in the focus text, for example, that someone who actively supports foetus rights is likely against the right to abortion. We provide a domain corpus pertaining to each of the targets, from which systems can gather information to help with the detection of stance.

Automatically detecting stance has widespread applications in information retrieval, text summarization, and textual entailment. In fact, one can argue that stance detection can often bring complementary information to sentiment analysis, because we often care about the author’s evaluative outlook towards specific targets and propositions rather than simply about whether the speaker was angry or happy.

Twitter and other microblogging sites are popular platforms where people express stance implicitly or explicitly. Thus, here for the first time, we propose a shared task on detecting stance that focuses on the Twitter domain.

TASKS

There are two tasks:

Task A (supervised framework): This task will test stance towards five targets: "Atheism", "Climate Change is a Real Concern", "Feminist Movement", "Hillary Clinton", and "Legalization of Abortion". You are provided with about 2900 labeled training data instances for the five targets.
Task B (weakly supervised framework): This task will test stance towards one target "Donald Trump". You will not be provided with any training data for this target. You are provided with a large set of tweets associated with "Donald Trump" (the domain corpus), but it is not labeled for stance.
You are encouraged to develop unsupervised systems for the targets in Task A so that you can measure progress by using the training data for Task A as a development set. However, Task B evaluation will only deal with "Donald Trump" instances.

You can provide submissions for either one of the tasks, or both tasks.

Classes: The possible stance labels are:

FAVOR: We can infer from the tweet that the tweeter supports the target (e.g., directly or indirectly by supporting someone/something, by opposing or criticizing someone/something opposed to the target, or by echoing the stance of somebody else).
AGAINST: We can infer from the tweet that the tweeter is against the target (e.g., directly or indirectly by opposing or criticizing someone/something, by supporting someone/something opposed to the target, or by echoing the stance of somebody else).
NONE: none of the above.

Submission Format: The test data file will have the same format as the training file, except for the class label which will be shown as "UNKNOWN" for all instances. Replace "UNKNOWN" with the predicted class to create the submission file. You may choose to leave the label for an instance as "UNKNOWN", for example if your classifier is unsure of the stance. This might impact recall, but it may still be better than predicting the wrong class (see evaluation metric).
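
The replacement step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not specify the file layout here, so the sketch assumes tab-separated rows with the class label in the last column, and the column names in the example header (ID, Target, Tweet, Stance) are hypothetical. The function name fill_predictions and the toy predictor are likewise made up for this example.

```python
from typing import Callable, Iterable, List

VALID_LABELS = {"FAVOR", "AGAINST", "NONE", "UNKNOWN"}


def fill_predictions(
    lines: Iterable[str],
    predict: Callable[[List[str]], str],
) -> List[str]:
    """Replace a trailing UNKNOWN label on each row with predict(fields).

    Assumes tab-separated rows whose last column is the class label.
    Rows whose last column is not UNKNOWN (such as a header) pass through.
    """
    out = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[-1] == "UNKNOWN":
            label = predict(fields[:-1])
            if label not in VALID_LABELS:
                raise ValueError(f"unexpected label: {label!r}")
            fields[-1] = label
        out.append("\t".join(fields))
    return out


# Hypothetical test rows; the column layout is an assumption.
test_rows = [
    "ID\tTarget\tTweet\tStance",
    "1\tLegalization of Abortion\tA foetus has rights too! Make your voice heard.\tUNKNOWN",
    "2\tLegalization of Abortion\tGood morning everyone.\tUNKNOWN",
]


def toy_predict(fields: List[str]) -> str:
    # Toy rule for illustration only: abstain unless a cue word appears.
    return "AGAINST" if "foetus" in fields[-1].lower() else "UNKNOWN"


submission = fill_predictions(test_rows, toy_predict)
```

The toy predictor leaves the second row as "UNKNOWN", which mirrors the option of abstaining when a classifier is unsure.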

Evaluation: We will use the macro-average of F-score(FAVOR) and F-score(AGAINST) as the bottom-line evaluation metric. An evaluation script has been provided so that you can:

check the format of your submission file
determine performance when gold labels are available (note that you can also use the script to determine performance on a held out portion of the training data to gauge your system's progress)
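
As a rough illustration of the metric (not a substitute for the provided evaluation script, whose handling of edge cases may differ), the average of F-score(FAVOR) and F-score(AGAINST) can be computed as follows. The function names f_score and stance_score are chosen for this sketch, and returning 0.0 when a class is never predicted or never appears in the gold labels is an assumption.

```python
from typing import Sequence


def f_score(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F-score (harmonic mean of precision and recall) for one class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    n_pred = sum(1 for p in pred if p == label)
    n_gold = sum(1 for g in gold if g == label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def stance_score(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of F-score(FAVOR) and F-score(AGAINST); NONE is not averaged in."""
    return (f_score(gold, pred, "FAVOR") + f_score(gold, pred, "AGAINST")) / 2


gold = ["FAVOR", "AGAINST", "NONE", "AGAINST"]
pred = ["FAVOR", "UNKNOWN", "NONE", "AGAINST"]
score = stance_score(gold, pred)  # F(FAVOR) = 1.0, F(AGAINST) = 2/3
```

In this example the "UNKNOWN" prediction lowers recall for AGAINST but leaves its precision untouched. Note that NONE instances still influence the score: a NONE tweet predicted as FAVOR or AGAINST lowers the precision of that class.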

Each team can make no more than one submission per task. For example, if you are interested in both Task A and Task B, then you can make one submission each for the two tasks. If you are interested only in Task A, then you can make one submission for Task A.

Within a week or two of the end of the evaluation period, we will release the gold data. This will allow you to test outputs from alternative versions of your system. In the system description paper that you will eventually write, we encourage you to make comparisons with alternative runs. (If describing multiple runs, you must clearly mark in the paper which run was submitted to the competition.)

Note that your team's submissions to the Stance task do not limit the number of submissions your team can make to other (non-stance) SemEval tasks.

EVALUATION PERIOD

The Stance task (both Task A and Task B) will have the following evaluation period: Jan 11 (Mon) to Jan 18 (Mon). That is, test data will be released by 12:00AM Pacific Standard Time (GMT-8) Jan 11, and only submissions made by 11:59PM Pacific Standard Time (GMT-8) Jan 18 will be accepted for evaluation.

RESOURCES THAT CAN BE USED

For Task A: You are free to use any available resources. You are also free to create new resources. For example, you are free to poll the Twitter API to collect more tweets pertaining to the targets. However, you will have to clearly outline all the resources you have used at submission. If you use any additional data that is manually labeled for stance towards the targets that are part of this task, or towards entities associated with these targets, then you will be ranked separately from submissions that do not use any stance-labeled data beyond what is provided in the trial and training sets.

For Task B: You are free to use any resources (available or new) as long as you do not use tweets or sentences that are manually labeled for stance. Some very minimal labeling is permitted. For example, manually labeling a handful of hashtags is okay. You will have to clearly outline all the resources you have used at submission.
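
One possible way to make use of a handful of manually labeled hashtags is to propagate their labels to tweets in the unlabeled domain corpus, yielding a noisy training set. The sketch below illustrates that idea only; it is not a method prescribed by the task. The seed hashtags are placeholders, and the names weak_label and build_noisy_training_set are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Placeholder seed hashtags (hypothetical): in practice, a handful of
# hashtags labeled by hand for the target.
SEED_HASHTAGS: Dict[str, str] = {
    "#examplefavortag": "FAVOR",
    "#exampleagainsttag": "AGAINST",
}

HASHTAG_RE = re.compile(r"#\w+")


def weak_label(tweet: str, seeds: Dict[str, str]) -> Optional[str]:
    """Return a noisy label if all seed hashtags in the tweet agree, else None."""
    tags = (t.lower() for t in HASHTAG_RE.findall(tweet))
    labels = {seeds[t] for t in tags if t in seeds}
    return labels.pop() if len(labels) == 1 else None


def build_noisy_training_set(
    corpus: Iterable[str], seeds: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Collect (text, label) pairs from tweets that carry a seed hashtag."""
    pairs = []
    for tweet in corpus:
        label = weak_label(tweet, seeds)
        if label is None:
            continue
        # Remove the seed hashtags so a classifier trained on these pairs
        # cannot simply memorize them.
        text = HASHTAG_RE.sub(
            lambda m: "" if m.group().lower() in seeds else m.group(), tweet
        )
        pairs.append((" ".join(text.split()), label))
    return pairs


corpus = [
    "Great rally today #ExampleFavorTag",
    "No hashtags in this one",
    "#examplefavortag and #exampleagainsttag together",
]
noisy = build_noisy_training_set(corpus, SEED_HASHTAGS)
```

Tweets with no seed hashtag, or with seed hashtags that disagree, are skipped rather than guessed at, which keeps the noisy set smaller but cleaner.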

If you have any questions about the resources that can be used, do not hesitate to ask on the mailing group.

RELATED WORK

Over the last decade, there has been active research in modeling stance. However, most work focuses on congressional debates (Thomas et al., 2006) or debates in online forums (Somasundaran and Wiebe, 2009; Murakami and Raymond, 2010; Anand et al., 2011; Walker et al., 2012; Hasan and Ng, 2013; Sridhar, Getoor, and Walker, 2014), domains in which gold labels can easily be obtained. Faulkner (2014) investigates the problem of detecting document-level argument stance in student essays. Twitter presents a new challenge to the research community since tweets are short, informal, and full of misspellings, shortenings, and slang. Rajadesingan and Liu (2014) aim to identify the stance of Twitter users from their tweets debating a controversial topic. The task we propose aims to detect stance from individual tweets, without relying on the conversational structure that is often present in online debates. Nonetheless, this task has clear overlap with related tasks such as argument mining, sentiment analysis, and textual entailment.

RELATION WITH SENTIMENT ANALYSIS

Stance detection is related to sentiment analysis, but the two have significant differences. In sentiment analysis, systems determine whether a piece of text is positive, negative, or neutral. In stance detection, however, systems are to determine the author's favorability towards a given target. The target may or may not be explicitly mentioned in the text, and the text may express opinion or sentiment about some other entity. For example, consider the target and text pair shown below:

Target: Hillary Clinton
Tweet: Jebb Bush is the only sane candidate for 2016.

The tweet expresses positive opinion towards Jeb Bush, but one can also infer from it that the tweeter is probably against Hillary Clinton. Note that even though it is possible to favor both Jeb Bush and Hillary Clinton, in this task, we ask what is more probable.

We encourage participation of sentiment analysis systems that test the extent to which simple sentiment analysis will work for this task, as well as modified sentiment analysis systems focused on determining stance.

RELATION WITH TEXTUAL INFERENCE/ENTAILMENT

This task can be thought of as a textual inference or entailment task, where the goal is to determine whether favorability towards the target is entailed by the tweet. We encourage participation of such textual inference systems.

REFERENCES

Papers describing the SemEval-2016 stance task, data, participating systems, and baseline systems:

SemEval-2016 Task 6: Detecting Stance in Tweets. Saif M. Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. In Proceedings of the International Workshop on Semantic Evaluation (SemEval ’16). June 2016. San Diego, California.
Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2016. Stance and sentiment in tweets. Special Section of the ACM Transactions on Internet Technology on Argumentation in Social Media, In press.
Detecting Stance in Tweets And Analyzing its Interaction with Sentiment. Parinaz Sobhani, Saif M. Mohammad, and Svetlana Kiritchenko. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*Sem), August 2016, Berlin, Germany.

Papers from before the SemEval-2016 task:

Anand, P., Walker, M., Abbott, R., Tree, J. E. F., Bowmani, R., and Minor, M. 2011. Cats rule and dogs drool!: Classifying stance in online debate. In Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis, WASSA ’11, pages 1–9.
Faulkner, Adam. 2014. Automated Classification of Stance in Student Essays: An Approach Using Stance Target Information and the Wikipedia Link-Based Measure. In Proceedings of the Twenty-Seventh International Flairs Conference.
Schneider, J., Groza, T., and Passant, A. 2013. A review of argumentation for the social semantic web. Semantic Web, 4(2), pages 159-218.
Hasan, K. S., and Ng, V. 2013. Stance classification of ideological debates: Data, models, features, and constraints. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 1348–1356.
Kiritchenko, S., Zhu, X., and Mohammad, S. 2014. Sentiment Analysis of Short Informal Texts. Journal of Artificial Intelligence Research, vol. 50, pages 723-762.
Mohammad, Saif M., Kiritchenko, Svetlana, and Zhu, Xiaodan. 2013. NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets. In Proceedings of the Seventh International Workshop on Semantic Evaluation Exercises (SemEval-2013), June 2013, Atlanta, USA.
Mohammad, Saif M. 2015. Sentiment Analysis: Detecting Valence, Emotions, and Other Affectual States from Text. Emotion Measurement.
Murakami, A., and Raymond, R. 2010. Support or Oppose? Classifying Positions in Online Debates from Reply Activities and Opinion Expressions. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 869–875.
Rajadesingan, Ashwin, and Huan Liu. 2014. Identifying Users with Opposing Opinions in Twitter Debates. Social Computing, Behavioral-Cultural Modeling and Prediction. Springer International Publishing, pages 153-160.
Recasens, M., Danescu-Niculescu-Mizil, C., and Jurafsky, D. 2013. Linguistic Models for Analyzing and Detecting Biased Language. In ACL, pages 1650-1659.
Somasundaran, Swapna and Wiebe, Janyce. 2009. Recognizing stances in online debates. In Proceedings of ACL/AFNLP, pages 226–234.
Sridhar, Dhanya, Getoor, Lise, and Walker, Marilyn. 2014. Collective Stance Classification of Posts in Online Debate Forums. In Proceedings of the Joint Workshop on Social Dynamics and Personal Attributes in Social Media, pages 109-117.
Thomas, M., Pang, B., and Lee, L. 2006. Get out the vote: Determining support or opposition from Congressional floor-debate transcripts. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 327–335.
Walker, M. A.; Anand, P.; Abbott, R.; and Grant, R. 2012. Stance classification using dialogic properties of persuasion. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 592–596.
Wyner, Adam, and Schneider, Jodi. 2012. Arguing from a Point of View. In Proceedings of the First International Conference on Agreement Technologies.
