Data, Evaluation, and Results
Training and test data are provided with stance gold labels and additional labels such as 'target of opinion' and 'sentiment'. These additional annotations were not part of the SemEval-2016 competition, but are made available for future research. Details about this dataset are available in this paper:
Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. 2016. Stance and sentiment in tweets. Special Section of the ACM Transactions on Internet Technology on Argumentation in Social Media, In press.
- Test data with gold labels
- TEST DATA without gold labels (Submissions are due Jan. 18, 2016, 11:59PM Pacific Standard Time (GMT-8))
- Trial data (readme) - can be used for training
- Training data for Task A (readme)
Domain corpus for Task B (note: it takes about 5 days to download the tweets in the corpus)
- An Interactive Visualization of the Stance Dataset is now available. It shows various statistics about the data.
- Note that it also shows sentiment and target of opinion annotations (in addition to stance).
- Clicking on various visualization elements filters the data. For example, clicking on 'Feminism' and 'Favor' will filter all sub-visualizations to show information pertaining to only those tweets that express favor towards feminism. You can also use the checkboxes on the left to view only test or training data, or data on particular targets.
Instructions to Annotators
- We used this questionnaire to obtain annotations
- Annotators were restricted to those living in the USA
- For the classification tasks A and B, options 3 and 4 listed in the questionnaire were collapsed into one class NONE (neither favor nor against)
- We will use the macro-average of F-score(FAVOR) and F-score(AGAINST) as the bottom-line evaluation metric.
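The metric above can be sketched as follows. This is a minimal illustration (not the official evaluation script): it computes per-class F-scores for FAVOR and AGAINST and averages them. Note that NONE is not averaged into the metric, but misclassifying NONE tweets still lowers the precision of the other two classes.

```python
def f_score(gold, pred, label):
    # Precision, recall, and F1 for a single stance label.
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    pred_n = sum(1 for p in pred if p == label)
    gold_n = sum(1 for g in gold if g == label)
    precision = tp / pred_n if pred_n else 0.0
    recall = tp / gold_n if gold_n else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_f(gold, pred):
    # Bottom-line metric: mean of F(FAVOR) and F(AGAINST).
    return (f_score(gold, pred, "FAVOR") + f_score(gold, pred, "AGAINST")) / 2

gold = ["FAVOR", "AGAINST", "NONE", "FAVOR"]
pred = ["FAVOR", "AGAINST", "FAVOR", "NONE"]
print(macro_f(gold, pred))  # 0.75
```

Here F(FAVOR) = 0.5 (one of two gold FAVOR tweets recovered, one false positive) and F(AGAINST) = 1.0, giving a macro-average of 0.75.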
Evaluation Script v2 (last updated: January 11th, 2016) (the same script can be used for both Task A and Task B)
You can use it to:
-- check the format of your submission file
-- determine performance when gold labels are available (note that you can also use the script to determine performance on a held out portion of the training data to gauge your system's progress)
- A separate evaluation script to determine scores on the following subsets of the test set: (a) where the given target of interest is the same as the target of opinion in the tweet, and (b) where the given target of interest is *not* the target of opinion in the tweet. NOTE: This script was not part of the official competition, and is provided only as a means for further analysis of the results.
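The subset analysis above amounts to partitioning the test set by whether the annotated target of opinion matches the given target of interest. A minimal sketch, assuming a hypothetical record format in which each example carries 'target' (the given target of interest) and 'opinion_target' (the annotated target of opinion) fields:

```python
def split_by_opinion_target(examples):
    # Hypothetical fields, for illustration only:
    # 'target' = the given target of interest,
    # 'opinion_target' = the annotated target of opinion in the tweet.
    same = [ex for ex in examples if ex["opinion_target"] == ex["target"]]
    other = [ex for ex in examples if ex["opinion_target"] != ex["target"]]
    return same, other

examples = [
    {"target": "Feminism", "opinion_target": "Feminism"},
    {"target": "Feminism", "opinion_target": "Hillary Clinton"},
]
same, other = split_by_opinion_target(examples)
```

The macro-average F-score can then be computed separately on each subset to see how much harder stance detection is when the opinion is expressed towards a different target.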