Data and Tools
SemEval 2016 Test Data
(includes Gold Standard and Evaluation Script)
STS Core (English Monolingual subtask) - test data with gold labels (released Feb 25th, 2016) New!
Cross-lingual STS (English-Spanish) - test data with gold labels (released Feb 24th, 2016) New!
SemEval 2016 Evaluation Data
STS Core (English Monolingual subtask) (released Jan 25th, 2016; fixes Jan 27th)
Cross-lingual STS (English-Spanish) Part 1 (News subset) (released Jan 26th, 2016)
Cross-lingual STS (English-Spanish) Part 2 (released Jan 29th, 2016; fixes late Jan 29th)
Training and Trial Data
- STS Core - English monolingual subtask:
- Cross-lingual STS - English/Spanish subtask:
  - Trial data: Spanish-English STS Trial Pairs.
STS Core
Nearly 14 thousand sentence pairs with gold human-annotated STS labels are linked above! An additional 750 machine translation pairs are available through the Linguistic Data Consortium (LDC2013T18).
For the 2015 task, 5,500 more pairs were annotated than were used in the official task evaluation. The raw crowdsourced annotations and the script used to create the evaluation set from them are also linked above. The additional pairs have noisier labels than the 2015 test data, since annotator agreement was one of the filtering criteria.
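The official STS evaluation compares system scores against gold labels like these using Pearson correlation. A minimal scoring sketch, assuming plain-text files with one score per line; the actual evaluation script linked above may expect a different interface:

# Hedged sketch: Pearson correlation between system and gold STS scores.
# File layout (one numeric score per line) is an assumption for illustration.
import math

def read_scores(path):
    with open(path, encoding="utf8") as f:
        return [float(line.strip()) for line in f if line.strip()]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Example: pearson(read_scores("sys_scores.txt"), read_scores("gold_scores.txt"))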
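As an illustration of the agreement-based filtering described above, here is a hedged sketch, assuming a tab-separated file with a sentence pair followed by one column per annotator score (0-5); the column layout, threshold, and file handling are hypothetical and do not reproduce the actual script linked above:

# Hypothetical sketch of building a gold set from raw crowdsourced annotations:
# keep pairs whose annotators roughly agree, and average their scores.
import csv
import statistics

AGREEMENT_THRESHOLD = 1.0  # assumed: max tolerated std. dev. across annotators

def build_gold_set(raw_path, gold_path):
    kept, dropped = 0, 0
    with open(raw_path, newline="", encoding="utf8") as fin, \
         open(gold_path, "w", newline="", encoding="utf8") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for row in reader:
            sent1, sent2, *scores = row
            scores = [float(s) for s in scores]
            # Filter on annotator agreement, then average to a gold label.
            if statistics.stdev(scores) <= AGREEMENT_THRESHOLD:
                writer.writerow([sent1, sent2, round(statistics.mean(scores), 2)])
                kept += 1
            else:
                dropped += 1
    return kept, dropped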
Evaluation Data for 2016
The evaluation data is being drawn from the following disclosed sources: Corpus of Plagiarised Short Answers (Clough and Stevenson 2011), Stack Exchange Q&A Forums (see Data Dump or Data Explorer), Europe Media Monitor (EMM) (Best et al. 2005), and the WMT quality estimation shared task (Callison-Burch et al. 2012). Some evaluation data may be drawn from sources not listed here. As in prior years, we will report performance both in aggregate across all data sources and on each individual data source.
Each evaluation set will include between 250 and 500 sentence pairs, roughly balanced across STS scores. This is a reduction from the 750 pairs per dataset typical of prior years' releases, allowing more aggressive filtering of pairs with low annotator agreement. The evaluation data will also be heuristically filtered to remove pairs whose STS scores are trivial to compute using string edit distance or bag-of-words semantic representations. Watch the task e-mail list for more details.
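A minimal sketch of such a heuristic filter, using normalized string similarity as an edit-distance proxy and token-set overlap as a bag-of-words signal; the measures and thresholds are illustrative assumptions, not the organizers' actual procedure:

# Hedged sketch: drop pairs whose similarity is trivial to recover from
# surface measures (near-identical strings or near-complete token overlap).
from difflib import SequenceMatcher

def char_similarity(a, b):
    # Ratio in [0, 1] based on longest matching blocks (edit-distance proxy).
    return SequenceMatcher(None, a, b).ratio()

def token_overlap(a, b):
    # Jaccard overlap of lowercased token sets: a simple bag-of-words signal.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_trivial(sent1, sent2, char_thresh=0.85, overlap_thresh=0.9):
    # Thresholds here are assumptions for illustration only.
    return (char_similarity(sent1, sent2) >= char_thresh
            or token_overlap(sent1, sent2) >= overlap_thresh)

pairs = [("A man is playing a guitar.", "A man is playing a guitar."),
         ("A man is playing a guitar.", "Stocks fell sharply on Monday.")]
kept = [p for p in pairs if not is_trivial(*p)]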
Tools
The Semantic Textual Similarity Wiki details previous tasks and open source software systems and tools.
References
Clive Best, Erik van der Goot, Ken Blackler, Teofilo Garcia and David Horby. 2005. Europe Media Monitor - System Description. EUR Report 22173-EN, Ispra, Italy.
Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of WMT 2012.
Paul Clough and Mark Stevenson. 2011. Developing a Corpus of Plagiarised Short Answers. Language Resources and Evaluation, 45(1).