Data and Tools


SemEval 2016 Test Data
(includes Gold Standard and Evaluation Script)

STS Core (English Monolingual subtask) - test data with gold labels (released Feb 25th, 2016) New!
Cross-lingual STS (English-Spanish) - test data with gold labels (released Feb 24th, 2016) New!


SemEval 2016 Evaluation Data

STS Core (English Monolingual subtask) (released Jan 25th, 2016; fixes Jan 27th)

Cross-lingual STS (English-Spanish) Part 1 (News subset) (released Jan 26th, 2016) 

Cross-lingual STS (English-Spanish) Part 2 (released Jan 29th, 2016; fixes late Jan 29th) 


Training and Trial Data


  • STS Core - English monolingual subtask:
    • All pairs released during prior STS evaluations are available as trial and training data.
  • Cross-lingual STS - English/Spanish subtask:

STS Core


Nearly 14,000 sentence pairs with gold, human-annotated STS labels are linked above. An additional 750 machine-translation pairs are available through the Linguistic Data Consortium (LDC2013T18).


For the 2015 task, 5,500 more pairs were annotated than were used in the official task evaluation. The raw crowdsourced annotations and the script used to build the evaluation set from them are also linked above. The additional pairs have noisier labels than the 2015 test data, since annotator agreement was one of the criteria used to filter the test set.
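The exact criteria used by the released script are not described here, but agreement-based filtering of crowdsourced STS labels can be sketched as follows. The `max_stdev` threshold and the data layout are illustrative assumptions, not the task's actual settings:

```python
import statistics

def filter_by_agreement(pairs, max_stdev=1.0):
    """Keep pairs whose crowd annotations agree closely.

    pairs: list of (sentence1, sentence2, [annotator scores on 0-5]).
    max_stdev: hypothetical agreement threshold; the official script's
    criteria may differ.
    """
    kept = []
    for s1, s2, scores in pairs:
        if statistics.stdev(scores) <= max_stdev:
            # Gold label: mean of the individual annotations.
            kept.append((s1, s2, statistics.mean(scores)))
    return kept

pairs = [
    ("A man plays guitar.", "A man is playing a guitar.", [5, 5, 4, 5]),
    ("A dog runs.", "Markets fell sharply today.", [0, 3, 1, 4]),  # noisy labels
]
filtered = filter_by_agreement(pairs)  # drops the second, low-agreement pair
```

Pairs that survive the threshold get the mean crowd score as their gold label; the rest are set aside as noisier supplementary data.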


Evaluation Data for 2016



The evaluation data is drawn from the following disclosed sources: the Corpus of Plagiarised Short Answers (Clough and Stevenson 2011), Stack Exchange Q&A forums (see the Data Dump or Data Explorer), the Europe Media Monitor (EMM) (Best et al. 2005), and the WMT quality estimation shared task (Callison-Burch et al. 2012). Some evaluation data may be drawn from sources not listed here. As in prior years, we will report performance both in aggregate across all data sources and on each individual data source.


Each evaluation set will include between 250 and 500 sentence pairs, roughly balanced across STS scores. This is down from the 750 pairs per dataset typical of prior years, in order to allow more aggressive filtering of pairs with low annotator agreement. The evaluation data will also be heuristically filtered to remove pairs whose STS scores are trivial to compute using string edit distance or bag-of-words semantic representations. Watch the task e-mail list for more details.
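One way to see what "trivial to compute" means: a pair is uninformative if surface features alone pin down its score, e.g. near-duplicate strings or sentences sharing no words at all. The sketch below is an illustrative stand-in, not the task's actual heuristic; it uses `difflib`'s character-level ratio in place of a normalized edit distance and raw token overlap as the bag-of-words signal, with a made-up threshold:

```python
import difflib

def is_trivial(s1, s2, dup_threshold=0.9):
    """Flag pairs whose similarity is obvious from surface features alone.

    dup_threshold is a hypothetical cutoff; the official filtering
    heuristics and thresholds may differ.
    """
    # Character-level similarity (proxy for normalized edit distance).
    char_sim = difflib.SequenceMatcher(None, s1.lower(), s2.lower()).ratio()
    # Bag-of-words overlap: fraction of shared tokens (Jaccard).
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    word_sim = len(t1 & t2) / len(t1 | t2)
    # Near-duplicates or fully word-disjoint pairs carry little signal.
    return char_sim >= dup_threshold or word_sim == 0.0

is_trivial("A man plays the guitar.", "A man plays the guitar.")   # identical: True
is_trivial("A man plays guitar.", "A woman is playing the flute.") # non-trivial: False
```

Pairs flagged this way would be dropped before annotation, so that the remaining data actually tests semantic similarity rather than string matching.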





The Semantic Textual Similarity Wiki details previous tasks and lists open-source systems and tools.




Clive Best, Erik van der Goot, Ken Blackler, Teofilo Garcia and David Horby. 2005. Europe Media Monitor - System description. In EUR Report 22173-En, Ispra, Italy. [pdf]


Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of WMT 2012. [pdf]


Paul Clough and Mark Stevenson. 2011. Developing a corpus of plagiarised short answers. Language Resources and Evaluation, 45(1). [pdf]

Contact Info

STS Core

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre

Cross-lingual STS

Carmen Banea, Daniel Cer, Rada Mihalcea, Janyce Wiebe

Wiki: STS Wiki
Discussion Group: STS-semeval

Other Info


  • The official cross-lingual STS results have been posted! New!
  • The gold standard cross-lingual STS files have been released! New!