Semantic Textual Similarity for English

Participants in the task will submit the output of systems developed to measure semantic textual similarity in English. Given two sentences, s1 and s2, participating systems should compute how similar s1 and s2 are, returning a similarity score and an optional confidence score. The annotations and systems use a scale from 0 (no relation) to 5 (semantic equivalence) to indicate the similarity between the two sentences. Systems will be evaluated using the metric traditionally employed for STS systems and used in previous SemEval/*SEM STS evaluations, i.e., the mean Pearson correlation between the system output and the gold standard annotations.
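As an illustration of the evaluation metric, the following Python sketch computes the Pearson correlation for one dataset and then averages it over several datasets. The function names are hypothetical, and the unweighted mean is an assumption: this file does not say whether the official mean is weighted by dataset size.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(per_dataset):
    """Unweighted mean of per-dataset correlations (an assumption).

    per_dataset maps a dataset name to a (system_scores, gold_scores) pair.
    """
    scores = [pearson(system, gold) for system, gold in per_dataset.values()]
    return sum(scores) / len(scores)
```

A system whose scores are a linear rescaling of the gold scores still obtains a correlation of 1, so the metric rewards correct ranking and spacing of pairs rather than exact agreement with the 0 to 5 values.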

Please note the following details:

The trial dataset comprises the 2012 and 2013 datasets, which can be used to develop and train systems.

We include sample data for the test datasets, coming from the following:

1) image description (image)
2) OntoNotes and WordNet sense definition mappings (OnWN)
3) news title and tweet comments (tweet-news)
4) deft discussion forum and news (deft-forum and deft-news)
5) news headlines (headlines)

The trial data is a small subset of the sentence pairs that will be used as test data, with (dummy) gold standard scores. The goal of these samples is to allow participants to have an idea of which kind of sentences will occur in each of the test datasets.

The datasets have been derived as follows:

  • STS.input.image.txt: The Image Descriptions data set is a subset of the PASCAL VOC-2008 data set (Rashtchian et al., 2010). The PASCAL VOC-2008 data set consists of 1,000 images and has been used by a number of image description systems. The image captions of the data set are released under a Creative Commons Attribution-ShareAlike license; the descriptions themselves are free.
  • STS.input.OnWN.txt: The sentences are sense definitions from WordNet and OntoNotes. 5 pairs of sentences.
  • STS.input.tweet-news.txt: The tweet-news data set is a subset of the Linking-Tweets-to-News data set (Guo et al., 2013), which consists of 34,888 tweets and 12,704 news articles.  The tweets are the comments on the news articles.  The news sentences are the titles of news articles.
  • STS.output.deft-news.txt: A subset of news article data in the DEFT DARPA project.
  • STS.output.deft-forum.txt: A subset of discussion forum data in the DEFT DARPA project.
  • STS.input.headlines.txt: We used headlines mined from several news sources by European Media Monitor using the RSS feed.

NOTE: Participant systems should NOT use the following datasets to develop or train their systems:

  • Ontonotes - Wordnet sense aligned definitions.
  • Data released in (Guo et al., 2013).

Input format

The input file consists of two fields separated by tabs:

- first sentence (does not contain tabs)
- second sentence (does not contain tabs)

Please check any of the STS.input.*.txt files for more information about the format.
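A minimal Python sketch for reading such a file is given below. The function name read_pairs is hypothetical, and UTF-8 encoding is an assumption, since this file does not state the encoding.

```python
def read_pairs(path):
    """Return a list of (sentence1, sentence2) tuples from an STS input file."""
    pairs = []
    # Assumption: the file is UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            # Strip only the line terminator so the sentences stay intact.
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) != 2:
                raise ValueError(
                    f"line {line_number}: expected 2 fields, got {len(fields)}"
                )
            pairs.append((fields[0], fields[1]))
    return pairs
```

Because neither sentence contains a tab, splitting each line on the tab character is enough; a line with any other number of fields is reported as malformed.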

Gold Standard

The gold standard contains a score between 0 and 5 for each pair of sentences, with the following interpretation:

(5) The two sentences are completely equivalent, as they mean the same thing. 

      The bird is bathing in the sink. 
      Birdie is washing itself in the water basin.

(4) The two sentences are mostly equivalent, but some unimportant details differ.

      In May 2010, the troops attempted to invade Kabul.
      The US army invaded Kabul on May 7th last year, 2010.

(3) The two sentences are roughly equivalent, but some important information differs or is missing.

      John said he is considered a witness but not a suspect.
      "He is not a suspect anymore." John said.

(2) The two sentences are not equivalent, but share some details.

      They flew out of the nest in groups.
      They flew into the nest together.

(1) The two sentences are not equivalent, but are on the same topic.

      The woman is playing the violin.
      The young lady enjoys listening to the guitar.

(0) The two sentences are on different topics.

      John went horse back riding at dawn with a whole group of friends.
      Sunrise at dawn is a magnificent view to take in if you wake up
      early enough for it.

Format: the gold standard file consists of one single field per line:

- a number between 0 and 5

The gold standard in the test data will be assembled using Amazon Mechanical Turk, gathering 5 scores per sentence pair. The gold standard score will be the average of those 5 scores. In this trial dataset, this is just a dummy number which you can ignore.

Please check any of STS.*.gs.txt

Answer format
The answer format is similar to the gold standard format, but includes an optional confidence score. Each line has two fields separated by a tab:

- a number between 0 and 5 (the similarity score)
- a number between 0 and 100 (the confidence score)

The use of confidence scores is experimental, and it is not required for the official score.

The official score is based on the mean Pearson correlation.
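The answer format can be sketched in Python as follows. The function names are hypothetical. Two choices are assumptions not specified in this file: scores are written with three decimals, and a line without a confidence value contains only the similarity score.

```python
def format_answer_line(score, confidence=None):
    """Format one answer line: similarity score, then optional confidence."""
    if not 0 <= score <= 5:
        raise ValueError("similarity score must be between 0 and 5")
    if confidence is None:
        # Assumption: with no confidence, only the score is written.
        return f"{score:.3f}"
    if not 0 <= confidence <= 100:
        raise ValueError("confidence score must be between 0 and 100")
    return f"{score:.3f}\t{confidence:.0f}"


def write_answers(path, scores, confidences=None):
    """Write one answer line per sentence pair, in input order."""
    if confidences is None:
        confidences = [None] * len(scores)
    if len(confidences) != len(scores):
        raise ValueError("scores and confidences must have the same length")
    with open(path, "w", encoding="utf-8") as handle:
        for score, confidence in zip(scores, confidences):
            handle.write(format_answer_line(score, confidence) + "\n")
```

For example, format_answer_line(4.2, 90) yields the score 4.200 and the confidence 90 separated by a tab.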

Participation in the task
Participant teams will be allowed to submit three runs at most.


Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. Ontonotes: The 90% solution. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL.

Cyrus Rashtchian, Peter Young, Micah Hodosh, and Julia Hockenmaier. 2010. Collecting Image Annotations Using Amazon's Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk.

Weiwei Guo, Hao Li, Heng Ji, and Mona Diab. 2013. Linking Tweets to News: A Framework to Enrich Online Short Text Data in Social Media. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics.


Contact Info

  • STS-En: admin-sts-en{at}googlegroups{dot}com
  • STS-Es: admin-sts-es{at}googlegroups{dot}com

Other Info