Task Description: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment





Distributional Semantic Models (DSMs) approximate the meaning of words with vectors summarizing their patterns of co-occurrence in corpora. Recently, several compositional extensions of DSMs (Compositional DSMs, or CDSMs) have been proposed, with the purpose of representing the meaning of phrases and sentences by composing the distributional representations of the words they contain (e.g., [1], [2], [4], [5]). Despite the ever-increasing interest in the field, the development of adequate benchmarks for CDSMs, especially at the sentence level, is still lagging behind. Existing data sets, such as those introduced by [3] and [2], are limited to a few hundred instances of very short sentences with a fixed structure. On the other hand, in the last ten years, several large data sets have been developed for various computational semantics tasks, such as Semantic Text Similarity (STS) or Recognizing Textual Entailment (RTE). Working with such data sets, however, requires dealing with issues that CDSMs are not expected to handle, such as identifying multiword expressions, recognizing named entities or accessing encyclopedic knowledge. CDSMs should instead be evaluated on data sets involving difficulties associated with semantic compositionality (e.g., contextual synonymy and other lexical variation phenomena, active/passive and other syntactic alternations, impact of negation, determiners and other grammatical elements), which do not necessarily occur frequently in, e.g., the STS and RTE data sets.
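To make the notion of composition concrete, the following is a minimal sketch of one simple composition function, vector addition in the spirit of the additive models discussed in [3] and [4], followed by a cosine comparison of the resulting sentence vectors. The three-dimensional vectors, the vocabulary and the function names (`compose_additive`, `cosine`) are invented for illustration; they are not derived from any corpus and are not part of the task materials.

```python
import math

# Toy 3-dimensional "distributional" vectors. The numbers are invented
# for illustration only and do not come from any corpus.
TOY_VECTORS = {
    "dog": [0.9, 0.1, 0.3],
    "puppy": [0.8, 0.2, 0.4],
    "runs": [0.1, 0.9, 0.2],
    "sleeps": [0.2, 0.1, 0.9],
}


def compose_additive(words, vectors=TOY_VECTORS):
    """Build a sentence vector by summing the vectors of its words."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


s1 = compose_additive(["dog", "runs"])
s2 = compose_additive(["puppy", "runs"])
s3 = compose_additive(["dog", "sleeps"])

# With these toy vectors, the pair differing only by a near-synonym
# scores higher than the pair differing by the verb.
print(cosine(s1, s2), cosine(s1, s3))
```

Addition ignores word order and grammatical structure, which is one reason why phenomena such as negation or active/passive alternations are informative test cases for more elaborate composition functions.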

With these considerations in mind, we developed SICK (Sentences Involving Compositional Knowledge), a data set aimed at filling this gap. It includes a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but that do not require dealing with other aspects of existing sentential data sets (idiomatic multiword expressions, named entities, telegraphic language) that are not within the scope of compositional distributional semantics. Moreover, we distinguished between generic semantic knowledge about general concept categories (such as knowledge that a couple is formed by a bride and a groom) and encyclopedic knowledge about specific instances of concepts (e.g., knowing that the current president of the US is Barack Obama). The SICK data set contains many examples of the former, but none of the latter.





The SICK data set consists of 10,000 English sentence pairs, each annotated for relatedness in meaning. The sentence relatedness score provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences. Since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system, each pair is also annotated for the entailment relation between the two elements.
This SemEval challenge thus involves two sub-tasks:

  • predicting the degree of relatedness between two sentences
  • detecting the entailment relation holding between them

Participants can submit system runs for one or both sub-tasks. While we especially encourage developers of CDSMs to test their methods on SICK, developers of other kinds of systems that can tackle sentence relatedness or entailment tasks (e.g., full-fledged RTE systems) are also welcome to submit their output. Besides being of intrinsic interest, the latter systems' performance will serve to situate CDSM performance within the broader landscape of computational semantics.
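As a rough illustration of how output for the two sub-tasks might be compared against gold annotations, the sketch below computes a Pearson correlation for relatedness scores and a simple accuracy for entailment labels. These metrics, the label strings and all the numbers are assumptions made for the example; the official measures and data format are defined by the Task Guidelines and the released evaluation script, not by this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical gold annotations and system output for four sentence pairs.
gold_scores = [4.5, 3.2, 1.0, 2.8]
system_scores = [4.1, 3.0, 1.5, 3.3]

# Label names are placeholders chosen for this example.
gold_labels = ["ENTAILMENT", "NEUTRAL", "CONTRADICTION", "NEUTRAL"]
system_labels = ["ENTAILMENT", "NEUTRAL", "NEUTRAL", "NEUTRAL"]

print("relatedness (Pearson r):", pearson(gold_scores, system_scores))
print("entailment (accuracy):", accuracy(gold_labels, system_labels))
```

Because the two sub-tasks are scored independently in this sketch, a system that produces only relatedness scores or only entailment labels can still be evaluated on the sub-task it covers.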


For further information please refer to the Task Guidelines.


If you are interested in the task, please join the Task discussion group.


You can register for the task at the SemEval 2014 website.




[1] Baroni, Marco and Roberto Zamparelli, 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP. Boston, MA.

[2] Grefenstette, Edward and Mehrnoosh Sadrzadeh, 2011. Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP. Edinburgh, UK.

[3] Mitchell, Jeff and Mirella Lapata, 2008. Vector-based models of semantic composition. In Proceedings of ACL. Columbus, OH.

[4] Mitchell, Jeff and Mirella Lapata, 2010. Composition in distributional models of semantics. Cognitive Science, 34(8): 1388–1429.

[5] Socher, Richard, Brody Huval, Christopher Manning, and Andrew Ng, 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP. Jeju Island, Korea.





Contact Info


  • Marco Marelli, University of Trento, Italy
  • Stefano Menini, Fondazione Bruno Kessler, Italy
  • Marco Baroni, University of Trento, Italy
  • Luisa Bentivogli, Fondazione Bruno Kessler, Italy
  • Raffaella Bernardi, University of Trento, Italy
  • Roberto Zamparelli, University of Trento, Italy

Email: marco.marelli@unitn.it

Other Info


  • We have released the primary runs submitted by participants. You can find them on the 'Data and tools' page.
  • We have released the general results summarizing the performance of the primary runs of all participating systems. You can find them on the 'Results' page.
  • We have released the gold scores of the test set. They can be downloaded from the 'Data and tools' page.
  • The test data have been released: you can find them on the 'Data and tools' page.
  • We uploaded a new version of the evaluation script: you can find it on the 'Data and tools' page.
  • We released a script for computing baselines: you can find it on the 'Data and tools' page.
  • The train data and the evaluation script have been released: you can find them on the 'Data and tools' page.
  • You can now register for the task at the 'SemEval 2014 website'.
  • Please note that we released a new version of the trial data on December 6th, after Johan Bos pointed out that the earlier release contained some repeated items (thanks Johan!). Nothing fundamental should hinge on the difference.
  • Trial data released: you can find them on the 'Data and tools' page.