Task Guidelines
Distributional Semantic Models (DSMs) approximate the meaning of words with vectors summarizing their patterns of co-occurrence in corpora. Recently, several compositional extensions of DSMs (Compositional DSMs, or CDSMs) have been proposed, with the purpose of representing the meaning of phrases and sentences by composing the distributional representations of the words they contain.
The goal of the task is to evaluate CDSMs on a new data set (SICK - Sentences Involving Compositional Knowledge), which includes a large number of sentence pairs that are rich in the lexical, syntactic and semantic phenomena that CDSMs are expected to account for, but that do not require dealing with aspects of existing sentential data sets (multiword expressions, named entities, telegraphic language) that lie outside the domain of compositional distributional semantics.
THE DATA SET
The SICK data set consists of 10,000 English sentence pairs, built starting from two existing paraphrase sets: the 8K ImageFlickr data set (http://nlp.cs.illinois.edu/HockenmaierGroup/data.html) and the SEMEVAL-2012 Semantic Textual Similarity Video Descriptions data set (http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data). Each sentence pair was annotated for relatedness in meaning. This score provides a direct way to evaluate CDSMs, insofar as their outputs are meant to quantify the degree of semantic relatedness between sentences. Since detecting the presence of entailment is one of the traditional benchmarks of a successful semantic system, each pair is also annotated for the entailment relation between the two elements.
The Training Set and the Test Set will each consist of 5,000 sentence pairs.
The Training Set will be delivered as a TAB delimited text file structured as follows:
"Pair ID" "sentence A" "sentence B" "semantic relatedness gold label" "textual entailment gold label".
The Test Set will have the same format as the Training Set except for the gold label fields which will be left empty.
Trial data are available at the Data and Tools page.
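For concreteness, the following is a minimal sketch of how the tab-delimited Training Set described above could be read in Python. The file name "SICK_train.txt" and the presence of a header line are assumptions for illustration, not part of the official distribution.

```python
import csv

# Minimal sketch: load the tab-delimited Training Set into a list of dicts.
# The file name "SICK_train.txt" is a placeholder; whether the distributed
# file starts with a header line is an assumption, so adjust if needed.
def read_sick(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header line, if present
        for pair_id, sent_a, sent_b, relatedness, entailment in reader:
            pairs.append({
                "pair_ID": pair_id,
                "sentence_A": sent_a,
                "sentence_B": sent_b,
                "relatedness_score": float(relatedness),
                "entailment_judgment": entailment,
            })
    return pairs

# pairs = read_sick("SICK_train.txt")
```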
THE TASK
This challenge involves two sub-tasks:
- predicting the degree of relatedness between two sentences
- detecting the entailment relation holding between them
Participants can submit system runs for one or both sub-tasks.
While we especially encourage developers of CDSMs to test their methods on this benchmark, developers of other kinds of systems that can tackle sentence relatedness or entailment tasks (e.g., full-fledged RTE systems) are also welcome to submit their output. Besides being of intrinsic interest, the latter systems' performance will serve to situate CDSM performance within the broader landscape of computational semantics.
SEMANTIC RELATEDNESS SUB-TASK
Given two sentences, systems are required to produce a relatedness score indicating the extent to which the two sentences express related meanings.
Examples:
Sentence A: A man is jumping into an empty pool
Sentence B: There is no biker jumping in the air
Relatedness score: 1.6
Sentence A: Two children are lying in the snow and are making snow angels
Sentence B: Two angels are making snow on the lying children
Relatedness score: 2.9
Sentence A: The young boys are playing outdoors and the man is smiling nearby
Sentence B: There is no boy playing outdoors and there is no man smiling
Relatedness score: 3.6
Sentence A: A person in a black jacket is doing tricks on a motorbike
Sentence B: A man in a black jacket is doing tricks on a motorbike
Relatedness score: 4.9
Note that the Gold relatedness score is calculated as the average of ten human ratings collected for each pair, and can range from 1 (completely unrelated) to 5 (very related).
Evaluation:
Participants will submit the scores produced by their system for all test sentence pairs. Systems will be primarily ranked on the basis of the Pearson correlation between their scores and the gold standard ratings. Additional evaluation will be based on the mean squared error (computed on standardized scores) and the Spearman correlation.
The correlation of the human ratings with normalized word overlap scores will be provided as a baseline.
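As an illustration, the sketch below computes the evaluation measures (Pearson, Spearman, and mean squared error on standardized scores) and a simple normalized word-overlap score. The Jaccard-style normalization used here is only one plausible choice and is not necessarily the one used for the official baseline.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def standardize(x):
    # z-score standardization, used before computing the MSE
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def relatedness_metrics(system_scores, gold_scores):
    """Pearson correlation (primary measure), Spearman correlation,
    and mean squared error computed on standardized scores."""
    pearson = pearsonr(system_scores, gold_scores)[0]
    spearman = spearmanr(system_scores, gold_scores)[0]
    mse = np.mean((standardize(system_scores) - standardize(gold_scores)) ** 2)
    return pearson, spearman, mse

def word_overlap(sentence_a, sentence_b):
    """Normalized word overlap between two sentences; the Jaccard-style
    normalization is an assumption, not the official baseline definition."""
    a = set(sentence_a.lower().split())
    b = set(sentence_b.lower().split())
    return len(a & b) / len(a | b)
```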
TEXTUAL ENTAILMENT SUB-TASK
Given two sentences A and B, systems must determine whether the meaning of B is entailed by (can be inferred from) A. According to the standard definition of Textual Entailment, A entails B if, typically, a human reading A would infer that B is most likely true.
In particular, systems are required to decide whether:
- A entails B (ENTAILMENT judgment)
- A contradicts B (CONTRADICTION judgment)
- The truth of B cannot be determined on the basis of A (NEUTRAL judgment)
Examples:
Sentence A: Two teams are competing in a football match
Sentence B: Two groups of people are playing football
Entailment judgment: ENTAILMENT
Sentence A: The brown horse is near a red barrel at the rodeo
Sentence B: The brown horse is far from a red barrel at the rodeo
Entailment judgment: CONTRADICTION
Sentence A: A man in a black jacket is doing tricks on a motorbike
Sentence B: A person is riding the bicycle on one wheel
Entailment judgment: NEUTRAL
Note that the Gold entailment label is calculated as the majority label of five human judgments collected for each pair.
Evaluation:
Participants will submit the entailment labels predicted by their system for each test sentence pair and will be evaluated in terms of classification accuracy.
Standard baselines will be provided: random, probabilistic (average accuracy of N random label assignments sampled from the gold label distribution), and majority.
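The following is a minimal sketch of the accuracy measure and of the majority and probabilistic baselines, under the assumption that the probabilistic baseline averages accuracy over N random label assignments drawn from the gold label distribution (the random baseline would be the same with uniform label probabilities).

```python
import random
from collections import Counter

def accuracy(predicted, gold):
    """Classification accuracy over aligned label lists."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def majority_baseline(gold):
    """Assign the most frequent gold label to every pair."""
    majority = Counter(gold).most_common(1)[0][0]
    return accuracy([majority] * len(gold), gold)

def probabilistic_baseline(gold, n=1000, seed=0):
    """Average accuracy of n random label assignments sampled from
    the gold label distribution (assumed interpretation)."""
    rng = random.Random(seed)
    labels, counts = zip(*Counter(gold).items())
    scores = [accuracy(rng.choices(labels, weights=counts, k=len(gold)), gold)
              for _ in range(n)]
    return sum(scores) / n
```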
SYSTEM SUBMISSIONS
Participants can submit up to five runs for any of the proposed sub-tasks and must indicate the run considered as primary.
Results are to be submitted as one file per run.
The run file name must contain the official team name (as assigned by the SemEval organisers) and the run number. Moreover, the primary run must be indicated (e.g., "TeamName_run1primary.txt"; "TeamName_run2.txt").
No partial submissions are allowed, i.e. the submission must cover all the Test Set sentence pairs.
Each run submission must be a text file containing 3 tab-delimited columns:
- pair_ID (the IDs, which must match those in the test data)
- entailment_judgment (predictions of your system for the entailment sub-task; possible values: ENTAILMENT, CONTRADICTION, NEUTRAL)
- relatedness_score (numerical predictions of your system for the sentence relatedness sub-task)
Note that the first line of the file must be a "header" naming the 3 columns exactly with the 3 strings above (pair_ID, entailment_judgment and relatedness_score).
The order of the columns and rows does not matter but the ids must match those in the Test Set.
If you do not participate in the entailment or the relatedness sub-task, please provide a column of NA values (that is, literally enter the string NA in each row of that column).
Example:
- Semantic Relatedness sub-task only: "Pair ID" "SR predicted score" "NA"
- Textual Entailment sub-task only: "Pair ID" "NA" "TE predicted judgment"
- Both sub-tasks: "Pair ID" "SR predicted score" "TE predicted judgment"
Remember that, for each sub-task you want to be evaluated on, you must provide a value for every test pair: if your submission file does not contain all the Test Set IDs, or if your scores contain NAs or missing values, the corresponding sub-task will NOT be evaluated.
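For illustration, here is a minimal sketch of writing a run file in the required format. The file name and the predictions are placeholders to be replaced with your team name, run number, and system output.

```python
import csv

def write_run(path, predictions):
    """predictions: iterable of (pair_ID, entailment_judgment, relatedness_score)
    tuples; use the literal string "NA" for a sub-task you do not enter."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["pair_ID", "entailment_judgment", "relatedness_score"])
        writer.writerows(predictions)

# Hypothetical relatedness-only run:
# write_run("TeamName_run1primary.txt", [("1", "NA", 3.7), ("2", "NA", 4.2)])
```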