Detailed Task Description



Semantic textual similarity (STS) has received an increasing amount of attention in recent years, culminating in the SemEval/*SEM tasks organized in 2012, 2013, 2014 and 2015, which brought together more than 60 participating teams. Please check the pages of the previous tasks for more details.


Given two sentences of text, s1 and s2, the systems participating in STS compute how similar s1 and s2 are, returning a similarity score. Although the score is useful for many tasks, it does not indicate which parts of the sentences are equivalent (or very close) in meaning and which are not.

The 2015 STS task offered a pilot subtask on interpretable STS. With the pilot task we wanted to explore whether STS systems are able to explain WHY they think the two sentences are related / unrelated, adding an explanatory layer to the similarity score. As a first step in this direction, participating systems aligned the segments in one sentence of the pair to the segments in the other sentence, describing what kind of relation existed between each pair of segments.

For 2016, the pilot subtask has been updated into a standalone task, with new training and evaluation sets. If you participated in the STS 2015 interpretable subtask, please check the updates made to the task for 2016.

As this is a new task, our goal is to continue improving and developing it. We welcome feedback from participants on the task design, the annotation guidelines, and the evaluation, and we will use it to guide further development.


General description

Given the input (pairs of sentences), participants first need to identify the chunks in each sentence and then align the corresponding chunks. The chunks are based on those used in the CoNLL-2000 chunking task (Abney, 1991; Tjong Kim Sang and Buchholz, 2000), with some adaptations (see annotation guidelines).

For each alignment, the participants need to specify the following:

(1) A similarity/relatedness score between the aligned chunks, from 5 (maximum similarity/relatedness) to 0 (no relation at all):

  • 5 if the meaning of both chunks is equivalent
  • [4,3] if the meaning of both chunks is very similar or closely related
  • [2,1] if the meaning of both chunks is slightly similar or somehow related
  • 0 if the meaning of both chunks is completely unrelated

(2) Type of the alignment:

  •     EQUI: both chunks are semantically equivalent in the context.
  •     OPPO: the meanings of the chunks are in opposition to each other in the context.
  •     SPE1 and SPE2: both chunks have similar meanings, but the chunk in sentence 1 is more specific than the chunk in sentence 2 (SPE1), or vice versa (SPE2).
  •     SIMI: similar meanings, but no EQUI, OPPO, SPE1, or SPE2.
  •     REL: related meanings, but no SIMI, EQUI, OPPO, SPE1, or SPE2.
  •     NOALI: this chunk has no corresponding chunk in the other sentence.

(3) An optional tag for alignments which show factuality (FACT) or polarity (POL) phenomena.

Regarding the relation between score and type: the human annotators assign scores before labels, but there are some interactions. Scores for NOALI alignments are ignored. EQUI alignments must have a score of 5. All other alignment types must have a score greater than 0 and lower than 5.
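The score/type interactions above can be sketched as a small consistency check. This is a hypothetical helper for illustration only, not part of the official scripts:

```python
# Hypothetical sanity check for the score/type interactions described above.
def check_alignment(align_type, score):
    base = align_type.split('_')[0]   # strip optional FACT/POL tags, e.g. "EQUI_POL"
    if base == 'NOALI':
        return True                   # score is ignored for unaligned chunks
    if base == 'EQUI':
        return score == 5             # equivalent chunks must score 5
    # OPPO, SPE1, SPE2, SIMI, REL: score strictly between 0 and 5
    return 0 < score < 5
```

For example, `check_alignment('EQUI_FACT', 4)` fails because EQUI requires a score of 5, while `check_alignment('SPE1', 4)` passes.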

Please check the guidelines for more details on the score and alignment types.

Participants can use the train data to develop their systems.

There will be two separate subtracks:

  •     Raw input: participants need to identify the chunks, and then do the alignment
  •     Chunked input: the input will be split in gold standard chunks, and participants focus on the alignment

Participant teams will be allowed to submit three runs for each subtrack. Runs that fail the well-formedness check (see below) will be discarded.


Input format

The input consists of two files:

  • a file with the first sentences in each pair
  • a file with the second sentences in each pair

The sentences are tokenized.

Please check STSint.input.*.sent1.txt and STSint.input.*.sent2.txt

Participants can also use the input sentences with gold standard chunks:

  • a file with the first sentences in each pair, with '[' and ']' to mark chunks
  •  a file with the second sentences in each pair, with '[' and ']' to mark chunks

Please check STSint.input.*.sent1.chunk.txt and STSint.input.*.sent2.chunk.txt
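A minimal sketch of reading the chunked input, assuming tokenized sentences where '[' and ']' appear as space-separated markers (the helper name and the exact bracket spacing are assumptions, not part of the official tools):

```python
# Hypothetical helper: parse a chunked sentence such as
# "[ The red car ] [ crashed ]" into (tokens, 1-based token indices) per chunk.
def parse_chunks(line):
    chunks = []
    current = None
    index = 0                     # running token index; brackets are not tokens
    for tok in line.split():
        if tok == '[':
            current = ([], [])
        elif tok == ']':
            chunks.append(current)
            current = None
        else:
            index += 1
            current[0].append(tok)
            current[1].append(index)
    return chunks
```

Keeping the 1-based token indices alongside the tokens is useful because the alignment format below refers to chunks by token index sequences.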


Gold standard annotation format

The gold standard annotation format is the word alignment format (.wa files), an XML-like file as produced by

We slightly modified the format to also include the score. Each alignment is reported in one line as follows:

  token-id-seq1 <==> token-id-seq2 // type // score // comment


  • token-id-seq1 is a sequence of token indices (starting at 1) for the chunk(s) in sentence 1 (or 0 if the chunk in sentence 2 is not aligned)
  • token-id-seq2 is a sequence of token indices (starting at 1) for the chunk(s) in sentence 2 (or 0 if the chunk in sentence 1 is not aligned)
  • type is one of the obligatory labels, optionally concatenated with the optional tags using '_'
  • score is a number from 0 to 5, or NIL (if type label is NOALI)
  • comment is any string

Please check the *.wa files
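The alignment lines above are easy to parse by splitting on the separators. A minimal sketch (a hypothetical parser, assuming the comment field itself contains no '//'):

```python
# Hypothetical parser for one alignment line of the .wa format:
#   token-id-seq1 <==> token-id-seq2 // type // score // comment
def parse_alignment(line):
    left, rest = line.split('<==>')
    seq2, align_type, score, comment = [f.strip() for f in rest.split('//')]
    ids1 = [int(i) for i in left.split()]
    ids2 = [int(i) for i in seq2.split()]
    return ids1, ids2, align_type, None if score == 'NIL' else float(score), comment
```

For instance, the line `1 2 <==> 3 4 5 // EQUI // 5 // the red car` yields the token index lists [1, 2] and [3, 4, 5], type EQUI, and score 5.0.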


Answer format

The same format as the gold standard alignment must be used. Only the alignment section of the XML file will be used; the source and target sections will be ignored (so any system using different token numbering would be penalized). The sentence id is very important, as it is used to find the corresponding gold standard pair.

Please check STSint.output.wa

You can check for well-formedness using the provided script as follows:

    $ ./
    $ ./
    $ ./ evalsamples/gs.wa
    $ ./ evalsamples/output.gschunk.wa
    $ ./ evalsamples/output.syschunk.wa

Answer files that fail the well-formedness check above will be automatically discarded from evaluation.

The same program prints several statistics:

    $ ./ --stats=1
    $ ./ --stats=1



Evaluation

The official evaluation is based on Melamed (1998), which uses the F1 of the precision and recall of token alignments (in the context of alignment for machine translation). Fraser and Marcu (2007) argue that F1 is a better measure than Alignment Error Rate.

The idea is that, for each pair of chunks that are aligned, we consider that any pairs of tokens in the chunks are also aligned with some weight. The weight of each token-token alignment is the inverse of the number of alignments of each token (so-called fan out factor, Melamed, 1998). Precision is measured as the ratio of token-token alignments that exist in both system and gold standard files, divided by the number of alignments in the system. Recall is measured similarly, as the ratio of token-token alignments that exist in both system and gold-standard, divided by the number of alignments in the gold standard. Precision and recall are evaluated for all alignments of all pairs in one go.
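The weighting scheme above can be sketched as follows. This is a simplified illustration under the assumption that each token's weight is the inverse of its fan-out; see the header of the official script for the exact formula:

```python
from collections import Counter

# Illustrative sketch of the weighted token-alignment F1 described above.
def token_pairs(chunk_alignments):
    """Expand chunk alignments [(ids1, ids2), ...] into weighted token pairs."""
    pairs = [(t1, t2) for ids1, ids2 in chunk_alignments
                      for t1 in ids1 for t2 in ids2]
    fan1 = Counter(t1 for t1, _ in pairs)   # alignments per sentence-1 token
    fan2 = Counter(t2 for _, t2 in pairs)   # alignments per sentence-2 token
    # weight of each token-token alignment: inverse of the token fan-out
    return {(t1, t2): 1.0 / max(fan1[t1], fan2[t2]) for t1, t2 in pairs}

def f1(sys_al, gold_al):
    sys_w, gold_w = token_pairs(sys_al), token_pairs(gold_al)
    hits = set(sys_w) & set(gold_w)         # token pairs present in both
    p = sum(sys_w[k] for k in hits) / sum(sys_w.values())
    r = sum(gold_w[k] for k in hits) / sum(gold_w.values())
    return 2 * p * r / (p + r) if p + r else 0.0
```

For example, a system that reproduces the gold chunk alignments exactly scores F1 = 1.0, while one that recovers only one of two aligned chunks loses recall but keeps full precision.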

The script provides four evaluation measures:

    F1 where alignment type and score are ignored
    F1 where alignment types need to match, but scores are ignored
    F1 where alignment type is ignored, but each alignment is penalized when scores do not match
    F1 where alignment types need to match, and each alignment is penalized when scores do not match

When run with the debugging flag on, the script prints detailed scores. It also computes the precision and recall scores by pair (for illustration purposes only).

See the header of for the exact formula.

Examples of use:

   # evaluate a run based on the gold standard chunks
   $ ./ evalsamples/gs.wa evalsamples/output.gschunk.wa
   # evaluate a run based on system-identified chunks
   $ ./ evalsamples/gs.wa evalsamples/output.syschunk.wa

   # detailed scores, including illustrative performance per pair
   $ ./ --debug=1 evalsamples/gs.wa evalsamples/output.gschunk.wa
   $ ./ --debug=1 evalsamples/gs.wa evalsamples/output.syschunk.wa



Contact Info


  • Eneko Agirre
  • Aitor Gonzalez-Agirre
  • Inigo Lopez-Gazpio
  • Montse Maritxalar
  • German Rigau
  • Larraitz Uria
  • University of the Basque Country (UPV/EHU)

email :

group : Interpretable STS SemEval
