Interpretable STS

New standalone task for 2016

Interpretable STS task is available at


Given two sentences of text, s1 and s2, STS systems compute how similar s1 and s2 are, returning a similarity score. Although the score is useful for many tasks, it does not indicate which parts of the sentences are equivalent in meaning (or very close in meaning) and which are not.

With this pilot task we want to start exploring whether participant systems are able to explain WHY they think the two sentences are related / unrelated, adding an explanatory layer to the similarity score. As a first step in this direction, given a pair of sentences, participating systems will need to align the chunks in s1 with the chunks in s2, specifying the kind of relation that exists between each pair of chunks and a score for the similarity/relatedness of the pair.

Since this is a pilot, our goal is to have the final task ready for 2016. We expect feedback from participants on the design of the pilot, the annotation guidelines, and the evaluation, which we would like to use to improve the task next year.

NEW: The trial data has been superseded by the train data. Since the release of the trial data we have refined the guidelines and annotation procedure. We also include the instructions for participation and evaluation.

Related work
Brockett (2007) and Rus et al. (2012) produced datasets where corresponding words (including some multiword expressions like named entities) were aligned. Although this kind of alignment is useful, we wanted to move forward to the alignment of segments, and decided to align chunks (Abney, 1991). Brockett (2007) did not label alignments, while Rus et al. (2012) defined a basic typology. In our task we provide a more detailed typology for the aligned chunks, as well as a similarity/relatedness score for each alignment. Contrary to these works, we first identify the segments (chunks in our case) in each sentence separately, and then align them.

In a different strand of work, Nielsen (2009) defined a textual entailment model where the “facets” (words under some syntactic/semantic relation) in the response of a student were linked to concepts in the reference answer. The link would signal whether each facet in the response was entailed by the reference answer or not, but would not explicitly mark which parts of the reference answer caused the entailment. This model was later followed by Levy et al. (2013). Our task is different in that we do identify corresponding chunks in both sentences. We think that, in the future, the aligned facets could provide complementary information to chunks.

Participation in the task
Given the input (pairs of sentences), participants first need to identify the chunks in each sentence, and then align the corresponding chunks. The chunks are based on those used in the CoNLL-2000 chunking task (Abney, 1991; Tjong Kim Sang and Buchholz, 2000), with some adaptations (see the annotation guidelines).

In this pilot only 1:1 alignments and unaligned chunks are allowed. We are aware that 1:1 chunk alignment has some limitations, as it cannot explicitly represent all interactions between chunks. A special tag (ALIC, see below) will be used to mark chunks that would be aligned if N:M alignments were allowed.

For each alignment, the participants need to specify the following:

(1) A similarity/relatedness score between the aligned chunks, from 5 (maximum similarity/relatedness) to 0 (no relation at all):

  • 5 if the meaning of both chunks is equivalent
  • [4,3] if the meaning of both chunks is very similar or closely related
  • [2,1] if the meaning of both chunks is slightly similar or somehow related
  • 0 if the meaning of both chunks is completely unrelated

(2) Type of the alignment:

  1. EQUI: both chunks are semantically equivalent in the context.
  2. OPPO: the meanings of the chunks are in opposition to each other in the context.
  3. SPE1 and SPE2: both chunks have similar meanings, but the chunk in sentence 1 is more specific than the chunk in sentence 2 (SPE1), or vice versa (SPE2).
  4. SIM: similar meanings, but no EQUI, OPPO, SPE1, or SPE2.
  5. REL: related meanings, but no SIM, EQUI, OPPO, SPE1, or SPE2.
  6. ALIC: this chunk has no corresponding chunk in the other sentence because of the 1:1 alignment restriction, but would otherwise be aligned to some other chunk.
  7. NOALI: this chunk has no corresponding chunk in the other sentence.

(3) An optional tag for alignments which show factuality (FACT) or polarity (POL) phenomena.

Regarding the relation between score and type: the human annotators assign scores before labels, but there are some interactions. Scores for ALIC and NOALI will be ignored. EQUI must have a score of 5. The rest must have a score greater than 0 but lower than 5.
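These constraints can be expressed as a small validity check. The following is a hypothetical helper, not part of the official tooling; the function name and the use of None to represent NIL are our own conventions:

```python
def score_is_valid(align_type, score):
    """Return True if `score` is allowed for the given alignment type.

    Rules from the task description:
    - ALIC and NOALI alignments carry no score (NIL, here None).
    - EQUI must score exactly 5.
    - All other types must score strictly between 0 and 5.
    """
    base = align_type.split("_")[0]  # strip optional FACT/POL tags
    if base in ("ALIC", "NOALI"):
        return score is None
    if base == "EQUI":
        return score == 5
    return score is not None and 0 < score < 5

print(score_is_valid("EQUI", 5))       # True
print(score_is_valid("SPE1_FACT", 3))  # True
print(score_is_valid("REL", 5))        # False
```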

Please check the guidelines for more details on the score and alignment types.

Participants can use the train data to develop their systems. The train data contains sentence pairs from the image and headlines datasets used in previous STS tasks (see this file). The test data will also be derived from the image and headlines datasets.

There will be two separate subtracks:

  • Raw input: participants need to identify the chunks, and then do the alignment
  • Chunked input: the input will be split into gold standard chunks, and participants focus on the alignment

Participant teams will be allowed to submit three runs for each subtrack. Runs that fail the well-formedness check (see below) will be discarded.

Input format

The input consists of two files:
- a file with the first sentences in each pair
- a file with the second sentences in each pair

The sentences are tokenized.

Please check STSint.input.*.sent1.txt and STSint.input.*.sent2.txt

Participants can also use the input sentences with gold standard chunks:
- a file with the first sentences in each pair, with [ and ] to mark chunks
- a file with the second sentences in each pair, with [ and ] to mark chunks

Please check STSint.input.*.sent1.chunk.txt and STSint.input.*.sent2.chunk.txt
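For the chunked files, the bracketed sentences can be split into chunk strings with a few lines of code. This is a minimal sketch (the helper name is illustrative, not part of the task tooling):

```python
import re

def read_chunks(line):
    """Split one bracketed sentence into a list of chunk strings."""
    return [c.strip() for c in re.findall(r"\[([^\]]*)\]", line)]

print(read_chunks("[ A man ] [ is playing ] [ a guitar ]"))
# ['A man', 'is playing', 'a guitar']
```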

Gold standard annotation format

The gold standard annotation format is the word alignment format (.wa files), an XML file as produced by the annotation tool. We slightly modified the format to also include the score. Each alignment is reported in one line as follows:

  token-id-seq1 <==> token-id-seq2 // type // score // comment


  • token-id-seq1 is a sequence of token indices (starting at 1) for the chunk in sentence 1 (or 0 if the chunk in sentence 2 is not aligned or is ALIC)
  • token-id-seq2 is a sequence of token indices (starting at 1) for the chunk in sentence 2 (or 0 if the chunk in sentence 1 is not aligned or is ALIC)
  • type is composed of one of the obligatory labels, concatenated to the optional ones by '_'
  • score is a number from 0 to 5, or NIL (if type label is NOALI or ALIC)
  • comment is any string
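A sketch of parsing one such alignment line follows. The function name and field handling are our own assumptions (in particular, a comment containing '//' would need extra care):

```python
def parse_alignment(line):
    """Parse one line of the modified .wa alignment format:
    token-id-seq1 <==> token-id-seq2 // type // score // comment
    """
    left, rest = line.split("<==>")
    ids2, type_, score, comment = [f.strip() for f in rest.split("//")]
    return {
        "seq1": [int(t) for t in left.split()],   # token indices in sentence 1 (0 = unaligned)
        "seq2": [int(t) for t in ids2.split()],   # token indices in sentence 2 (0 = unaligned)
        "type": type_,
        "score": None if score == "NIL" else float(score),
        "comment": comment,
    }

a = parse_alignment("1 2 <==> 3 4 5 // EQUI // 5 // the two chunks match")
print(a["type"], a["score"])  # EQUI 5.0
```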

Please check the *.wa files.

Answer format

The same format as the gold standard alignment has to be used. Only the alignment section of the XML file will be used; the source and target sections will be ignored (so any system using different token numbers will be penalized). The sentence id is very important, as it will be used to find the corresponding gold standard pair.

Please check STSint.output.wa

You can check for well-formedness using the provided script as follows:

    $ ./ evalsamples/gs.wa
    $ ./ evalsamples/output.gschunk.wa
    $ ./ evalsamples/output.syschunk.wa

Answer files which fail the well-formedness check performed by the script above will be automatically discarded from evaluation.

The same program prints several statistics:

    $ ./ --stats=1


Evaluation

The official evaluation is based on Melamed (1998), which uses the F1 of precision and recall of token alignments (in the context of alignment for machine translation). Fraser and Marcu (2007) argue that F1 is a better measure than Alignment Error Rate.

The idea is that, for each pair of chunks that are aligned, we consider that all pairs of tokens in those chunks are also aligned, with some weight. The weight of each token-token alignment is the inverse of the number of alignments of each token (the so-called fan-out factor; Melamed, 1998). Precision is the total weight of token-token alignments that exist in both the system and gold standard files, divided by the number of alignments in the system. Recall is measured similarly, as the total weight of token-token alignments that exist in both the system and gold standard files, divided by the number of alignments in the gold standard. Precision and recall are evaluated for all alignments of all pairs in one go.
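The computation above can be sketched in a few lines. This is an illustration of the idea only, not the official evaluation script; the function names, the handling of index 0, and the exact fan-out weighting are our assumptions:

```python
from collections import Counter

def token_pairs(chunk_alignments):
    """Expand chunk alignments into weighted token-token pairs.

    `chunk_alignments` is a list of (seq1, seq2) pairs of token-index
    lists.  Index 0 marks an unaligned side and is skipped.  Each token
    pair is weighted by the inverse of its tokens' fan-out, following
    the idea in Melamed (1998).
    """
    pairs = [(t1, t2)
             for seq1, seq2 in chunk_alignments
             for t1 in seq1 for t2 in seq2
             if t1 != 0 and t2 != 0]
    fan1 = Counter(t1 for t1, _ in pairs)
    fan2 = Counter(t2 for _, t2 in pairs)
    return {(t1, t2): 1.0 / max(fan1[t1], fan2[t2]) for t1, t2 in pairs}

def f1(sys_aligns, gs_aligns):
    """Weighted token-alignment F1 of a system against the gold standard."""
    sys_w = token_pairs(sys_aligns)
    gs_w = token_pairs(gs_aligns)
    if not sys_w or not gs_w:
        return 0.0
    common = set(sys_w) & set(gs_w)
    precision = sum(sys_w[p] for p in common) / sum(sys_w.values())
    recall = sum(gs_w[p] for p in common) / sum(gs_w.values())
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```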

The script provides four evaluation measures:

  • F1 where alignment type and score are ignored
  • F1 where alignment types need to match, but scores are ignored
  • F1 where alignment type is ignored, but each alignment is penalized when scores do not match
  • F1 where alignment types need to match, and each alignment is penalized when scores do not match

When run with the debugging flag on, the script prints detailed scores. It also computes the precision and recall scores by pair (for illustration purposes only).

See the header of the evaluation script for the exact formula.

Examples of use:

   # check a system which uses the gold standard chunks
   $ ./ evalsamples/gs.wa evalsamples/output.gschunk.wa
   # check a system which identifies chunks on its own
   $ ./ evalsamples/gs.wa evalsamples/output.syschunk.wa

   # detailed scores, including illustrative performance per pair
   $ ./ --debug=1 evalsamples/gs.wa evalsamples/output.gschunk.wa
   $ ./ --debug=1 evalsamples/gs.wa evalsamples/output.syschunk.wa 


References

  • Abney, S. (1991). Parsing by Chunks. In Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.
  • Brockett, C. (2007). Aligning the RTE Corpus. Technical Report MSR-TR-2007-77, Microsoft Research.
  • Fraser, A. and Marcu, D. (2007). Measuring Word Alignment Quality for Statistical Machine Translation. Computational Linguistics, 33(3), 293-303.
  • Levy, O., Zesch, T., Dagan, I. and Gurevych, I. (2013). Recognizing Partial Textual Entailment. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 451-455, Sofia, Bulgaria.
  • Melamed, I. D. (1998). Manual Annotation of Translational Equivalence: The Blinker Project. Technical Report 98-07, Institute for Research in Cognitive Science, Philadelphia.
  • Nielsen, R. D., Ward, W. and Martin, J. H. (2009). Recognizing Entailment in Intelligent Tutoring Systems. Journal of Natural Language Engineering, 15, 479-501. Cambridge University Press, Cambridge, United Kingdom.
  • Rus, V., Lintean, M., Moldovan, C., Baggett, W., Niraula, N. and Morgan, B. (2012). The SIMILAR Corpus: A Resource to Foster the Qualitative Understanding of Semantic Similarity of Texts. In Semantic Relations II: Enhancing Resources and Applications, LREC 2012.
  • Tjong Kim Sang, E. F. and Buchholz, S. (2000). Introduction to the CoNLL-2000 Shared Task: Chunking. Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, Volume 7.

Contact Info

email list:

Other Info


  • NEW Nov. 10: final train data for interpretable STS, with updated evaluation script
  • Oct. 16: interpretable STS updated description, train data, guidelines
  • Aug. 15: subtasks with descriptions and trial data available
  • Please fill in SemEval registration form
  • Please join the mailing list for updates