Detailed Task Description
Introduction
Semantic textual similarity (STS) has received an increasing amount of attention in recent years, culminating in the SemEval/*SEM tasks organized in 2012, 2013, 2014 and 2015, which brought together more than 60 participating teams. Please check http://ixa2.si.ehu.es/stswiki/ for more details on previous tasks.
Given two sentences of text, s1 and s2, the systems participating in STS compute how similar s1 and s2 are, returning a similarity score. Although this score is useful for many tasks, it does not reveal which parts of the sentences are equivalent in meaning (or very close in meaning) and which are not.
The 2015 STS task offered a pilot subtask on interpretable STS. With the pilot task we wanted to explore whether STS systems are able to explain WHY they think the two sentences are related or unrelated, adding an explanatory layer to the similarity score. As a first step in this direction, participating systems aligned the segments in one sentence of the pair to the segments in the other sentence, describing the kind of relation that held between each pair of segments.
For 2016, the pilot subtask has been promoted to a standalone task, with new training and evaluation sets. If you participated in the STS 2015 interpretable subtask, please check the updates made to the task for 2016.
As this is a new task, our goal is to keep improving it. We welcome feedback from participants on the task design, the annotation guidelines and the evaluation, which we would like to use to guide its further development.
General description
Given the input (pairs of sentences), participants first need to identify the chunks in each sentence and then align the corresponding chunks. The chunks are based on those used in the CoNLL-2000 chunking task (Abney, 1991; Tjong Kim Sang and Buchholz, 2000), with some adaptations (see the annotation guidelines).
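For illustration only, the following Python sketch produces bracketed chunks for a tokenized sentence using NLTK's RegexpParser. The grammar is a rough approximation and does not implement the adaptations described in the annotation guidelines; for the raw input subtrack, any chunker can be used.

# A minimal chunking sketch (not an official task resource). The grammar is only
# an approximation of CoNLL-2000 style chunks and ignores the task-specific
# adaptations in the annotation guidelines.
import nltk

GRAMMAR = r"""
  NP: {<DT|PRP\$>?<JJ.*>*<NN.*|PRP|CD>+}   # simple noun phrases
  VP: {<MD>?<VB.*>+<RB.*>*}                # simple verb groups
  PP: {<IN|TO>}                            # prepositions on their own
"""
chunker = nltk.RegexpParser(GRAMMAR)

def chunk_sentence(tokens):
    """Return the tokenized sentence with '[' and ']' marking chunks."""
    tagged = nltk.pos_tag(tokens)   # needs the NLTK POS tagger model downloaded
    pieces = []
    for node in chunker.parse(tagged):
        if isinstance(node, nltk.Tree):    # a chunk found by the grammar
            pieces.append("[ " + " ".join(tok for tok, _ in node.leaves()) + " ]")
        else:                              # a token left outside any chunk
            pieces.append("[ " + node[0] + " ]")
    return " ".join(pieces)

print(chunk_sentence("Two dogs play in the grass .".split()))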
For each alignment, the participants need to specify the following:
(1) A similarity/relatedness score between the aligned chunks, from 5 (maximum similarity/relatedness) to 0 (no relation at all):
- 5 if the meaning of both chunks is equivalent
- [4,3] if the meaning of both chunks is very similar or closely related
- [2,1] if the meaning of both chunks is slightly similar or somewhat related
- 0 if the meaning of both chunks is completely unrelated.
(2) Type of the alignment:
- EQUI: both chunks are semantically equivalent in the context.
- OPPO: the meanings of the chunks are in opposition to each other in the context.
- SPE1 and SPE2: both chunks have similar meanings, but the chunk in sentence 1 is more specific than the chunk in sentence 2 (SPE1), or vice versa (SPE2).
- SIMI: similar meanings, but none of EQUI, OPPO, SPE1 or SPE2 applies.
- REL: related meanings, but none of SIMI, EQUI, OPPO, SPE1 or SPE2 applies.
- NOALI: this chunk has no corresponding chunk in the other sentence.
(3) An optional tag for alignments which show factuality (FACT) or polarity (POL) phenomena
Regarding the relation between score and type: the human annotators assign the scores before the labels, but there are some interactions. Scores for NOALI alignments are ignored. EQUI alignments should have a score of 5. The rest should have a score greater than 0 and lower than 5.
Please check the guidelines for more details on the score and alignment types.
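The following sketch (not the official checker) encodes these score/type constraints, assuming the obligatory label comes first in the type field:

# A minimal consistency check for score/type interactions: NOALI carries no
# score (reported as NIL), EQUI must score 5, and every other label must have a
# score strictly between 0 and 5. This is an illustration only.
def check_score_type(align_type, score):
    base = align_type.split("_")[0]        # strip optional FACT/POL tags
    if base == "NOALI":
        return score == "NIL"              # the score is ignored / reported as NIL
    if base == "EQUI":
        return float(score) == 5.0
    return 0.0 < float(score) < 5.0

assert check_score_type("EQUI", "5")
assert check_score_type("SPE1_FACT", "4")
assert not check_score_type("SIMI", "0")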
Participants can use the training data to develop their systems.
There will be two separate subtracks:
- Raw input: participants need to identify the chunks and then perform the alignment.
- Chunked input: the input is already split into gold standard chunks, and participants focus on the alignment.
Participating teams will be allowed to submit three runs for each subtrack. Runs that fail the well-formedness check (see below) will be discarded.
Input format
The input consists of two files:
- a file with the first sentences in each pair
- a file with the second sentences in each pair
The sentences are tokenized.
Please check STSint.input.*.sent1.txt and STSint.input.*.sent2.txt
Participants can also use the input sentences with gold standard chunks:
- a file with the first sentences in each pair, with '[' and ']' to mark chunks
- a file with the second sentences in each pair, with '[' and ']' to mark chunks
Please check STSint.input.*.sent1.chunk.txt and STSint.input.*.sent2.chunk.txt
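As an illustration, one possible way to read a gold-chunked sentence in Python, assuming only that chunks are delimited by '[' and ']' as described above (the example line is invented, not taken from the data):

# A minimal sketch for reading one gold-chunked sentence.
import re

def read_chunked(line):
    """Return the list of chunks, each chunk being a list of tokens."""
    return [chunk.split() for chunk in re.findall(r"\[([^\]]*)\]", line)]

print(read_chunked("[ Two dogs ] [ play ] [ in the grass ]"))
# [['Two', 'dogs'], ['play'], ['in', 'the', 'grass']]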
Gold standard annotation format
The gold standard annotation format is the word alignment format (.wa files), an XML-like file as produced by https://www.ldc.upenn.edu/language-resources/tools/ldc-word-aligner.
We slightly modified the format to also include the score. Each alignment is reported in one line as follows:
token-id-seq1 <==> token-id-seq2 // type // score // comment
where:
- token-id-seq1 is a sequence of token indices (starting at 1) for the chunk(s) in sentence 1 (or 0 if the chunk in sentence 2 is not aligned)
- token-id-seq2 is a sequence of token indices (starting at 1) for the chunk(s) in sentence 2 (or 0 if the chunk in sentence 1 is not aligned)
- type is one of the obligatory labels, optionally concatenated with the FACT/POL tags using '_'
- score is a number from 0 to 5, or NIL (if the type label is NOALI)
- comment is any string
Please check STSint.gs.*.wa
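For illustration, a minimal Python parser for one alignment line, following the layout given above (the example line is invented, not taken from the data):

# Parse "token-id-seq1 <==> token-id-seq2 // type // score // comment".
def parse_alignment_line(line):
    left, _, rest = line.partition("<==>")
    right, align_type, score, comment = [f.strip() for f in rest.split("//", 3)]
    return {
        "sent1_tokens": [int(i) for i in left.split()],   # 0 means "not aligned"
        "sent2_tokens": [int(i) for i in right.split()],
        "type": align_type,                               # e.g. EQUI, SPE1_FACT, NOALI
        "score": None if score == "NIL" else float(score),
        "comment": comment,
    }

print(parse_alignment_line("1 2 <==> 3 4 5 // EQUI // 5 // invented example"))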
Answer format
The same format as the gold standard alignment has to be used. Only the alignment section of the XML file will be used; the source and target sections will be ignored (so a system that uses a different tokenization, and therefore different token indices, will be penalized). The sentence id is very important, as it is used to find the corresponding gold standard pair.
Please check STSint.output.wa
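The inverse operation, producing one alignment line for the answer file, can be sketched as follows (field layout only; see STSint.output.wa for a complete example file):

# A minimal sketch mirroring the parser above; NOALI alignments get a NIL score.
def format_alignment_line(toks1, toks2, align_type, score, comment=""):
    score_str = "NIL" if align_type.startswith("NOALI") else str(score)
    return "{} <==> {} // {} // {} // {}".format(
        " ".join(str(t) for t in toks1),
        " ".join(str(t) for t in toks2),
        align_type, score_str, comment)

print(format_alignment_line([1, 2], [3, 4, 5], "EQUI", 5, "invented example"))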
You can check for well-formedness using the provided script as follows:
$ ./wellformed.pl STSint.gs.headlines.wa
$ ./wellformed.pl STSint.gs.images.wa
$ ./wellformed.pl evalsamples/gs.wa
$ ./wellformed.pl evalsamples/output.gschunk.wa
$ ./wellformed.pl evalsamples/output.syschunk.wa
Answer files that fail the well-formedness check performed by the script above will be automatically discarded from the evaluation.
The same script also prints several statistics:
$ ./wellformed.pl STSint.gs.headlines.wa --stats=1
$ ./wellformed.pl STSint.gs.images.wa --stats=1
Scoring
The official evaluation is based on (Melamed, 1998), which uses the F1 of precision and recall of token alignments (originally in the context of word alignment for machine translation). Fraser and Marcu (2007) argue that F1 is a better measure than Alignment Error Rate.
The idea is that, for each pair of chunks that are aligned, all pairs of tokens in those chunks are also considered aligned, with some weight. The weight of each token-token alignment is the inverse of the number of alignments of each token (the so-called fan-out factor; Melamed, 1998). Precision is measured as the (weighted) number of token-token alignments that exist in both the system output and the gold standard, divided by the number of token-token alignments in the system output. Recall is measured similarly, dividing by the number of token-token alignments in the gold standard. Precision and recall are computed over all alignments of all pairs in one go.
The script provides four evaluation measures:
- F1 where alignment type and score are ignored
- F1 where alignment types need to match, but scores are ignored
- F1 where alignment type is ignored, but each alignment is penalized when scores do not match
- F1 where alignment types need to match, and each alignment is penalized when scores do not match
When run with the debugging flag on, the script prints detailed scores. It also computes the precision and recall scores by pair (for illustration purposes only).
See the header of evalF1.pl for the exact formula.
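For illustration, here is a simplified Python sketch of the first measure above (type and score ignored). The fan-out weighting below is one possible reading of the description; the authoritative formula is the one in the header of evalF1.pl, and this sketch is not a re-implementation of that script.

# Token-level F1 sketch: chunk-chunk alignments are expanded into token-token
# pairs, each weighted by the inverse of the fan-out of its tokens.
from collections import defaultdict

def weighted_token_pairs(chunk_alignments):
    """chunk_alignments: list of (tokens1, tokens2) with 1-based token indices;
    a 0 on either side marks an unaligned chunk and contributes nothing."""
    fan1, fan2 = defaultdict(int), defaultdict(int)
    pairs = []
    for toks1, toks2 in chunk_alignments:
        if 0 in toks1 or 0 in toks2:
            continue
        for t1 in toks1:
            for t2 in toks2:
                pairs.append((t1, t2))
                fan1[t1] += 1
                fan2[t2] += 1
    return {(t1, t2): 1.0 / (fan1[t1] * fan2[t2]) for t1, t2 in pairs}

def untyped_f1(system, gold):
    """F1 where alignment type and score are ignored."""
    sys_w, gold_w = weighted_token_pairs(system), weighted_token_pairs(gold)
    common = set(sys_w) & set(gold_w)
    precision = sum(sys_w[p] for p in common) / max(len(sys_w), 1)
    recall = sum(gold_w[p] for p in common) / max(len(gold_w), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# invented toy input: one aligned chunk pair plus one unaligned chunk per side
print(untyped_f1(system=[([1, 2], [1]), ([3], [0])],
                 gold=[([1, 2], [1, 2]), ([3], [0])]))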
Examples of use:
# check a system which uses the gold standard chunks
$ ./evalF1.pl evalsamples/gs.wa evalsamples/output.gschunk.wa
# check a system which uses its own (system) chunks
$ ./evalF1.pl evalsamples/gs.wa evalsamples/output.syschunk.wa
# detailed scores, including illustrative performance per pair
$ ./evalF1.pl --debug=1 evalsamples/gs.wa evalsamples/output.gschunk.wa
$ ./evalF1.pl --debug=1 evalsamples/gs.wa evalsamples/output.syschunk.wa
References
- Abney, S. (1991). Parsing by Chunks. In: Robert Berwick, Steven Abney and Carol Tenny (eds.), Principle-Based Parsing. Kluwer Academic Publishers, Dordrecht.
- Agirre, E., Banea, C., Cardie, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Guo, W., Lopez-Gazpio, I., Maritxalar, M., Mihalcea, R., Rigau, G., Uria, L., & Wiebe, J. (2015). SemEval-2015 Task 2: Semantic Textual Similarity, English, Spanish and Pilot on Interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), June.
- Agirre, E., Gonzalez-Agirre, A., Lopez-Gazpio, I., Maritxalar, M., Rigau, G., & Uria, L. (2016). SemEval-2016 Task 2: Interpretable Semantic Textual Similarity. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval 2016), June.
- Brockett C. (2007). Aligning the RTE Corpus. Technical Report MSR-TR-2007-77, Microsoft Research.
- Alexander Fraser and Daniel Marcu (2007). Measuring Word Alignment Quality for Statistical Machine Translation. Computational Linguistics, 33(3), 293-303.
- Dan Melamed. (1998) Manual annotation of translational equivalence: The blinker project. Technical Report 98-07, Institute for Research in Cognitive Science, Philadelphia
- Levy, O., Zesch, T., Dagan, I., Gurevych, I. (2013). Recognizing Partial Textual Entailment. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 451–455, Sofia, Bulgaria, August 4-9 2013.
- Rodney D. Nielsen, Wayne Ward and James H. Martin. (2009). Recognizing entailment in intelligent tutoring systems. In Ido Dagan, Bill Dolan, Bernardo Magnini and Dan Roth (Eds.):The Journal of Natural Language Engineering, (JNLE), 15, pp 479-501. Copyright Cambridge University Press, Cambridge, United Kingdom.
- Rus, V., Lintean, M., Moldovan, C., Baggett, W., Niraula, N., & Morgan, B. (2012). The SIMILAR Corpus: A Resource to Foster the Qualitative Understanding of Semantic Similarity of Texts. In Semantic Relations II: Enhancing Resources and Applications, The 8th Language Resources and Evaluation Conference (LREC 2012), May (pp. 23-25).
- Erik F. Tjong Kim Sang and Sabine Buchholz (2000). Introduction to the CoNLL-2000 Shared Task: Chunking. In Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning (CoNLL-2000), Volume 7.