Semantic Textual Similarity for Spanish
Semantic textual similarity (STS) has received an increasing amount of attention in recent years, culminating in the Semeval/*SEM tasks organized in 2012 and 2013, each bringing together more than 30 participating teams. While the focus to date has been on measuring the similarity of texts in English, we believe it is important to also develop and evaluate methods for text similarity in other languages. This is motivated by the growing number of documents, available online and elsewhere, that are written in languages other than English, as well as by the increased interest in computational linguistic resources and tools for other languages.
The goal of this subtask is to enable the evaluation of semantic textual similarity systems for Spanish.
Participants in the task will submit the output of systems developed to measure semantic textual similarity in Spanish. The annotations and systems will use a scale from 0 (no relation) to 4 (semantic equivalence), indicating the similarity between two sentences. A development dataset of 65 annotated sentence pairs is provided. The test data will consist of two datasets, one of 324 sentence pairs and another of 480 sentence pairs. No training data will be provided, although systems that need training can use the development dataset for this purpose. The development dataset has already been annotated in previous work [1]; the test dataset will be annotated by five native Spanish speakers. Participating systems will be evaluated using the same metric traditionally employed in the evaluation of STS systems, and also used in previous Semeval/*SEM STS evaluations, i.e., mean Pearson correlation between the system output and the gold standard annotations.
Introduction
Given two sentences of text, s1 and s2, the systems participating in this task should compute how similar s1 and s2 are, returning a similarity score and an optional confidence score. The similarity score should range from 0 to 4, where 4 marks paraphrases and 0 marks sentences that have absolutely no relation.
The test dataset contains sentence pairs coming from the following:
1) Spanish news articles (news)
2) Wikipedia articles (wikipedia)
The datasets have been derived as follows:
- STS.input.news.txt: The sentences are extracted from recent newspaper articles (2014) published in Spanish publications from around the world. The articles were mined from the Google News Spanish service http://news.google.es/.
- STS.input.wikipedia.txt: The sentences were selected from a December 2013 dump of the Spanish version of Wikipedia.
NOTE: Participant systems should NOT use the following datasets to develop or train their systems:
- Spanish version of Wikipedia for the Wikipedia test set.
Input format
The input files consist of two fields separated by tabs:
- first sentence (does not contain tabs)
- second sentence (does not contain tabs)
Please check any of the STS.input.*.txt files for an example. The file encoding is UTF-8 (to correctly render diacritics).
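The input format above can be read with a few lines of Python. This is a minimal sketch, not an official reader: the function names read_sts_pairs and load_sts_file are hypothetical, and rejecting any line that does not have exactly two fields is an assumption about how strictly the files should be validated.

```python
import io


def read_sts_pairs(stream):
    """Yield (first_sentence, second_sentence) tuples from a tab-separated text stream."""
    for line_number, line in enumerate(stream, start=1):
        fields = line.rstrip("\r\n").split("\t")
        # Assumption: every line must hold exactly two tab-separated sentences.
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 fields, got {len(fields)}"
            )
        yield fields[0], fields[1]


def load_sts_file(path):
    """Read a whole input file, decoding it as UTF-8 so diacritics survive."""
    with open(path, encoding="utf-8") as handle:
        return list(read_sts_pairs(handle))


# Small in-memory demonstration using one pair from the gold standard examples.
sample = io.StringIO("La mujer está tocando el violín.\tLa joven disfruta escuchar la guitarra.\n")
demo_pairs = list(read_sts_pairs(sample))
```

Keeping the pairs in file order matters, because the answer file is matched to the input line by line.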
Gold Standard
The gold standard contains a score between 0 and 4 for each pair of sentences, with the following interpretation:
(4) The two sentences are completely equivalent, as they mean the same thing.
The bird is bathing in the sink.
Birdie is washing itself in the water basin.
El pájaro se está bañando en el lavabo.
El pájaro se está lavando en el aguamanil.
(3) The two sentences are mostly equivalent, but some details differ.
John said he is considered a witness but not a suspect.
"He is not a suspect anymore." John said.
John dijo que él es considerado como testigo, y no como sospechoso.
"Él ya no es un sospechoso," John dijo.
(2) The two sentences are roughly equivalent, but some important information differs or is missing.
They flew out of the nest in groups.
They flew into the nest together.
Ellos volaron del nido en grupos.
Volaron hacia el nido juntos.
(1) The two sentences are not equivalent, but are on the same topic.
The woman is playing the violin.
The young lady enjoys listening to the guitar.
La mujer está tocando el violín.
La joven disfruta escuchar la guitarra.
(0) The two sentences are on different topics.
John went horseback riding at dawn with a whole group of friends.
Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
Al amanecer, Juan se fue a montar a caballo con un grupo de amigos.
La salida del sol al amanecer es una magnífica vista que puede presenciar si usted se despierta lo suficientemente temprano para verla.
Answer format
The answer format consists of a similarity score followed by an optional confidence score. Each line has two fields separated by a tab:
- a number between 0 and 4 (the similarity score)
- a number between 0 and 100 (the confidence score)
See the file STS.output.wikipedia.es.txt for a sample of the output your system should generate. The files you submit should follow the same naming structure: STS.output.*.es.txt, where * is either wikipedia or news.
The use of confidence scores is experimental, and it is not required for the official score.
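As an illustration of the answer format, the sketch below writes one line per sentence pair. The helper names format_answer_line and write_answers are hypothetical, the number of decimal places is an arbitrary choice, and writing only the similarity score when no confidence is given is an assumption about how the optional field may be omitted; the sample output file is the authoritative reference.

```python
def format_answer_line(similarity, confidence=None):
    """Return one answer line: similarity score, then an optional confidence score."""
    if not 0.0 <= similarity <= 4.0:
        raise ValueError(f"similarity {similarity} is outside the 0-4 scale")
    if confidence is None:
        # Assumption: with no confidence score, the line holds the similarity alone.
        return f"{similarity:.3f}"
    if not 0.0 <= confidence <= 100.0:
        raise ValueError(f"confidence {confidence} is outside the 0-100 range")
    return f"{similarity:.3f}\t{confidence:.1f}"


def write_answers(path, answers):
    """Write (similarity, confidence) tuples in the same order as the input pairs."""
    with open(path, "w", encoding="utf-8") as handle:
        for similarity, confidence in answers:
            handle.write(format_answer_line(similarity, confidence) + "\n")
```

For example, write_answers("STS.output.news.es.txt", [(3.5, 80), (0.0, None)]) would produce a two-line file following the naming structure described above.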
Scoring
The official score is the mean Pearson correlation between the system output and the gold standard annotations.
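The scoring can be sketched as follows. This is not the official evaluation script: the function names pearson and mean_pearson are hypothetical, and averaging the per-dataset correlations with equal weight is an assumption, since the description does not say whether the mean is weighted by dataset size.

```python
import math


def pearson(system_scores, gold_scores):
    """Pearson correlation between two equally long sequences of numbers."""
    if len(system_scores) != len(gold_scores) or len(system_scores) < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    n = len(system_scores)
    mean_sys = sum(system_scores) / n
    mean_gold = sum(gold_scores) / n
    covariance = sum(
        (s - mean_sys) * (g - mean_gold)
        for s, g in zip(system_scores, gold_scores)
    )
    var_sys = sum((s - mean_sys) ** 2 for s in system_scores)
    var_gold = sum((g - mean_gold) ** 2 for g in gold_scores)
    if var_sys == 0 or var_gold == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_sys * var_gold)


def mean_pearson(datasets):
    """Average the per-dataset correlations.

    `datasets` maps a dataset name (e.g. "news") to a
    (system_scores, gold_scores) tuple.
    Assumption: each dataset contributes equally to the mean.
    """
    correlations = [pearson(system, gold) for system, gold in datasets.values()]
    return sum(correlations) / len(correlations)
```

Note that a system returning the same score for every pair has zero variance, so its correlation is undefined; the sketch raises an error in that case rather than guessing a value.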
Participation in the task
Participant teams will be allowed to submit three runs at most.