TeamZ: Measuring Semantic Textual Similarity for Spanish Using an Overlap-Based Approach

This paper presents an overlap-based approach using bag of words and the Spanish WordNet to solve the STS-Spanish sub-task (STS-Es) of SemEval-2014 Task 10. Because bag of words is a simple, commonly used baseline for ascertaining similarity, the performance is modest.


Introduction
The objective of STS-Es is to score a pair of Spanish sentences on a scale from 0 (the two sentences are on different topics) to 4 (the two sentences are completely equivalent, as they mean the same thing) (Agirre et al., 2014). Textual similarity is useful in various NLP applications, such as information retrieval, text categorisation, word sense disambiguation, text summarisation, and topic detection (Besançon et al., 1999; Mihalcea et al., 2006; Islam and Inkpen, 2008).
The method presented in this paper calculates similarity from the number of words that two given sentences have in common. Being simplistic, this approach suffers from various drawbacks. Firstly, semantically similar sentences need not have many words in common (Li et al., 2006). Secondly, even if the sentences share many words, the contexts in which those words are used can differ (Sahami and Heilman, 2006). For example, under the bag of words approach, the sentences in Table 1 would be scored the same; however, only sentences [2] and [3] mean the same.
Table 1:
[1] [...] (He is clever.)
[2] Él está listo. (He is ready.)
[3] Él está preparado. (He is prepared.)

Despite the flaws, this approach was used because of the Basic Principle of Compositionality (Zimmermann, 2011), which states that the meaning of a complex expression depends upon the meaning of its components and the manner in which they are composed. Furthermore, mainly nouns were considered in the bag of words because Spanish is an exocentric language, and nouns contain more specific, concrete semantic information than verbs (Herslund, 2010; Herslund, 2012).

Methodology
The training dataset provided for the task consisted of 65 pairs of sentences along with their corresponding similarity scores. There were two test sets: one consisted of 480 sentence pairs from a news corpus, and the other had 324 sentence pairs taken from Wikipedia.
The approach learned the scoring by means of linear regression. Two runs were submitted as solutions. The first run used three-feature vectors, whereas the second used four-feature vectors. The features are the Jaccard indices for the lemmas, noun lemmas, synsets, and noun subjects in each sentence pair. For both runs, the sentence pairs were processed with the TreeTagger (Schmid, 1994), which was chosen because it provides the part-of-speech tag and lemma for each word of a sentence.
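As a rough illustration of the regression step, the sketch below fits ordinary least squares with an intercept on a handful of feature vectors. The training pairs, their scores, and the names `fit_linear` and `predict` are invented for this example; the paper does not say which regression implementation was used.

```python
def fit_linear(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations.

    X is a list of feature vectors, y the list of gold similarity scores.
    Returns [intercept, w1, w2, ...].
    """
    rows = [[1.0] + list(x) for x in X]  # prepend a constant term
    n = len(rows[0])
    # A = X^T X and b = X^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]
    # Back substitution
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (b[i] - tail) / A[i][i]
    return w


def predict(w, x):
    """Predicted similarity score for one feature vector."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


# Toy training data (made up): three overlap features per sentence pair.
train_X = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
train_y = [0.5, 1.5, 2.5, 1.0, 4.0]
weights = fit_linear(train_X, train_y)
score = predict(weights, (0.6, 1 / 3, 1.0))
```

The solver assumes the feature columns are not linearly dependent; with degenerate training data the division by the pivot would fail.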
Run 1 used these features:
• The fraction of lemmas that were common between the two sentences. In other words, the number of unique lemmas common between the sentences divided by the total number of unique lemmas of the two sentences.
• The fraction of noun lemmas common between the two sentences.
• The fraction of synsets common between the two sentences. For each noun, its corresponding synset was extracted from the Spanish WordNet (spaWN) of the Multilingual Central Repository (MCR 3.0) (Gonzalez-Agirre et al., 2012).
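The three Run 1 features can be sketched as Jaccard indices over sets. In this minimal example, each sentence is assumed to arrive as a list of (lemma, tag) pairs, nouns are assumed to carry the tags NC or NP, and `toy_synsets` is a hypothetical stand-in for a Spanish WordNet lookup with invented synset identifiers.

```python
NOUN_TAGS = {"NC", "NP"}  # assumption: tags marking common and proper nouns


def jaccard(a, b):
    """Number of shared unique items divided by the number of unique items overall."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sets score 0
    return len(set_a & set_b) / len(union)


def run1_features(sent_a, sent_b, synsets_of):
    """Return (lemma overlap, noun-lemma overlap, noun-synset overlap)."""
    lemmas_a = [lemma for lemma, _ in sent_a]
    lemmas_b = [lemma for lemma, _ in sent_b]
    nouns_a = [lemma for lemma, tag in sent_a if tag in NOUN_TAGS]
    nouns_b = [lemma for lemma, tag in sent_b if tag in NOUN_TAGS]
    # Collect every synset listed for each noun; unknown nouns contribute nothing.
    syn_a = {s for noun in nouns_a for s in synsets_of.get(noun, ())}
    syn_b = {s for noun in nouns_b for s in synsets_of.get(noun, ())}
    return (
        jaccard(lemmas_a, lemmas_b),
        jaccard(nouns_a, nouns_b),
        jaccard(syn_a, syn_b),
    )


# Hypothetical lookup table: 'perro' and 'can' share a synset.
toy_synsets = {"perro": {"syn-dog"}, "can": {"syn-dog"}, "carne": {"syn-meat"}}

# "El perro come carne." vs. "El can come carne." (lemmatised and tagged by hand)
pair_a = [("el", "ART"), ("perro", "NC"), ("comer", "VLfin"), ("carne", "NC")]
pair_b = [("el", "ART"), ("can", "NC"), ("comer", "VLfin"), ("carne", "NC")]
features = run1_features(pair_a, pair_b, toy_synsets)
```

In this toy pair, the noun-lemma overlap is lower than the synset overlap because the two different nouns map to the same synset, which is the situation the synset feature is meant to capture.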
Run 2 employed one more feature in addition to the aforementioned: the fraction of synsets of noun subjects that were common to each sentence pair. The subject nouns were extracted from the sentences after parsing them with the MaltParser (Nivre et al., 2007). Since the TreeTagger PoS tagset differed from the EAGLES (Expert Advisory Group on Language Engineering Standards) tagset required by the MaltParser, rules were written to best translate the TreeTagger tags into EAGLES tags. However, a one-to-one mapping was not possible: EAGLES tags are seven characters long and encode number and gender, whereas TreeTagger tags do not. For example, using the EAGLES tagset, the masculine singular common noun árbol 'tree' is tagged as NCMS000, whereas the feminine singular common noun hoja 'leaf' is tagged as NCFS000; TreeTagger, on the other hand, tags both as NC.
[...] the words. Finally, converting TreeTagger tags to those required by the MaltParser, instead of using a parser that annotates with EAGLES tags, may also have contributed to the relatively low Run 2 score. However, the confidence intervals of the two runs obtained after bootstrapping overlapped. Thus, the difference between the two runs on both datasets is not statistically significant.
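The information loss in the tag translation described for Run 2 can be made concrete with a small sketch. The rule table below is hypothetical: it assumes that attributes the TreeTagger does not provide are filled with '0', and only the NC, NCMS000, and NCFS000 tags come from the text. The actual rules written for the system are not given in the paper.

```python
# Hypothetical translation rules; '0' is assumed to mean "attribute unspecified".
TREETAGGER_TO_EAGLES = {
    "NC": "NC00000",  # common noun: gender and number unknown
    "NP": "NP00000",  # proper noun (illustrative entry, not from the paper)
}


def treetagger_to_eagles(tag):
    """Translate a TreeTagger tag to a seven-character EAGLES tag, or None if no rule exists."""
    return TREETAGGER_TO_EAGLES.get(tag)


def unrecovered_positions(full_tag, translated_tag):
    """Character positions where the translated tag differs from the full EAGLES tag."""
    return [i for i, (f, t) in enumerate(zip(full_tag, translated_tag)) if f != t]


# árbol 'tree' and hoja 'leaf' are both tagged NC by the TreeTagger,
# so they receive the same translated tag.
arbol_tag = treetagger_to_eagles("NC")
hoja_tag = treetagger_to_eagles("NC")

# Compared with the full EAGLES tags from the text, positions 2 and 3
# (gender and number) cannot be recovered.
lost_arbol = unrecovered_positions("NCMS000", arbol_tag)
lost_hoja = unrecovered_positions("NCFS000", hoja_tag)
```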