AI-KU: Using Co-Occurrence Modeling for Semantic Similarity

In this paper, we describe our unsupervised method submitted to the Cross-Level Semantic Similarity task in SemEval 2014, which computes the semantic similarity between two text fragments of different sizes. Our method models each text fragment using the co-occurrence statistics of either the words that occur in it or their substitutes. The co-occurrence modeling step provides a dense, low-dimensional embedding for each fragment, which allows us to calculate semantic similarity using various similarity metrics. Although our current model ignores syntactic information, we achieved promising results and outperformed all baselines.

There are three main approaches to computing the semantic similarity between two text fragments. The first approach uses Vector Space Models (see Turney & Pantel (2010) for an overview), where each text is represented as a bag-of-words model. The similarity between two text fragments can then be computed with various metrics such as cosine similarity. Sparseness of the input is the key problem for these models. Therefore, later works such as Latent Semantic Indexing (?) and Topic Models (Blei et al., 2003) overcome sparsity problems by reducing the dimensionality of the model through latent variables. The second approach blends various lexical and syntactic features and attacks the problem through machine learning models. The third approach is based on word-to-word similarity alignment (Pilehvar et al., 2013; Islam and Inkpen, 2008).
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/
The Cross-Level Semantic Similarity (CLSS) task in SemEval 2014 (Jurgens et al., 2014) provides an evaluation framework to assess similarity methods for texts of different sizes (i.e., lexical levels). Unlike previous SemEval and *SEM tasks, which compared texts of similar size, this task consists of four subtasks (paragraph2sentence, sentence2phrase, phrase2word and word2sense) that evaluate systems on pairs of texts of different sizes. A system should report the similarity score of a given pair, ranging from 4 (two items have very similar meanings and the most important ideas, concepts, or actions in the larger text are represented in the smaller text) to 0 (two items do not mean the same thing and are not on the same topic).
In this paper, we describe our two unsupervised systems, which are based on co-occurrence statistics of words. The only difference between the systems is the input they use. The first system uses the words in the text directly (after lemmatization, stop-word removal and exclusion of non-alphanumeric characters), while the second system uses the most likely substitutes suggested by a 4-gram language model for each observed word position (i.e., context). Note that we participated in two subtasks: paragraph2sentence and sentence2phrase.
The remainder of the paper proceeds as follows. Section 2 explains the preprocessing part, the difference between the systems, co-occurrence modeling, and how we calculate the similarity between two texts after co-occurrence modeling has been done. Section 3 discusses the results of our systems and compares them to other participants'. Section 4 discusses the findings and concludes with plans for future work.

Algorithm
This section explains the preprocessing steps and the details of our two systems (see footnote 2). Both systems rely on co-occurrence statistics. The slight difference between the two is that the first uses the words that occur in the given text fragment (e.g., paragraph, sentence), whereas the second employs co-occurrence statistics on 100 substitute samples for each word within the given text fragment.

Data Preprocessing
The two AI-KU systems are distinguished by their inputs. One uses the raw input words, whereas the other uses each word's likely substitutes according to a language model.
AI-KU 1 : This system uses the words that occur in the text. All words are lowercased. Lemmatization (see footnote 3) and stop-word removal were performed, and non-alphanumeric characters were excluded. Table 1 displays the pairs for the following sentence, an instance from the paragraph2sentence test set: "Choosing what to buy with a $35 gift card is a hard decision." Note that the input we used to model co-occurrence statistics consists of all such pairs for each fragment in a given subtask.
2 The code to replicate our work can be found at https://github.com/osmanbaskaya/semeval14-task3.
3 Lemmatization is carried out with Stanford CoreNLP and transforms a word into its canonical or base form.
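The AI-KU 1 preprocessing can be illustrated with a minimal sketch. It is not the released code: the stop-word list, the `make_pairs` helper, and the `sent-1` identifier are assumptions made for illustration, and lemmatization is omitted because it relies on Stanford CoreNLP.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the paper does not
# state which list was actually used.
STOP_WORDS = {"a", "an", "the", "is", "to", "with", "what", "of", "and", "in"}


def make_pairs(instance_id, text, stop_words=STOP_WORDS):
    """Return (instance_id, word) pairs for one text fragment.

    Lemmatization is skipped in this sketch.
    """
    pairs = []
    for token in text.lower().split():
        # Drop every non-alphanumeric character from the token.
        token = re.sub(r"[^a-z0-9]", "", token)
        if token and token not in stop_words:
            pairs.append((instance_id, token))
    return pairs


sentence = "Choosing what to buy with a $35 gift card is a hard decision."
pairs = make_pairs("sent-1", sentence)
words = [word for _, word in pairs]
```

Because lemmatization is left out and the stop-word list is invented, the resulting pairs need not match those in Table 1.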
AI-KU 2 : Previously, the use of high-probability substitutes and their co-occurrence statistics achieved notable performance on the Word Sense Induction (WSI) (Baskaya et al., 2013) and Part-of-Speech Induction (Yatbaz et al., 2012) problems. AI-KU 2 represents each context of a word by finding the 100 most likely substitutes suggested by the 4-gram language model we built from ukWaC (Ferraresi et al., 2008), a 2-billion word web-gathered corpus. Since the S-CODE algorithm works with discrete input, for each context we sample 100 substitute words with replacement according to their probabilities. Table 2 illustrates the contexts and the substitutes of each context using a bigram language model. No lemmatization, stop-word removal, or lowercasing was performed.
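The sampling step for AI-KU 2 might look like the following sketch. The substitute distribution is made up for illustration; in the actual system the probabilities would come from the 4-gram language model. The `sample_substitutes` helper and the `sent-1` identifier are hypothetical names.

```python
import random


def sample_substitutes(substitute_probs, n_samples=100, seed=None):
    """Draw n_samples substitutes with replacement, weighted by probability."""
    rng = random.Random(seed)
    words = list(substitute_probs)
    weights = [substitute_probs[w] for w in words]
    return rng.choices(words, weights=weights, k=n_samples)


# Invented distribution for a single context (word position).
probs = {"purchase": 0.5, "get": 0.3, "order": 0.2}
samples = sample_substitutes(probs, n_samples=100, seed=0)

# Each sampled substitute becomes one (instance-id, substitute) pair.
pairs = [("sent-1", s) for s in samples]
```

Sampling with replacement turns a probability distribution into discrete tokens, so more probable substitutes simply appear more often among the 100 pairs for a context.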

Co-Occurrence Modeling
This subsection explains the unsupervised method we employed to model co-occurrence statistics: the Co-occurrence Data Embedding (CODE) method (Globerson et al., 2007) and its spherical extension (S-CODE) proposed by Maron et al. (2010). Unlike in our WSI work, where the co-occurrence modeling step produced an embedding for each word, in this task we model each text unit, such as a paragraph, a sentence or a phrase, to obtain an embedding for each instance.
The input data for the S-CODE algorithm consist of pairs of an instance-id and each word in the text unit for the first system (Table 1 illustrates the pairs for only one text fragment), and of pairs of an instance-id and each of the 100 substitute samples of each word in the text for the second system. In the initial step, S-CODE places all instance-ids and words (or substitutes, depending on the system) randomly on an n-dimensional sphere. If two different instances share a word or substitute, they attract one another; otherwise they repel each other. When S-CODE converges, instances that have similar words or substitutes are located close to each other, and the remaining instances are distant from each other.
AI-KU 1 : Based on training-set performance for various values of n (i.e., the number of dimensions for the S-CODE algorithm), we picked n = 100 for both subtasks.
AI-KU 2 : We picked n to be 200 and 100 for paragraph2sentence and sentence2phrase subtasks, respectively.
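The attract/repel intuition behind S-CODE can be illustrated with a toy sketch. This is not the S-CODE update rule of Maron et al. (2010): it is a simplified stand-in that pulls observed (instance, word) pairs together, pushes each instance away from a randomly drawn word, and renormalizes every vector onto the unit sphere. All function names and hyperparameters are assumptions.

```python
import math
import random


def normalize(v):
    """Project a vector back onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def random_unit_vector(dim, rng):
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def toy_sphere_embedding(pairs, dim=10, epochs=50, lr=0.1, seed=0):
    """Toy attract/repel embedding on the unit sphere (NOT exact S-CODE)."""
    rng = random.Random(seed)
    instances = sorted({i for i, _ in pairs})
    words = sorted({w for _, w in pairs})
    phi = {i: random_unit_vector(dim, rng) for i in instances}  # instances
    psi = {w: random_unit_vector(dim, rng) for w in words}      # words
    for _ in range(epochs):
        for inst, word in pairs:
            # Attraction: move the observed instance and word closer.
            p, q = phi[inst], psi[word]
            phi[inst] = normalize([a + lr * (b - a) for a, b in zip(p, q)])
            psi[word] = normalize([b + lr * (a - b) for a, b in zip(p, q)])
            # Repulsion: push the instance away from a random word.
            # Simplification: the random word may co-occur with the
            # instance in another pair; only the current word is skipped.
            other = rng.choice(words)
            if other != word:
                p, r = phi[inst], psi[other]
                phi[inst] = normalize([a - lr * (b - a) for a, b in zip(p, r)])
    return phi


pairs = [("s1", "gift"), ("s1", "card"), ("s2", "gift"), ("s2", "card"),
         ("s3", "rain"), ("s3", "storm")]
embeddings = toy_sphere_embedding(pairs)
```

The returned dictionary maps each instance-id to a unit-length vector, which is the form of output the similarity calculation consumes.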

Since this step is unsupervised, we tried to enrich the data with ukWaC; however, this enrichment did not work well on the training data. Therefore, the reported scores were obtained using only the training and test data provided by the organizers.

Similarity Calculation
When S-CODE converges, there is an n-dimensional embedding for each instance at every textual level (e.g., paragraph, sentence, phrase). Any similarity metric can then be used to compare these embeddings. For this task, systems should report only the similarity between two specific cross-level instances. We used cosine similarity to calculate the similarity between two textual units. This value is the final similarity for the two instances; no further processing (e.g., scaling) was applied.
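For reference, cosine similarity between two embeddings can be computed as in the sketch below. The vectors are arbitrary illustrative values, not actual S-CODE output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


paragraph_vec = [1.0, 2.0]   # illustrative embedding of the longer text
sentence_vec = [2.0, 4.0]    # illustrative embedding of the shorter text
score = cosine_similarity(paragraph_vec, sentence_vec)
```

Cosine similarity lies in [-1, 1] rather than in the task's 0 to 4 range. Pearson correlation is unchanged by a positive linear rescaling of the scores, and Spearman's rank correlation by any increasing transformation, so leaving the scores unscaled does not affect the evaluation.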
In this task, two correlation metrics were used to evaluate the systems: Pearson correlation and Spearman's rank correlation. Pearson correlation measures the strength of the linear relationship between the system's similarity ratings and the gold standard ratings. Spearman's rank correlation measures the agreement between two rankings: the one induced by the system's similarity ratings and the one induced by the gold standard ratings.
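The two metrics can be sketched in a few lines. This is an illustrative implementation, not the official task scorer; the example ratings are invented, and ties are handled with average ranks, which is one common convention.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


gold = [0, 1, 2, 3, 4]                # invented gold ratings
system = [0.1, 0.2, 0.4, 0.8, 1.6]    # invented system scores
r = pearson(gold, system)
rho = spearman(gold, system)
```

In this example the system orders every pair correctly, so Spearman's correlation is perfect, while Pearson's is lower because the scores are not a linear function of the gold ratings.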

Evaluation Results
Tables 3 and 4 show the scores for the Paragraph-2-Sentence and Sentence-2-Phrase subtasks on the training data, respectively. These tables contain the best individual scores for the performance metrics, the Normalized Longest Common Substring (LCS) baseline, which was given by the task organizers, and three additional baselines: lin (Lin, 1998), lch (Leacock and Chodorow, 1998), and the Jaccard Index (JI) baseline.
lin uses the information content (Resnik, 1995) of the least common subsumer of concepts A and B. Information content (IC) indicates the specificity of a concept; the least common subsumer of concepts A and B is the most specific concept from which both A and B are inherited. lin similarity returns the ratio of twice the IC of the least common subsumer of A and B to the sum of the ICs of both concepts. On the other hand, lch is a score denoting how similar two concepts are, calculated by using the shortest path that connects the concepts and the maximum depth of the taxonomy in which the concepts occur (please see Pedersen et al. (2004) for further details of these measures).
These two baselines were calculated as follows. First, using the Stanford Part-of-Speech Tagger (Toutanova and Manning, 2000), we tagged words across all textual levels. After tagging, we found the synsets of each word that matched its part-of-speech using WordNet 3.0 (Miller and Fellbaum, 1998). For each synset of a word in the shorter textual unit (e.g., a sentence is shorter than a paragraph), we calculated the lin/lch measure against each synset of all words in the longer textual unit and picked the highest score. Once we had the scores for all words, we calculated their mean to obtain the similarity for one pair in the test set.
Finally, the Jaccard Index baseline simply calculates the number of words shared by the two textual levels (intersection), normalized by the total number of distinct words (union).
Tables 5 and 6 show the AI-KU runs on the test data.
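The Jaccard Index baseline and the max-then-mean aggregation used for the lin/lch baselines can be sketched as follows. The `word_sim` argument is a hypothetical stand-in for a WordNet-based measure that already returns the best score over all synset pairs of two words; the exact-match function used here exists only to make the example runnable.

```python
def jaccard_index(words_a, words_b):
    """Number of shared words divided by the number of distinct words."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def max_mean_similarity(shorter_words, longer_words, word_sim):
    """Best match in the longer unit for each shorter-unit word, averaged."""
    best = [max(word_sim(s, l) for l in longer_words) for s in shorter_words]
    return sum(best) / len(best)


def exact_match(a, b):
    # Placeholder similarity; a real baseline would call lin or lch here.
    return 1.0 if a == b else 0.0


ji = jaccard_index(["gift", "card", "hard"], ["gift", "card", "buy"])
agg = max_mean_similarity(["gift", "rain"], ["gift", "card"], exact_match)
```

Here two of the four distinct words are shared, so the Jaccard Index is 0.5; with the placeholder similarity, one of the two shorter-unit words finds a match, which also yields 0.5.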
Next, we present our results pertaining to the test data.

Paragraph2Sentence: Both systems outperformed all the baselines on both metrics. The best score for this subtask was .837; our systems achieved .732 and .698 on Pearson and performed similarly on the Spearman metric. These scores are promising since our current unsupervised systems are based on a bag-of-words approach and do not use any syntactic information.
Sentence2Phrase: In this subtask, the AI-KU systems outperformed all baselines, with the exception of the AI-KU 2 system, which performed slightly worse than LCS on the Spearman metric. Performances of systems and baselines were lower than Para-

Conclusion
In this work, we introduced two unsupervised systems that use co-occurrence statistics and represent textual units as dense, low-dimensional embeddings. Although the current systems are based on a bag-of-words approach and discard syntactic information, they achieved promising results in both the paragraph2sentence and sentence2phrase subtasks. For future work, we will extend our algorithm by adding syntactic information (e.g., dependency parsing output) to the co-occurrence modeling step.