SSMT: A Machine Translation Evaluation View of Paragraph-to-Sentence Semantic Similarity

This paper presents SSMT, a system measuring the semantic similarity between a paragraph and a sentence, submitted to SemEval-2014 Task 3: Cross-Level Semantic Similarity. The special difficulty of this task is the length disparity between the two texts being compared. We adapt several machine translation evaluation metrics into features to cope with this difficulty, then train a regression model to predict semantic similarity. The system is straightforward in intuition and easy to implement. Our best run achieves 0.808 in Pearson correlation. METEOR-derived features are the most effective ones in our experiments.


Introduction
Cross-level semantic similarity measures the similarity between text units of different levels, for example, between a document and a paragraph, or between a phrase and a word.
Paragraphs and sentences are the natural language units for conveying opinions or stating events in daily life. Posts on forums, questions and answers in Q&A communities, and customer reviews on e-commerce websites are mainly organised in these two units. Better similarity measurement across them will help in clustering similar answers or reviews.
The paragraph-to-sentence semantic similarity subtask in SemEval-2014 Task 3 (Jurgens et al., 2014) is the first semantic similarity competition across these two language levels. The special difficulty of this task is the length disparity between the compared pair: a paragraph contains 3.67 times as many words as a sentence on average in the training set.
Semantic similarity has been well studied at individual levels, for example, the word level (Mikolov et al., 2013), the sentence level (Bär et al., 2012), and the document level (Turney and Pantel, 2010), yet methods for one level can hardly be applied to a different level, let alone to cross-level tasks. The work of Pilehvar et al. (2013) is an exception: they proposed a unified method for semantic comparison at multiple levels, all the way from comparing word senses to comparing text documents. Our work is inspired by automatic machine translation (MT) evaluation, in which different metrics are designed to compare the adequacy and fluency of an MT system's output, called the hypothesis, against a gold-standard translation, called the reference. As MT evaluation metrics measure sentence-pair similarity, it is natural to generalize them to paragraph-sentence pairs.
In this paper, we follow the motivations of several MT evaluation metrics but adapt them to cope with the length disparity of this task, and combine the resulting features in a regression model. Our system SSMT (Semantic Similarity in view of Machine Translation evaluation) involves no extensive resources or strenuous computation, yet gives promising results with just a few simple features.

Regression Framework
In our experiment, we use features adapted from several MT evaluation metrics and combine them in a regression model for semantic similarity measurement. We exploit the following two simple models. A linear regression model is:

y = w_0 + \sum_{i=1}^{n} w_i x_i

A log-linear model is:

y = \exp\left(w_0 + \sum_{i=1}^{n} w_i \log x_i\right)

where y is the similarity score and {x_1, x_2, ..., x_n} are the feature values.
We can see that in the log-linear model, if any feature x_i takes the value 0, the output y will be stuck at 0 no matter what values the other features take. In our experiment we resort to smoothing to avoid this "0-trap" for some features (Section 4.3).
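Assuming the two models take the standard forms above, they can be sketched in a few lines of Python (the weights here are hypothetical placeholders; in the paper they are fit by regression):

```python
import math

def linear_score(weights, features, bias=0.0):
    # y = w0 + sum_i w_i * x_i
    return bias + sum(w * x for w, x in zip(weights, features))

def log_linear_score(weights, features, bias=0.0):
    # y = exp(w0 + sum_i w_i * log(x_i))
    # Any feature equal to 0 forces y to 0 -- the "0-trap".
    if any(x == 0 for x in features):
        return 0.0
    return math.exp(bias + sum(w * math.log(x) for w, x in zip(weights, features)))
```

For example, `log_linear_score([1.0, 1.0], [0.5, 0.0])` returns 0.0 regardless of the first feature's value, which is exactly the behaviour the smoothing in Section 4.3 is meant to avoid.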

Features
MT evaluation metrics range from the lexical level to the syntactic and semantic levels. We consider only lexical ones to avoid complicated steps like parsing or semantic role labelling, which are computationally expensive and may introduce extra noise.
But instead of directly using the MT evaluation metrics, we use the factors within them as features. The idea is that the overall score of an original metric is highly correlated with the lengths of both sides of the compared pair, while its factors are often related to the length of just one side yet still carry useful similarity information.

BLEU-Derived Features
As the most widely used MT evaluation metric, BLEU (Papineni et al., 2002) uses the geometric mean of n-gram precisions to measure hypotheses against references. It is a corpus-based, precision-based metric, and uses a "brevity penalty" as a replacement for recall. Yet this penalty is meaningless at the sentence level, so we consider only the precision factors in BLEU:

P^{BLEU}_n = \frac{\sum_{ngram \in sent} Count_{clip}(ngram)}{\sum_{ngram \in sent} Count(ngram)}

We use the modified n-gram precision, regarding the paragraph as the "reference" and the sentence as the "hypothesis", with n = 1, 2, 3, 4. We call these four features BLEU-derived features.
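As an illustrative sketch (not the paper's actual implementation), the modified n-gram precision with the paragraph in the reference role can be computed like this:

```python
from collections import Counter

def ngrams(tokens, n):
    # All consecutive n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(paragraph, sentence, n):
    # Paragraph plays the "reference" role, sentence the "hypothesis".
    ref_counts = Counter(ngrams(paragraph, n))
    hyp_counts = Counter(ngrams(sentence, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Clip each hypothesis n-gram count by its count in the reference.
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / total
```

The clipping step is what makes the precision "modified": a sentence cannot inflate its score by repeating a word more often than it appears in the paragraph.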

ROUGE-L-Derived Features
ROUGE-L (Lin and Och, 2004) measures the longest common subsequence (LCS) between a compared pair. BLEU requires the words of an n-gram to be consecutive, whereas ROUGE-L allows gaps between them. By considering only in-sequence words, ROUGE-L captures sentence-level structure in a natural way. Given a reference of length n and a hypothesis of length m:

R_{lcs} = \frac{LCS(ref, hyp)}{n}, \quad P_{lcs} = \frac{LCS(ref, hyp)}{m}, \quad F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}

where LCS(ref, hyp) is the length of the LCS of the compared pair. We set β = 1, which means we do not make much distinction between the "reference" and the "hypothesis" here. We call these three features ROUGE-L-derived features.
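A minimal sketch of these three factors, assuming the standard dynamic-programming LCS and the F formula above:

```python
def lcs_length(a, b):
    # Classic DP for the length of the longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, hypothesis, beta=1.0):
    # Returns (P_lcs, R_lcs, F_lcs); beta=1 treats both sides symmetrically.
    lcs = lcs_length(reference, hypothesis)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    r = lcs / len(reference)
    p = lcs / len(hypothesis)
    f = (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
    return p, r, f
```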

ROUGE-S-Derived Features
ROUGE-S (Lin and Och, 2004) uses skip-bigram co-occurrence statistics for similarity measurement. One advantage of skip-bigrams over BLEU is that they do not require consecutive matches while still being sensitive to word order. Given a reference of length n and a hypothesis of length m:

R_{skip2} = \frac{skip2(ref, hyp)}{C(n, 2)}, \quad P_{skip2} = \frac{skip2(ref, hyp)}{C(m, 2)}, \quad F_{skip2} = \frac{(1 + \beta^2) R_{skip2} P_{skip2}}{R_{skip2} + \beta^2 P_{skip2}}

where C is the combination function and skip2(ref, hyp) is the number of common skip-bigrams. We again set β = 1, and call these three indicators ROUGE-S-derived features.
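A sketch of the skip-bigram factors, under the assumption that any gap size is allowed (no skip-distance limit):

```python
from collections import Counter
from itertools import combinations
from math import comb

def skip_bigrams(tokens):
    # All ordered word pairs in sentence order, arbitrary gap allowed.
    return Counter(combinations(tokens, 2))

def rouge_s(reference, hypothesis, beta=1.0):
    # Returns (P_skip2, R_skip2, F_skip2).
    ref, hyp = skip_bigrams(reference), skip_bigrams(hypothesis)
    skip2 = sum(min(c, hyp[g]) for g, c in ref.items())
    if skip2 == 0:
        return 0.0, 0.0, 0.0
    r = skip2 / comb(len(reference), 2)   # C(n, 2) possible pairs in reference
    p = skip2 / comb(len(hypothesis), 2)  # C(m, 2) possible pairs in hypothesis
    f = (1 + beta ** 2) * r * p / (r + beta ** 2 * p)
    return p, r, f
```

On Lin and Och's classic example, "police killed the gunman" vs. "police kill the gunman" share 3 of 6 skip-bigrams, giving P = R = F = 0.5.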

METEOR-Derived Features
METEOR (Banerjee and Lavie, 2005) evaluates a hypothesis by aligning it to a reference translation and gives sentence-level similarity scores. It uses a generalized concept of unigram mapping that matches words of the following types: exact match on word surface forms, stem match on word stems, synonym match according to the synonym sets in WordNet, and paraphrase match (Denkowski and Lavie, 2010). METEOR also distinguishes between content words and function words. Each match type m_i is weighted by w_i. Let (m_i(h_c), m_i(h_f)) be the numbers of content and function words covered by this type in the hypothesis, and (m_i(r_c), m_i(r_f)) the counts in the reference; then:

P = \frac{\sum_i w_i \cdot (\delta \cdot m_i(h_c) + (1 - \delta) \cdot m_i(h_f))}{\delta \cdot |h_c| + (1 - \delta) \cdot |h_f|}

R = \frac{\sum_i w_i \cdot (\delta \cdot m_i(r_c) + (1 - \delta) \cdot m_i(r_f))}{\delta \cdot |r_c| + (1 - \delta) \cdot |r_f|}

F_{mean} = \frac{P \cdot R}{\alpha P + (1 - \alpha) R}

To account for word-order differences, the fragmentation penalty is calculated using the total number of matched words (m) and the number of chunks (ch) in the hypothesis:

Pen = \gamma \cdot (ch / m)^{\beta}

And the final METEOR score is:

Score = (1 - Pen) \cdot F_{mean}

Parameters α, β, γ, δ and w_1 ... w_n are tuned to maximize correlation with human judgements (Denkowski and Lavie, 2014). We use the Meteor 1.5 system for scoring. Parameters are tuned on WMT12, and the paraphrase table is extracted from the WMT data.
We use p, r, frag (frag = ch/m) and the final score as features, and call them METEOR-derived features.
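The Meteor 1.5 toolkit computes these quantities internally; purely to illustrate how the frag and score features combine, given already-computed precision p, recall r, chunk count ch and matched-word count m (the parameter defaults below are illustrative, not the WMT12-tuned values):

```python
def meteor_features(p, r, ch, m, alpha=0.85, beta=0.2, gamma=0.6):
    # F_mean = P*R / (alpha*P + (1-alpha)*R)
    f_mean = p * r / (alpha * p + (1 - alpha) * r) if p and r else 0.0
    # frag = ch / m; Pen = gamma * frag**beta; Score = (1 - Pen) * F_mean
    frag = ch / m if m else 0.0
    penalty = gamma * frag ** beta
    score = (1 - penalty) * f_mean
    return p, r, frag, score
```

A perfectly matched hypothesis aligned in a single chunk (ch = 1) is penalized far less than one whose matches are scattered across many chunks, which is how the score stays sensitive to word order.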

Data Set
The SemEval-2014 Task 3 subtask provides a training set of 500 paragraph-sentence pairs with human-annotated continuous scores from 0 to 4. These pairs are labelled with the genres "newswire / cqa / metaphoric / scientific / travel / review". Systems are asked to predict the similarity scores for the 500 pairs in the test set. Performance is evaluated with Pearson correlation and Spearman correlation.

Data Processing
To avoid meaningless n-gram matches like "the a", or surface-form differences between words, we apply very simple data processing: for features derived from BLEU, ROUGE-L and ROUGE-S, we remove stop words and stem the sentences with coreNLP. For METEOR-derived features, we use the tool's option for text normalization before matching.

1 A chunk is defined as a series of matched unigrams that is contiguous and identically ordered in both sentences.
2 https://www.cs.cmu.edu/~alavie/METEOR/
3 cqa: Community Question Answering site text
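The paper uses coreNLP for this step; as a self-contained toy stand-in (the stop-word list and suffix-stripping rules below are illustrative, not coreNLP's), the preprocessing can be sketched as:

```python
# Toy stop-word list; a real system would use a full list (or coreNLP's).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is"}

def simple_stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer/lemmatizer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Lowercase, drop stop words, stem the rest.
    tokens = text.lower().split()
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]
```

After this step, "cats" and "cat" (or "mats" and "mat") count as n-gram matches, and function-word pairs like "the a" can no longer inflate the precision features.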

Result
Though texts of different genres may call for different regression parameters, we train just one model for all genres for simplicity. Table 1 compares the results. Run1, submitted as SSMT in the official evaluation, is a log-linear model. We choose denser features for the log-linear model and use smoothing to avoid the "0-trap" mentioned in Section 2. The features include P^{BLEU}_1, P^{BLEU}_2, P_{ROUGE-L} and P_{ROUGE-S} (4 features), plus the 4 METEOR-derived features, 8 features altogether. When calculating the first 4 features, we add 1 to both the numerator and the denominator as smoothing. Run2 is a linear regression model with the same features as Run1. Run3 is a simple linear regression model, which is free from the "0-trap", so we use all 14 features without smoothing. We use Matlab for the regression. The baseline is officially provided and based on LCS.
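The add-one smoothing described above is a one-liner; applied to any count-ratio feature, it keeps the value strictly positive so the log-linear model never hits the "0-trap":

```python
def smoothed_precision(matched, total):
    # Add 1 to both numerator and denominator so the feature is never 0,
    # keeping log(x) finite in the log-linear model.
    return (matched + 1) / (total + 1)
```

For example, a feature with 0 matches out of 4 n-grams becomes 1/5 = 0.2 instead of 0.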

System Analysis
We compare the effectiveness of different features in a linear regression model. Table 2 shows the results. "All" refers to all the features; "-METEOR" means the feature set excluding METEOR-derived features. We can see that the METEOR-derived features are the most effective ones here. Figure 1 shows the performance of our system submitted as SSMT in the SemEval-2014 Task 3 competition. It shows quite good correlation with the gold standard.
A well-predicted example is the #trial-p2s-5 pair in the trial set:
Paragraph: Olympic champion Usain Bolt regained his 100m world title and won a fourth individual World Championships gold with a season's best of 9.77 seconds in Moscow. In heavy rain, the 26-year-old Jamaican made amends for his false start in Daegu two years ago and further cemented his status as the greatest sprinter in history. The six-time Olympic champion overtook Justin Gatlin in the final stages, forcing the American to settle for silver in 9.85. Bolt's compatriot Nesta Carter (9.95) claimed bronze, while Britain's James Dasaolu was eighth (10.21).
Sentence: Germany's Robert Harting beats Iran's Ehsan Hadadi and adds the Olympic discus title to his world crown.
The system gives a prediction of 1.253 against the gold standard of 1.25. We can see that topic words like "Olympic", "world crown" and "beats" in the short text correspond to expressions like "world title" and "champion" across several sentences in the long text, but the pair is not talking about the same event. The model captures this commonness and this difference very well.
But Figure 1 also reveals an interesting phenomenon: the system seldom gives the boundary scores of 0 or 4. In other words, it tends to overscore or underscore the boundary conditions. A case in point is the #trial-p2s-17 pair in the trial data, which is actually the worst-predicted pair by our system in the trial set:
Paragraph: A married couple who met at work is not a particularly rare thing. Three in ten workers who have dated a colleague said in a recent survey by CareerBuilder.com that their office romance eventually led to marriage.
Sentence: Marrying a coworker isn't uncommon given that 30% of workers who dated a coworker ended up marrying them.
The system gives a score of 1.773 against the gold standard of 4. It apparently fails to detect the equivalence of the expressions "three in ten" and "30%"; we think this is the main reason for underscoring the similarity, so better detection of phrase-level similarity is desirable. For test pairs of the "Metaphoric" genre, the system underscores almost all of them. This failure was expected, though, because "Metaphoric" pairs demand full understanding of semantic meaning and paragraph structure, which is far beyond the reach of lexical match metrics.

Conclusion
MT evaluation metrics have been directly used as features in paraphrase detection (Finch et al., 2005) and sentence-pair semantic comparison (Souza et al., 2012). But a paragraph-to-sentence pair exhibits significant length disparity; we seek a way to alleviate this impact while still following the motivations underlying these metrics. By factorizing the original metrics, the linear model can flexibly pick out factors that are not sensitive to the length disparity problem.
We derive features from BLEU, ROUGE-L, ROUGE-S and METEOR, and show that METEOR-derived features make the most significant contribution here. Being easy and lightweight, our submitted SSMT achieves 0.789 in Pearson and 0.777 in Spearman correlation, ranking 11th out of the 34 systems in this subtask. Our best run achieves 0.808 in Pearson and 0.786 in Spearman correlation.