haLF: Comparing a Pure CDSM Approach with a Standard Machine Learning System for RTE



Introduction
Recognizing Textual Entailment (RTE) is a widely explored problem (Dagan et al., 2013). Past challenges (Bar-Haim et al., 2006; Giampiccolo et al., 2007) explored methods and models applied to complex and natural texts. In this context, machine learning solutions have shown interesting results. The Shared Task #1 of SemEval instead aims to explore systems in a more controlled textual environment where the phenomena to model are clearer. The aim of the Shared Task is to study how RTE systems built upon compositional distributional semantic models behave with respect to the above tradition. We tried to capture this underlying idea of the task.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
In this paper, we describe our submission to the Shared Task #1. We tried to follow the underlying idea of the task, that is, evaluating the gap between full-fledged recognizing textual entailment systems and compositional distributional semantic models (CDSMs) applied to this task. We thus submitted two runs: 1) a system obtained with a machine learning approach based on the feature spaces of rules with variables and 2) a system completely based on a CDSM that mixes structural and syntactic information by using distributed tree kernels (Zanzotto and Dell'Arciprete, 2012). Our analysis shows that, under the same conditions, the fully CDSM system is still far from being competitive with more complete methods.
The rest of the paper is organized as follows. Section 2 describes the full-fledged recognizing textual entailment system that is used for comparison. Section 3 introduces a novel compositional distributional semantic model, namely the distributed smoothed tree kernels, and the way this model is applied to the task of RTE. Section 4 describes the results in the challenge and draws some preliminary conclusions.

A Standard Full-Fledged Machine Learning Approach for RTE
From now on, the task of recognizing textual entailment (RTE) is defined as the task of deciding whether a pair p = (a, b) such as ("Two children are lying in the snow and are making snow angels", "Two angels are making snow on the lying children") is in entailment, in contradiction, or neutral. As in the tradition of applied machine learning models, the task is framed as a multi-class classification problem. The difficulty is to determine the best feature space on which to train the classifier.
A full-fledged RTE system based on machine learning that has to deal with naturally occurring text is generally based on:
• some within-pair features that model the similarity between the sentence a and the sentence b
• some features representing more complex information of the pair (a, b), such as rules with variables that fire (Zanzotto and Moschitti, 2006)
In the following, we describe the within-pair feature and the syntactic rules with variables used as features in the full-fledged RTE system. As the second feature space is generally huge, the full feature space is generally used in kernel machines, where the final kernel between two instances $p_1 = (a_1, b_1)$ and $p_2 = (a_2, b_2)$ is built from two components: $FR$, which counts how many rules are in common between $p_1$ and $p_2$, and $WTS$, which computes a lexical similarity between a and b. In the following sections we describe the nature of $WTS$ and of $FR$.

Weighted Token Similarity (WTS)
This similarity model was first defined by Corley and Mihalcea (2005) and has since been used by many RTE systems. The model extends a classical bag-of-words model to a Weighted-Bag-of-Words (wbow) by measuring the similarity between the two sentences of the pair at the semantic level instead of the lexical level. For example, consider the pair: "Oscars forgot Farrah Fawcett", "Farrah Fawcett snubbed at Academy Awards". This pair is redundant and, hence, should be assigned a very high similarity. Yet a bag-of-words model would assign a low score, since many words are not shared across the two sentences. wbow fixes this problem by matching 'Oscars'-'Academy Awards' and 'forgot'-'snubbed' at the semantic level. To provide these matches, wbow relies on specific word similarity measures over WordNet (Miller, 1995) that allow synonymy and hyperonymy matches: in our experiments we specifically use the Jiang&Conrath similarity (Jiang and Conrath, 1997).
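The idea of a weighted bag-of-words similarity can be illustrated with a minimal sketch. It assumes a hypothetical `word_sim` function backed by a small hand-written table (standing in for the Jiang&Conrath measure over WordNet), uniform word weights unless a `weights` dictionary is supplied, and a symmetric average of two directed scores. These choices are illustrative only and are not necessarily the exact formulation used in our system.

```python
from typing import Callable, Dict, List, Optional

# Toy similarity table with hypothetical values; a real system would
# query a WordNet-based measure instead.
TOY_SIM = {
    frozenset(["oscars", "academy"]): 0.9,
    frozenset(["forgot", "snubbed"]): 0.8,
}


def word_sim(u: str, v: str) -> float:
    """Return 1.0 for identical words, a table value if known, else 0.0."""
    if u == v:
        return 1.0
    return TOY_SIM.get(frozenset([u, v]), 0.0)


def directed_sim(src: List[str], tgt: List[str],
                 sim: Callable[[str, str], float],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average, over the words of src, of their best match in tgt."""
    if not src or not tgt:
        return 0.0
    weights = weights or {}
    num = sum(weights.get(w, 1.0) * max(sim(w, x) for x in tgt) for w in src)
    den = sum(weights.get(w, 1.0) for w in src)
    return num / den if den else 0.0


def wts(a: List[str], b: List[str],
        sim: Callable[[str, str], float] = word_sim,
        weights: Optional[Dict[str, float]] = None) -> float:
    """Symmetric weighted token similarity between two token lists."""
    return 0.5 * (directed_sim(a, b, sim, weights)
                  + directed_sim(b, a, sim, weights))


def bow_overlap(a: List[str], b: List[str]) -> float:
    """Plain bag-of-words baseline: only exact word matches count."""
    return wts(a, b, sim=lambda u, v: 1.0 if u == v else 0.0)
```

On the example pair, tokenized and lowercased, `wts` scores higher than `bow_overlap` because 'oscars'-'academy' and 'forgot'-'snubbed' contribute partial matches that the exact-match baseline misses.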

Rules with Variables as Features
The above model alone is not sufficient to capture all interesting entailment features as the relation of entailment is not only related to the notion of similarity between a and b.
In the tradition of RTE, an interesting feature space is the one where each feature represents a rule with variables, i.e., a first-order rule that is activated by a pair if the variables can be unified. This feature space was introduced in (Zanzotto and Moschitti, 2006) and shown to improve over the one above. Each feature $\langle fr_1, fr_2 \rangle$ is a pair of syntactic tree fragments augmented with variables. The feature is active for a pair $(t_1, t_2)$ if the syntactic interpretations of $t_1$ and $t_2$ can be unified with $\langle fr_1, fr_2 \rangle$. For example, consider a feature whose two fragments correspond to "X bought Y" and "X owns Y". This feature is active for the pair ("GM bought Opel", "GM owns Opel"), with the variable unification X = "GM" and Y = "Opel". On the contrary, this feature is not active for the pair ("GM bought Opel", "Opel owns GM"), as there is no way of unifying the two variables consistently.
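The activation test for such a feature can be sketched as follows. This is a deliberately simplified illustration: it unifies flat token sequences rather than syntactic tree fragments, assumes that each variable binds exactly one token, and uses a hypothetical convention in which the tokens "X", "Y" and "Z" are variables. The actual feature space operates on parse trees.

```python
from typing import Dict, List, Optional, Tuple

Bindings = Dict[str, str]
Rule = Tuple[List[str], List[str]]


def is_variable(token: str) -> bool:
    """Hypothetical convention: X, Y and Z are the only variables."""
    return token in {"X", "Y", "Z"}


def unify(pattern: List[str], tokens: List[str],
          bindings: Bindings) -> Optional[Bindings]:
    """Match pattern against tokens, extending bindings; None on failure."""
    if len(pattern) != len(tokens):
        return None
    result = dict(bindings)
    for p, t in zip(pattern, tokens):
        if is_variable(p):
            # A variable must keep the value it was first bound to.
            if result.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return result


def rule_fires(rule: Rule, pair: Tuple[str, str]) -> Optional[Bindings]:
    """Return the variable bindings if the rule is active for the pair."""
    left = unify(rule[0], pair[0].split(), {})
    if left is None:
        return None
    # The right-hand side must be unified under the same bindings.
    return unify(rule[1], pair[1].split(), left)


BOUGHT_OWNS: Rule = (["X", "bought", "Y"], ["X", "owns", "Y"])
```

With `BOUGHT_OWNS`, `rule_fires` returns bindings for ("GM bought Opel", "GM owns Opel") and `None` for ("GM bought Opel", "Opel owns GM"), since X is already bound to "GM" when the second sentence is processed.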
$FR(p_1, p_2)$ is a kernel function that counts the number of common rules with variables between $p_1$ and $p_2$. Efficient algorithms for the computation of the related kernel functions can be found in (Moschitti and Zanzotto, 2007; Zanzotto and Dell'Arciprete, 2009; Zanzotto et al., 2011).
The above full-fledged RTE system, although it may use distributional semantics, does not apply a compositional distributional semantic model, as it does not explicitly transform sentences into vectors, matrices, or tensors that represent their meaning.
We here propose a model that can be considered a compositional distributional semantic model, as it transforms sentences into matrices that are then used by the learner as feature vectors. Our model is called distributed smoothed trees (DST).

Notation
Before describing the distributed smoothed trees (DST), we introduce a formal way to denote constituency-based lexicalized parse trees, as DSTs exploit this kind of data structure. Lexicalized trees are denoted with the letter $t$, and $N(t)$ denotes the set of non-terminal nodes of tree $t$. Each non-terminal node $n \in N(t)$ has a label $l_n$ composed of two parts, $l_n = (s_n, w_n)$: $s_n$ is the syntactic label, while $w_n$ is the semantic headword of the tree headed by $n$, along with its part-of-speech tag. Terminal nodes of trees are treated differently: these nodes represent only words $w_n$ without any additional information, and their labels thus consist only of the word itself (see Fig. 2). Given a tree $t$, $h(t)$ is its root node and $s(t)$ is the tree formed from $t$ by considering only the syntactic structure (that is, only the $s_n$ part of the labels); $c_i(n)$ denotes the $i$-th child of a node $n$. As usual for constituency-based parse trees, pre-terminal nodes are nodes that have a single terminal node as child.
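As an illustration of this notation, a lexicalized tree can be represented with a small data structure. The sketch below is only one possible encoding: the class name `Node`, the field names, and the "word/POS" string format for head words are assumptions made for this example, and a tree is identified with its root node, so $h(t)$ is simply the object `t`. How $s(t)$ treats terminal nodes is also an assumption here (the word is kept as the label).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    syn: str                     # s_n for non-terminals, the word for terminals
    head: Optional[str] = None   # w_n with its POS tag; None for terminals
    children: List["Node"] = field(default_factory=list)

    @property
    def is_terminal(self) -> bool:
        return not self.children

    @property
    def is_preterminal(self) -> bool:
        return len(self.children) == 1 and self.children[0].is_terminal


def non_terminals(t: Node) -> List[Node]:
    """N(t): all non-terminal nodes of t, in pre-order."""
    if t.is_terminal:
        return []
    nodes = [t]
    for c in t.children:
        nodes.extend(non_terminals(c))
    return nodes


def child(n: Node, i: int) -> Node:
    """c_i(n): the i-th child of n, counting from 1."""
    return n.children[i - 1]


def syntactic_structure(t: Node) -> Node:
    """s(t): a copy of t that keeps only the syntactic labels."""
    return Node(t.syn, None, [syntactic_structure(c) for c in t.children])


def example_tree() -> Node:
    """Lexicalized tree for 'GM bought Opel' (hand-built for illustration)."""
    np_gm = Node("NP", "GM/NNP", [Node("NNP", "GM/NNP", [Node("GM")])])
    np_opel = Node("NP", "Opel/NNP", [Node("NNP", "Opel/NNP", [Node("Opel")])])
    vp = Node("VP", "bought/VBD",
              [Node("VBD", "bought/VBD", [Node("bought")]), np_opel])
    return Node("S", "bought/VBD", [np_gm, vp])
```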
Finally, we use $\vec{w}_n \in \mathbb{R}^k$ to denote the distributional vector for word $w_n$, whereas $\mathbf{T}$ represents the matrix of a tree $t$ encoding structure and distributional meaning.

The Method at a Glance
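We describe here the approach in a few sentences. In line with tree kernels over structures (Collins and Duffy, 2002), we introduce the set $S(t)$ of the subtrees $t_i$ of a given lexicalized tree $t$. A subtree $t_i$ is in the set $S(t)$ if $s(t_i)$ is a subtree of $s(t)$ and, if $n$ is a node in $t_i$, all the siblings of $n$ in $t$ are in $t_i$. For each node of $t_i$ we only consider its syntactic label $s_n$, except for the head $h(t_i)$, for which we also consider its semantic component $w_n$ (see Fig. 1). The DST function we define computes the following:

$$DST(t) = \mathbf{T} = \sum_{t_i \in S(t)} \mathbf{T}_i$$

where $\mathbf{T}_i$ is the matrix associated to each subtree $t_i$. The similarity between two text fragments $a$ and $b$, represented as lexicalized trees $t^a$ and $t^b$, can be computed using the Frobenius product between the two matrices $\mathbf{T}^a$ and $\mathbf{T}^b$, that is:

$$\langle \mathbf{T}^a, \mathbf{T}^b \rangle_F = \sum_{t^a_i \in S(t^a),\; t^b_j \in S(t^b)} \langle \mathbf{T}^a_i, \mathbf{T}^b_j \rangle_F$$

We want the product $\langle \mathbf{T}^a_i, \mathbf{T}^b_j \rangle_F$ to approximate the dot product between the distributional vectors of the head words, $\langle \vec{w}_{h(t^a_i)}, \vec{w}_{h(t^b_j)} \rangle$, whenever the syntactic structure of the subtrees is the same (that is, $s(t^a_i) = s(t^b_j)$), and $\langle \mathbf{T}^a_i, \mathbf{T}^b_j \rangle_F \approx 0$ otherwise. This property is expressed as:

$$\langle \mathbf{T}^a_i, \mathbf{T}^b_j \rangle_F \approx \delta(s(t^a_i), s(t^b_j)) \cdot \langle \vec{w}_{h(t^a_i)}, \vec{w}_{h(t^b_j)} \rangle \qquad (2)$$

where $\delta$ is 1 when its two arguments are equal and 0 otherwise. To obtain the above property, we define

$$\mathbf{T}_i = \overrightarrow{s(t_i)} \, \vec{w}_{h(t_i)}^{\top}$$

where $\overrightarrow{s(t_i)}$ is the distributed tree fragment (Zanzotto and Dell'Arciprete, 2012) for the subtree $t_i$ and $\vec{w}_{h(t_i)}$ is the distributional vector of the head of the subtree $t_i$. Distributed tree fragments have the property $\langle \overrightarrow{s(t_i)}, \overrightarrow{s(t_j)} \rangle \approx \delta(s(t_i), s(t_j))$. Thus, given the important property of the outer product that applies in the Frobenius product:

$$\langle \vec{a}\,\vec{w}^{\top}, \vec{b}\,\vec{v}^{\top} \rangle_F = \langle \vec{a}, \vec{b} \rangle \cdot \langle \vec{w}, \vec{v} \rangle$$

we have that Equation 2 is satisfied as:

$$\langle \mathbf{T}^a_i, \mathbf{T}^b_j \rangle_F = \langle \overrightarrow{s(t^a_i)}, \overrightarrow{s(t^b_j)} \rangle \cdot \langle \vec{w}_{h(t^a_i)}, \vec{w}_{h(t^b_j)} \rangle \approx \delta(s(t^a_i), s(t^b_j)) \cdot \langle \vec{w}_{h(t^a_i)}, \vec{w}_{h(t^b_j)} \rangle$$

It is possible to show that the overall compositional distributional model $DST(t)$ can be obtained with a recursive algorithm that exploits vectors of the nodes of the tree. The compositional distributional model is then used, with its own kernel function, in the same learning machine used for the traditional RTE system.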
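The outer product property and its effect on the similarity can be checked numerically with a small pure-Python sketch. It makes two simplifying assumptions that are not part of the model: random unit vectors in a high-dimensional space stand in for distributed tree fragments (which are in fact built compositionally), and the subtrees of each tree are given directly as a list of (structure vector, head-word vector) pairs instead of being enumerated from a parse tree. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def outer(u: Sequence[float], v: Sequence[float]) -> Matrix:
    """Outer product u v^T as a list of rows."""
    return [[x * y for y in v] for x in u]


def frobenius(A: Matrix, B: Matrix) -> float:
    """Frobenius product: sum of element-wise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def random_unit_vector(dim: int, rng: random.Random) -> Vector:
    """Stand-in for a distributed tree fragment (an assumption)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def dst(fragments: List[Tuple[Vector, Vector]]) -> Matrix:
    """Sum of the outer products T_i = s_i w_i^T over the given fragments."""
    rows, cols = len(fragments[0][0]), len(fragments[0][1])
    total = [[0.0] * cols for _ in range(rows)]
    for s, w in fragments:
        for r, row in enumerate(outer(s, w)):
            for c, value in enumerate(row):
                total[r][c] += value
    return total


def demo(dim: int = 4000, seed: int = 0) -> Tuple[float, float]:
    """Return (tree similarity, dot product of the head-word vectors)."""
    rng = random.Random(seed)
    s1, s2, s3 = (random_unit_vector(dim, rng) for _ in range(3))
    w_a, w_b = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]   # toy head-word vectors
    tree_a = dst([(s1, w_a), (s2, w_a)])
    tree_b = dst([(s1, w_b), (s3, w_b)])          # only s1 is shared
    return frobenius(tree_a, tree_b), dot(w_a, w_b)
```

In `demo`, the two trees share only the structure `s1`, so the Frobenius product of their matrices is expected to be close to the dot product of the two head-word vectors, with the cross terms contributing only small noise that shrinks as `dim` grows.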

Results and Conclusions
For the submission we used the Java version of LIBSVM (Chang and Lin, 2011). We parsed the sentences with the Stanford Parser (Klein and Manning, 2003) and extracted the heads for use in the lexicalized trees with Collins' rules (Collins, 2003). Table 1 reports our results on the textual entailment classification task, together with the maximum, minimum and average score for the challenge. The first observation is that the full-fledged RTE system is still clearly better than our CDSM system. We believe that the main reason is that the DST cannot encode variables, which is an important aspect to capture when dealing with textual entailment recognition. This is particularly true for this dataset, as it focuses on word ordering and on specific and recurrent entailment rules. Our full-fledged system scored among the first 10 systems, slightly above the overall average score, whereas our pure CDSM system is ranked within the last 3. We think that a more in-depth comparison with other fully CDSM systems will give us better insight into our model and will also assess the quality of our system more realistically.