Bielefeld SC: Orthonormal Topic Modelling for Grammar Induction

In this paper, we consider the application of topic modelling to the task of inducing grammar rules. In particular, we look at the use of a recently developed method called orthonormal explicit topic analysis, which combines explicit and latent models of semantics. Although it remains unclear how topic models may be applied to grammar induction, we show that it is not impossible and that this may allow the capture of subtle semantic distinctions that are not captured by other methods.


Introduction
Grammar induction is the task of inducing high-level rules for application of grammars in spoken dialogue systems. In practice, we can extract relevant rules, and the task of grammar induction reduces to finding similar rules between two strings. As these strings are not necessarily similar in surface form, what we really wish to calculate is the semantic similarity between them. We therefore attempt to apply topic modelling, that is, methods such as Latent Dirichlet Allocation (Blei et al., 2003), Latent Semantic Analysis (Deerwester et al., 1990) or Explicit Semantic Analysis (Gabrilovich and Markovitch, 2007). In particular, we build on recent work to unify latent and explicit methods by means of orthonormal explicit topics.
In topic modelling the key choice is the document space that will act as the corpus and hence topic space. The standard choice is to regard all articles from a background document collection (Wikipedia articles are a typical choice) as the topic space. However, it is crucial to ensure that these topics cover the semantic space evenly and completely. Following McCrae et al. (2013), we remap the semantic space defined by the topics in such a manner that it is orthonormal. In this way, each document is mapped to a topic that is distinct from all other topics.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http://creativecommons.org/licenses/by/4.0/
The structure of the paper is as follows: we describe the method in Section 2, followed by the approximation method in Section 3, the normalization methods in Section 4 and the application to grammar induction in Section 5. We finish with some conclusions in Section 6.

Orthonormal explicit topic analysis
ONETA (orthonormal explicit topic analysis; McCrae et al., 2013) follows Explicit Semantic Analysis in the sense that it assumes the availability of a background document collection $B = \{b_1, b_2, \ldots, b_N\}$ consisting of textual representations. The mapping into the explicit topic space is defined by a language-specific function $\Phi$ that maps documents into $\mathbb{R}^N$ such that the $j$-th value in the vector is given by some association measure $\phi_j(d)$ for each background document $b_j$. Typical choices for this association measure $\phi$ are the sum of the TF-IDF scores or an information retrieval relevance scoring function such as BM-25 (Sorg and Cimiano, 2010).
For the case of TF-IDF, the value of the $j$-th element of the topic vector is given by:

$$\phi_j(d) = \overrightarrow{\text{tf-idf}}(d)^T \, \overrightarrow{\text{tf-idf}}(b_j)$$

Thus, the mapping function can be represented as the product of the TF-IDF vector of document $d$ and a word-by-document ($W \times N$) TF-IDF matrix, which we denote as $X$ (here $^T$ denotes the matrix transpose, as usual):

$$\Phi(d) = X^T \, \overrightarrow{\text{tf-idf}}(d)$$

For simplicity, we shall assume from this point on that all vectors are already converted to a TF-IDF or similar numeric vector form.
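In order to compute the similarity between two documents $d_i$ and $d_j$, typically the cosine function (or the normalized dot product) between the vectors $\Phi(d_i)$ and $\Phi(d_j)$ is computed as follows:

$$\text{sim}(d_i, d_j) = \cos(\Phi(d_i), \Phi(d_j)) = \frac{\Phi(d_i)^T \Phi(d_j)}{\|\Phi(d_i)\| \, \|\Phi(d_j)\|}$$

The key challenge with topic modelling is choosing a good background document collection $B = \{b_1, \ldots, b_N\}$. A simple minimal criterion for a good background document collection is that each document in this collection should be maximally similar to itself and less similar to any other document:

$$\forall i \neq j: \quad 1 = \text{sim}(b_j, b_j) > \text{sim}(b_i, b_j)$$

As shown in McCrae et al. (2013), this property is satisfied by the following projection:

$$\Phi_{\text{ONETA}}(d) = (X^T X)^{-1} X^T d$$

And hence the similarity between two documents can be calculated as:

$$\text{sim}(d_i, d_j) = \cos(\Phi_{\text{ONETA}}(d_i), \Phi_{\text{ONETA}}(d_j))$$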
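As a rough illustration of the projection and similarity above, the following NumPy sketch builds the ONETA projection for a toy term-by-document matrix. The function names (`oneta_projection`, `cosine`) and all toy values are assumptions made for this example, and the sketch presumes that the columns of $X$ are linearly independent so that $X^T X$ is invertible.

```python
import numpy as np

def oneta_projection(X):
    """Return the N x W matrix (X^T X)^{-1} X^T for a W x N matrix X."""
    # Solving (X^T X) P = X^T avoids forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T)

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy background collection (assumed values): 4 terms (rows) x 3 documents (columns).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 2.0]])
P = oneta_projection(X)

# Every background document is mapped to its own unit topic vector,
# so the projected collection is (numerically) the identity matrix.
topics = P @ X

# Similarity of two new documents given as term vectors (assumed values).
d1 = np.array([1.0, 1.0, 0.0, 0.0])
d2 = np.array([0.0, 1.0, 1.0, 0.0])
similarity = cosine(P @ d1, P @ d2)
```

Because `topics` is the identity, any two distinct background documents have similarity zero and each is maximally similar to itself, which is the criterion stated above.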

Approximations
ONETA relies on the computation of a matrix inverse, which, using current practical algorithms, has a complexity that is approximately cubic in the number of background documents $N$, so the time spent calculating the inverse can grow very quickly.
We notice that $X$ is typically very sparse and, moreover, some rows of $X$ have significantly fewer non-zeroes than others (these rows are for terms with low frequency). Thus, if we take the first $N_1$ columns (documents) in $X$, it is possible to rearrange the rows of $X$ with the result that there is some $W_1$ such that rows with index greater than $W_1$ have only zeroes in the columns up to $N_1$. In other words, we take a subset of $N_1$ documents and enumerate the words in such a way that the terms occurring in the first $N_1$ documents are enumerated $1, \ldots, W_1$. Let $N_2 = N - N_1$ and $W_2 = W - W_1$. This row permutation does not affect the value of $X^T X$, and we can write the matrix $X$ as:

$$X = \begin{pmatrix} A & B \\ 0 & C \end{pmatrix}$$

where $A$ is a $W_1 \times N_1$ matrix representing term frequencies in the first $N_1$ documents, $B$ is a $W_1 \times N_2$ matrix containing term frequencies in the remaining documents for terms that are also found in the first $N_1$ documents, and $C$ is a $W_2 \times N_2$ matrix containing the frequency of all terms not found in the first $N_1$ documents.
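The row rearrangement described here can be sketched as follows. This is only an illustration on a small dense matrix: the helper name `block_partition` and the toy values are assumptions, and a practical implementation would operate on a sparse matrix.

```python
import numpy as np

def block_partition(X, n1):
    """Permute the rows of X so that X = [[A, B], [0, C]] for the first n1 columns."""
    # Terms (rows) that occur in at least one of the first n1 documents.
    in_first = np.any(X[:, :n1] != 0, axis=1)
    # Put those terms first, all remaining terms afterwards.
    order = np.concatenate([np.flatnonzero(in_first), np.flatnonzero(~in_first)])
    w1 = int(in_first.sum())
    Xp = X[order]
    return Xp[:w1, :n1], Xp[:w1, n1:], Xp[w1:, n1:], Xp

# Toy term-by-document matrix (assumed values): 5 terms x 4 documents.
X = np.array([[0.0, 0.0, 2.0, 1.0],
              [1.0, 0.0, 0.0, 3.0],
              [0.0, 0.0, 0.0, 4.0],
              [0.0, 2.0, 1.0, 0.0],
              [0.0, 0.0, 5.0, 0.0]])
A, B, C, Xp = block_partition(X, n1=2)

# The lower-left block is all zeroes, and X^T X is unchanged by the permutation.
lower_left_is_zero = bool(np.all(Xp[A.shape[0]:, :2] == 0))
gram_unchanged = bool(np.allclose(Xp.T @ Xp, X.T @ X))
```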
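Application of the well-known divide-and-conquer formula (Bernstein, 2005, p. 159) for matrix inversion yields the following easily verifiable matrix identity, given that we can find $C'$ such that $C'C = I$:

$$\begin{pmatrix} (A^T A)^{-1} A^T & -(A^T A)^{-1} A^T B C' \\ 0 & C' \end{pmatrix} \begin{pmatrix} A & B \\ 0 & C \end{pmatrix} = I$$

The inverse $C'$ is approximated by the Jacobi preconditioner, $J$, of $C^T C$:

$$C' \simeq J C^T = \text{diag}\left(\|c_1\|^{-2}, \ldots, \|c_{N_2}\|^{-2}\right) C^T$$

where $c_i$ denotes the $i$-th column of $C$.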
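A minimal NumPy sketch of this block-wise approximation is given below. The function name `approx_left_inverse` and the toy blocks are assumptions for illustration. The sketch presumes that $A$ has linearly independent columns and that $C$ has no all-zero column.

```python
import numpy as np

def approx_left_inverse(A, B, C):
    """Approximate left inverse of X = [[A, B], [0, C]].

    The left inverse C' of C is replaced by J C^T, where J is the Jacobi
    preconditioner of C^T C, i.e. diag(1 / ||c_i||^2) over the columns of C.
    """
    A_pinv = np.linalg.solve(A.T @ A, A.T)      # (A^T A)^{-1} A^T, shape N1 x W1
    J = 1.0 / np.sum(C * C, axis=0)             # diagonal entries of J
    C_approx = J[:, None] * C.T                 # J C^T, shape N2 x W2
    top = np.hstack([A_pinv, -A_pinv @ B @ C_approx])
    bottom = np.hstack([np.zeros((C.shape[1], A.shape[0])), C_approx])
    return np.vstack([top, bottom])

# Toy blocks (assumed values). The columns of this C are orthogonal, so J C^T
# is an exact left inverse here; in general it is only an approximation.
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
B = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]])
C = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]])
X = np.block([[A, B], [np.zeros((3, 2)), C]])

L = approx_left_inverse(A, B, C)
product = L @ X  # identity whenever J C^T C = I
```

Only the $N_1 \times N_1$ system for $A$ is solved exactly. The remaining block needs just the column norms of $C$, which is where the saving over a full inverse comes from.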

Normalization
A key factor in the effectiveness of topic-based methods is the appropriate normalization of the elements of the document matrix X. This is even more relevant for orthonormal topics as the matrix inversion procedure can be very sensitive to small changes in the matrix. In this context, we consider two forms of normalization, term and document normalization, which can also be considered as row/column normalizations of X.
A straightforward approach to normalization is to normalize each column (document) of $X$ to obtain a matrix $X'$ as follows:

$$x'_{ij} = \frac{x_{ij}}{\sqrt{\sum_{k=1}^{W} x_{kj}^2}}$$

If we calculate $X'^T X' = Y$ then we get that the $(i, j)$-th element of $Y$ is:

$$y_{ij} = \frac{\sum_{k=1}^{W} x_{ki} x_{kj}}{\sqrt{\sum_{k=1}^{W} x_{ki}^2} \, \sqrt{\sum_{k=1}^{W} x_{kj}^2}}$$

Thus, the diagonal of $Y$ consists of ones only and, due to the Cauchy-Schwarz inequality, we have that $|y_{ij}| \leq 1$, with the result that the matrix $Y$ is already close to $I$. Formally, we can use this to state a bound on $\|X'^T X' - I\|_F$, but in practice it means that the orthonormalizing matrix has more small or zero values. Previous experiments have indicated that, in general, term normalization such as TF-IDF is not as effective as using the direct term frequency in ONETA, so we do not apply term normalization.
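The effect of document (column) normalization can be checked numerically with a short sketch. The helper name `normalize_columns` and the toy matrix are assumptions, and the sketch presumes that no document is empty (no all-zero column).

```python
import numpy as np

def normalize_columns(X):
    """Scale every column (document) of X to unit Euclidean length."""
    norms = np.linalg.norm(X, axis=0)  # one norm per column
    return X / norms

# Toy term-by-document matrix (assumed values): 3 terms x 3 documents.
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0]])
X_norm = normalize_columns(X)
Y = X_norm.T @ X_norm

# The diagonal of Y is all ones and no entry exceeds 1 in absolute value.
diagonal_is_one = bool(np.allclose(np.diag(Y), 1.0))
bounded = bool(np.all(np.abs(Y) <= 1.0 + 1e-12))
```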

Application to grammar induction
The application to grammar induction is carried out by taking the rules and creating a single ground instance of each. That is, if we have a rule of the form

LEAVING FROM <CITY>

we replace the instance of <CITY> with a known terminal for this rule, e.g.,

leaving from Berlin

This reduces the task to that of string similarity, which can be processed by means of any string similarity function, such as the ONETA function described above. As such, the procedure is as follows:

For application, we used 20,000 Wikipedia articles, filtered to contain only those of over 100 words, giving us a corpus of 15.6 million tokens. We applied ONETA using document normalization but no term normalization and the value $N_1 = 5000$. These parameters were chosen based on the best results in previous experiments.
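The grounding step in this section can be sketched in Python as follows. All names here (`ground_rule`, `bow_cosine`, `most_similar`) and the candidate strings are hypothetical. The similarity function is a simple bag-of-words cosine used only as a stand-in; it is not ONETA, which would be passed in through the `similarity` parameter instead.

```python
import math
import re
from collections import Counter

def ground_rule(rule, terminals):
    """Lower-case a rule and replace each <NONTERMINAL> with a known terminal."""
    return re.sub(r"<(\w+)>",
                  lambda m: terminals[m.group(1).upper()],
                  rule.lower())

def bow_cosine(a, b):
    """Stand-in similarity: cosine over word counts (not ONETA)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def most_similar(query, candidates, similarity=bow_cosine):
    """Return the candidate string with the highest similarity to the query."""
    return max(candidates, key=lambda c: similarity(query, c))

# Ground the example rule with an assumed terminal for <CITY>.
grounded = ground_rule("LEAVING FROM <CITY>", {"CITY": "Berlin"})

# Hypothetical grounded candidates to compare against.
best = most_similar(grounded, ["departing from Berlin", "arriving at Berlin"])
```

Keeping the similarity function as a parameter means the same ranking code can be reused with any string similarity measure.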

Conclusions
The results show that such a naive approach is not directly applicable to the case of grammar induction; however, we believe that the subtle semantic similarities captured by topic modelling may yet prove useful for grammar induction. It is nevertheless clear from the presented results that the use of a topic model alone does not suffice to solve this task. We notice from the data that many of the distinctions rely on antonyms and stop words, especially distinctions such as 'to'/'from'. These are not captured by a topic model, as topic models generally ignore stop words and generally consider antonyms to be in the same topic, since they frequently occur together in text. When semantic similarity of the kind provided by topic modelling is applicable remains an open question.

References
Philipp Sorg and Philipp Cimiano. 2010. An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. In Natural Language Processing and Information Systems, pages 36-48. Springer.