SAIL-GRS: Grammar Induction for Spoken Dialogue Systems using CF-IRF Rule Similarity

The SAIL-GRS system is based on a widely used approach originating from information retrieval and document indexing, the T F - IDF measure. In this implementation for spoken dialogue system grammar induction, rule constituent frequency and inverse rule frequency measures are used for estimating lexical and semantic similarity of candidate grammar rules to a seed set of rule pattern instances. The performance of the system is evaluated for the English language in three different domains, travel, tourism and ﬁnance and in the travel domain, for Greek. The simplicity of our approach makes it quite easy and fast to implement irrespective of language and domain. The results show that the SAIL-GRS system performs quite well in all three domains and in both languages.


Introduction
Spoken dialogue systems typically rely on grammars which define the semantic frames and respective fillers in dialogue scenarios (Chen et al., 2013). Such systems are tailored for specific domains for which the respective grammars are mostly manually developed (Ward, 1990;Seneff, 1992). In order to address this issue, numerous current approaches attempt to infer these grammar rules automatically (Pargellis et al., 2001;Meng and Siu, 2002;Yoshino et al., 2011;Chen et al., 2013).
The acquisition of grammar rules for spoken language systems is defined as a task comprising This work is licensed under a Creative Commons Attribution 4.0 International Licence. Page numbers and proceedings footer are added by the organisers. Licence details: http: //creativecommons.org/licenses/by/4.0/ of two subtasks (Meng and Siu, 2002;Iosif and Potamianos, 2007), the acquisition of: (i) Low-level rules These are rules defining domain-specific entities, such as names of locations, hotels, airports, e.g. CountryName: "USA", Date: "July 15th, 2014", CardType: "VISA" and other common domain multi-word expressions, e.g. DoYouKnowQ: "do you know".
(ii) High-level rules These are larger, frame-like rule patterns which contain as semantic slot fillers multi-word entities identified by low-level rules.
The shared task of Grammar Induction for Spoken Dialogue Systems, where our system participated, focused on the induction of high-level grammar rules and in particular on the identification and semantic classification of new rule patterns based on their semantic similarity to known rule instances.
Within this research framework, the work described in this paper proposes a methodology for estimating rule semantic similarity using a variation of the well-known measure of T F -IDF as rule constituent frequency vs. inverse rule frequency, henceforth CF -IRF .
In the remainder of this paper, we start in Section 2 by a detailed description of our system. Subsequently, in Section 3, we present the datasets used and the evaluation process, and in Section 4 we discuss our results. We conclude in Section 5 with a summary of our observations and directions for future work.

System Description
The SAIL-GRS system is based on a widely used approach in information retrieval and document indexing, the T F -IDF measure. T F -IDF is an approach that has found numerous applications in information management applications, such as document keyword extraction, (e.g., Dillon and Gray (1983)), document clustering, summarisation, (e.g., Gong and Liu (2001)), event clustering, (e.g., De Smet and Moens (2013)). In dialogue systems, T F -IDF has been used, among other applications, for discovering local coherence (Gandhe and Traum, 2007) and for acquiring predicate-argument rule fragments in an open domain, information extraction-based spoken dialogue system (Yoshino et al., 2011). In their approach, Yoshino et al. (2011) use the T F -IDF measure to determine the importance of a given word for a given domain or topic, so as to select the most salient predicate-argument structure rule patterns from their corpus.
In our implementation for spoken dialogue system grammar induction, rule constituent frequency (CF ) and inverse rule frequency (IRF ) measures are used for estimating lexical and semantic similarity of candidate grammar rules to a seed set of rule pattern instances. As illustrated in Table 1, the SAIL-GRS algorithm has two main steps, the training stage and the rule induction stage.
Input: known rule pattern instances Output: new candidate rule patterns Training stage: 1. Known rule instance parsing 2. Rule constituent extraction (uni-/bigrams) 3. Rule constituent frequency count (CF ) 4. Inverse rule frequency count (IRF ) 5. CF -IRF rule instance vector creation Rule induction stage: 1. Unknown text fragment parsing 2. Unigram & bigram extraction 3. Uni-/bigram CF -IRF value lookup 4. Creation of CF -IRF vector for unknown text fragment 5. Estimation of cosine similarity of unknown fragment to rule instances 6. New candidate rule selection & rule semantic category classification using maximum cosine similarity Table 1: The SAIL-GRS system algorithm.
In the first, the Training stage, known rule instances are parsed and, for each rule semantic category, the respective high-level rule pattern in-stances are acquired. These patterns are subsequently split into unigram and bigram constituents and the respective constituent frequencies and inverse rule frequencies are estimated. Finally, for each rule category, a vector representation is created for the respective rule pattern instance, based on the CF -IRF value of its unigram and bigram constituents.
In the second step, the Rule induction stage, the unknown text fragments are parsed and split into unigrams and bigrams. Subsequently, we lookup the known rule instance unigram and bigram representations for potential lexical matches to these new unigrams and bigrams. If these are found, then the new n-grams acquire the respective CF -IRF values found in the training instances and the respective CF -IRF vector for the unknown text fragments is created. Finally, we estimate the cosine similarity of this unknown text vector to each known rule vector. The unknown text fragments that are most similar to a given rule category are selected as candidate rule patterns and are classified in the known rule semantic category. An unknown text fragment that is selected as candidate rule pattern is assigned only to one, the most similar, rule category.

Experimental Setup
The overall objective in spoken dialogue system grammar induction is the fast and efficient development and portability of grammar resources. In the Grammar Induction for Spoken Dialogue Systems task, this challenge was addressed by providing datasets in three different domains, travel, tourism and finance, and by attempting to cover more than one language for the travel domain, namely English and Greek.
As illustrated in Table 2, the travel domain data for the two languages are comparable, with 32 and 35 number of known rule categories, for English and Greek, comprising of 982 and 956 high-level rule pattern instances respectively. The smallest dataset is the finance dataset, with 9 rule categories and 136 rule pattern instances, while the tourism dataset has a relatively low number of rule categories comprising of the highest number of rule pattern instances. Interestingly, as indicated in the column depicting the percent of unknown n-grams in the test-set, i.e. the unigrams and the bigrams without a CF -IRF value in the training data, the tourism domain test-set appears also to be the one with the greatest overlap with the training data, with a mere 0.72% and 4.84% of unknown unigrams and bigrams respectively.
For the evaluation, the system performance is estimated in terms of precision (P ), recall (R) and F -score measures, for the correct classification of an unknown text fragment to a given rule category cluster of pattern instances. In addition to these measures, the weighted average of the per rule scores is computed as follows: where N − 1 is the total number of rule categories, P i and R i are the per rule i scores for precision and recall, c i the unknown patterns correctly assigned to rule i, and n i the total number of correct rule instance patterns for rule i indicated in the ground truth data.

Results
The results of the SAIL-GRS system outperform the Baseline in all dataset categories, except the Tourism domain, as illustrated in Table 3. In this domain, both systems present the highest scores compared to the other domains. The high results in the travel domain are probably due to the high data overlap between the train and the test data, as discussed in the previous section and illustrated in Table 2. However, this domain was also the one with the highest average number of rule instances per rule category, compared to the other domains, thus presenting an additional challenge in the correct classification of unknown rule fragments.
We observe that the overall higher F measures of the SAIL-GRS system in the travel and finance domains are due to higher precision scores, whereas Baseline system displays higher recall but lower precision scores and lower F-measure in these domains.
The overall lowest scores for both systems are reached in the Travel domain for Greek, which is also the dataset with the lowest overlap with the training data. However, the performance of the SAIL-GRS system does not deteriorate to the same extent as the Baseline, the precision of which falls to a mere 0. 16-0.17, compared to 0.49-0.46 for the SAIL-GRS system.

Conclusion
In this work, we have presented the SAIL-GRS system used for the Grammar Induction for Spoken Dialogue Systems task. Our approach uses a fairly simple, language independent method for measuring lexical and semantic similarity of rule pattern instances. Our rule constituent frequency vs. inverse rule frequency measure, CF -IRF is a modification the T F -IDF measure for estimating rule similarity in the induction process of new rule instances.
The performance of our system in rule induction and rule pattern semantic classification was tested in three different domains, travel, tourism and finance in four datasets, three for English and an additional dataset for the travel domain in Greek. SAIL-GRS outperforms the Baseline in all datasets, except the travel domain for English. Moreover, our results showed that our system achieved an overall better score in precision and respective F-measure, in the travel and finance domains, even when applied to a language other than English. Finally, in cases of a larger percentage of unknown data in the test set, as in the Greek travel dataset, the smooth degradation of SAIL-GRS results compared to the Baseline indicates the robustness of our method.
A limitation of our system in its current version lies in the requirement for absolute lexical match with unknown rule unigrams and bigrams. Future extensions of the system could include rule constituent expansion using synonyms, variants or semantically or lexically similar words, so as to improve recall and the overall F-measure performance.   Table 3: Evaluation results for SAIL-GRS system compared to the baseline in all four datasets in terms of per rule Precision P , Recall R, and F-score F . In the grey column, P w , R w , and F w stand for the weighted average of the per rule precision, recall and F-score respectively, as defined in Equ. 1 and 2.