Task Description: Grammar Induction for Spoken Dialogue Systems


This task aims to foster the application of computational models of lexical semantics to the field of spoken dialogue systems (SDS), specifically to the problem of grammar induction. Grammars are a vital component of SDS because they represent the semantics of the domain of interest. Our focus in this task is on finite-state machine grammars.

The rules of such a grammar are divided into low-level and high-level rules. Low-level rules refer to basic concepts and consist of lexical items only. For example, instances of the low-level rule <city> might be "New York", "London", "Paris".
Higher in the grammar hierarchy are the high-level rules, which group semantically related fragments composed of both lexical terms and low-level rules.
For example, instances of the destination city concept <Dest.city> are "fly to <city>" and "arrive at <city>".

Using the above examples, the sentence "I want to fly to Paris" is first parsed as "I want to fly to <city>" and finally as "I want to <Dest.city>".
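
The two-step parse above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the toy grammar contains just the example rules from the text, and the helper names (apply_rules, parse) and the plain string-substitution strategy are hypothetical, not part of any task tooling.

```python
import re

# Toy grammar built only from the examples in the text (illustrative assumption).
LOW_LEVEL = {"<city>": ["New York", "London", "Paris"]}
HIGH_LEVEL = {"<Dest.city>": ["fly to <city>", "arrive at <city>"]}


def apply_rules(text, rules):
    """Replace every instance of every rule with the rule's label."""
    for label, instances in rules.items():
        # Try longer instances first so that multi-word terms take priority.
        for instance in sorted(instances, key=len, reverse=True):
            # Match the instance only when it is not glued to a word character.
            pattern = r"(?<!\w)" + re.escape(instance) + r"(?!\w)"
            text = re.sub(pattern, lambda match: label, text)
    return text


def parse(sentence):
    """Apply low-level rules first, then high-level rules."""
    step1 = apply_rules(sentence, LOW_LEVEL)
    step2 = apply_rules(step1, HIGH_LEVEL)
    return step1, step2


step1, step2 = parse("I want to fly to Paris")
print(step1)  # I want to fly to <city>
print(step2)  # I want to <Dest.city>
```

Applying the low-level rules before the high-level ones mirrors the hierarchy described above: high-level instances such as "fly to <city>" can only match once the city name has been replaced by its label.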

Although the community has invested much effort in automatically inducing and populating low-level rules from web resources and corpora, the problem of high-level rule induction has received far less attention. In addition to the theoretical merits of grammar induction models, the problem is of practical importance for the rapid prototyping and development of SDS.

Low-level rule induction typically consists of two steps: multi-word term extraction [1] (e.g., “New York”, “John F. Kennedy airport”) and the induction/population of low-level rules [2, 4]. These steps can be addressed by well-studied approaches, including named entity recognition, estimation of semantic relatedness between words/terms [2, 4], and clustering of semantically similar words/terms for the induction of low-level rules [2]. In this task, we focus exclusively on the high-level rule induction/population part of the automatic grammar creation process. Thus, we assume that the low-level rules are known.

The evaluation campaign consists of one task as follows:

Task definition: Clustering of identified chunks into high-level grammar rules: creation of clusters consisting of semantically similar chunks (using a fragment semantic similarity metric). For example, the following two chunks: “depart from <City>” and “fly out of <City>” are based on the low-level rule <City> and they refer to the concept of departure city.
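
As a rough illustration of the task definition, the Python sketch below assigns an unseen chunk to the high-level rule whose known chunks it resembles most. Everything in it is an assumption made for illustration: the rule label <DepartureCity>, the seed chunks, the helper names, and in particular the token-overlap (Jaccard) similarity, which is only a placeholder for a real fragment semantic similarity metric.

```python
def jaccard(chunk_a, chunk_b):
    """Token-overlap similarity; a stand-in for a semantic similarity metric."""
    a = set(chunk_a.lower().split())
    b = set(chunk_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def assign_chunk(chunk, rules, similarity=jaccard):
    """Return the rule label with the highest average similarity to the chunk."""
    def score(label):
        members = rules[label]
        return sum(similarity(chunk, member) for member in members) / len(members)
    return max(rules, key=score)


# Hypothetical high-level rules with a few known chunks each.
RULES = {
    "<DepartureCity>": ["depart from <City>", "fly out of <City>"],
    "<Dest.city>": ["fly to <City>", "arrive at <City>"],
}

print(assign_chunk("departing from <City>", RULES))  # <DepartureCity>
print(assign_chunk("flying to <City>", RULES))       # <Dest.city>
```

The similarity function is passed as a parameter so that a domain-aware metric can replace the lexical placeholder without changing the assignment logic.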

The task boils down to a semantic similarity estimation problem. Our research indicates that estimating semantic similarity between grammar fragments for various SDS domains differs substantially from general-purpose word-, phrase- and sentence-level semantic similarity estimation (the sentence-level case was considered in the previous two SemEval conferences by the STS tasks). The major difference is the need to condition the estimation of similarity on the domain semantics. For example, crucial domain concepts are differentiated by slight variations of the lexical content, e.g., “flight to <City>” vs. “flight out of <City>”. Also, it is not clear how the existing sentence-level compositional models [3] can be scaled down to chunks. Last but not least, the task aims to provide a complementary testbed for compositional semantic models, which are currently evaluated on certain syntactic constructions (e.g., noun-noun, adjective-noun) and/or complete sentences.

We acknowledge the importance of cross-domain and cross-language systems, so we propose the use of data covering three domains and at least two languages for evaluation purposes. Training data will be released for two domains: (i) air travel (flight, hotel and car bookings), and (ii) tourism (information on points of interest such as restaurants and movies). The test data will include the above domains plus a new domain. The data for the air travel domain will be available in two languages (English and Greek), while the data for the tourism domain will be in English.

2) Evaluation
The task will be evaluated as a clustering problem. The goal is to assign each given chunk to the appropriate cluster, where each cluster stands for a particular high-level grammar rule. Note that each chunk may be assigned to a single cluster only (one-to-one clustering). The clustering performance will be computed in terms of precision (P), recall (R) and F-measure (F), where F = (2*P*R)/(P+R). These three scores will be reported for each cluster, as well as across all clusters by computing the weighted average of the per-cluster scores. The evaluation tool will be developed in a commonly used scripting language, e.g., Perl, that is available on the majority of operating systems.
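
The scoring described above might be computed along the following lines. This is a sketch in Python, not the official evaluation tool. It rests on two assumptions that the text does not state: gold and predicted clusters share the same rule labels, and the weighted average uses the gold cluster sizes as weights. The function name and the example data are hypothetical.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Per-cluster and weighted-average (P, R, F).

    gold, predicted: dicts mapping each chunk to exactly one rule label.
    """
    labels = sorted(set(gold.values()) | set(predicted.values()))
    gold_sizes = Counter(gold.values())
    pred_sizes = Counter(predicted.get(chunk) for chunk in gold)
    correct = Counter(
        label for chunk, label in gold.items() if predicted.get(chunk) == label
    )

    per_cluster = {}
    for label in labels:
        p = correct[label] / pred_sizes[label] if pred_sizes[label] else 0.0
        r = correct[label] / gold_sizes[label] if gold_sizes[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        per_cluster[label] = (p, r, f)

    # Assumption: each cluster is weighted by its number of gold chunks.
    total = sum(gold_sizes.values())
    weighted = tuple(
        sum(gold_sizes[label] * per_cluster[label][i] for label in labels) / total
        for i in range(3)
    )
    return per_cluster, weighted


# Hypothetical gold standard and system output.
gold = {
    "depart from <City>": "<Dep.city>",
    "fly out of <City>": "<Dep.city>",
    "fly to <City>": "<Dest.city>",
    "arrive at <City>": "<Dest.city>",
}
predicted = dict(gold)
predicted["fly out of <City>"] = "<Dest.city>"  # one wrong assignment

per_cluster, (avg_p, avg_r, avg_f) = evaluate(gold, predicted)
print(per_cluster)
print(round(avg_p, 3), round(avg_r, 3), round(avg_f, 3))  # 0.833 0.75 0.733
```

In this example, <Dep.city> obtains P = 1 and R = 0.5, while <Dest.city> obtains P = 2/3 and R = 1; with two gold chunks per cluster, the weighted averages are the plain means of the per-cluster scores.
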
3) References
[1] K. Frantzi and S. Ananiadou. "Automatic Term Recognition Using Contextual Cues". In Proc. of IJCAI, 1997.
[2] E. Iosif and A. Potamianos. "A Soft-Clustering Algorithm for Automatic Induction of Semantic Classes". In Proc. of Interspeech, 2007.
[3] J. Mitchell and M. Lapata. "Composition in Distributional Models of Semantics". Cognitive Science, 34 (8), 1388–1429, 2010.
[4] H. M. Meng and K. C. Siu. "Semi-Automatic Acquisition of Semantic Structures for Understanding Domain-Specific Natural Language Queries". IEEE Transactions on Knowledge and Data Engineering, 14(1):172–181, 2002.

Contact Info


  • Elias Iosif, Dept. of ECE, Technical University of Crete
  • Giannis Klasinas, Dept. of ECE, Technical University of Crete
  • Alex Potamianos, Dept. of ECE, Technical University of Crete
  • Katerina Louka, VoiceWeb SA

Email: Elias Iosif, iosife@telecom.tuc.gr

Other Info


  • 30/06/2014 Task results are now online. Download here
  • 15/03/2014 Test sets for all 4 domains/languages released. Submission until March 30
  • 17/02/2014 We released the training data in CSV format (plus some spelling corrections), as well as code for a baseline system and the evaluation. Available under the data and tools tab.