UW-MRS: Leveraging a Deep Grammar for Robotic Spatial Commands

This paper describes a deep-parsing approach to SemEval-2014 Task 6, a novel context-informed supervised parsing and semantic analysis problem in a controlled domain. The system comprises a hand-built rule-based solution based on a preexisting broad-coverage deep grammar of English, backed up by an off-the-shelf data-driven PCFG parser, and achieves the best score reported among the task participants.


Introduction
SemEval-2014 Task 6 involves automatic translation of natural language commands for a robotic arm into structured "robot control language" (RCL) instructions (Dukes, 2013a). Statements of RCL are trees, with a fixed vocabulary of content words like prism at the leaves, and markup like action: or destination: at the nonterminals. The yield of the tree largely aligns with the words in the command, but there are frequently substitutions, insertions, and deletions.
A unique and interesting property of this task is the availability of highly relevant machine-readable descriptions of the spatial context of each command. Given a candidate RCL fragment describing an object to be manipulated, a spatial planner provided by the task organizers can automatically enumerate the set of task-world objects that match the description. This information can be used to resolve some of the ambiguity inherent in natural language.
The commands come from the Robot Commands Treebank (Dukes, 2013a), a crowdsourced corpus built using a game with a purpose (von Ahn, 2006). Style varies considerably, with missing determiners, missing or unexpected punctuation, and missing capitalization all common (Dukes, 2013b). Examples (1) and (2) show typical commands from the dataset.
(1) drop the blue cube
(2) Pick yellow cube and drop it on top of blue cube
Although the natural language commands vary in their degree of conformance to what might be called standard English, the hand-built gold standard RCL annotations provided with them (e.g. Figure 1) are commendable in their uniformity and accuracy, in part because they have been automatically verified against the formal before and after scene descriptions using the spatial planner.

Related Work
Automatic interpretation of natural language is a difficult and long-standing research problem. Some approaches have taken a relatively shallow view; for instance, ELIZA (Weizenbaum, 1966) used pattern matching to somewhat convincingly participate in an English conversation. Approaches taking a deeper view tend to parse utterances into structured representations. These are usually abstract and general-purpose in nature, e.g. the syntax trees produced by mainstream PCFG parsers and the DRS produced by the Boxer system (Bos, 2008). As a notable exception, Dukes (2014) presents a novel method to produce RCL output directly.
The English Resource Grammar (ERG; Flickinger, 2000) employed as a component in the present work is a broad-coverage precision hand-written unification grammar of English, following the Head-driven Phrase Structure Grammar theory of syntax (Pollard & Sag, 1994). The ERG produces Minimal Recursion Semantics (MRS; Copestake et al., 2005) analyses, which are flat structures that explicitly encode predicate-argument relations (and other data). A simplified MRS structure is shown in Figure 2. With minor modifications to allow determinerless NPs and some unexpected measure noun lexemes (as in "two squares to the left", etc.), the ERG yields analyses for 99% of the commands in the training portion of the Robot Commands Treebank.

ERG-based RCL Synthesis
This section outlines the method my system employs to synthesize RCL outputs from the MRS analyses produced by the ERG. The ERG provides a ranked list of candidate MRS analyses for each input. As a first step, grossly inappropriate analyses are ruled out, e.g. those proposing non-imperative main verbs or domain-inappropriate parts of speech ("block" as a verb). An attempt is made to convert each remaining analysis into a candidate RCL statement. If conversion is successful, the result is tested for coherence with respect to the known world state, using the supplied spatial planner. An RCL statement is incoherent if it involves picking up or moving an entity which does not exist, or if its command type (take, move, drop) is incompatible with the current state of the robot arm, e.g. drop is incoherent when the robot arm is not holding anything. Processing stops as soon as a coherent result is found.
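The control flow just described can be sketched as follows. This is a minimal illustration, not the system's actual code: the function name `first_coherent_rcl` and the three callbacks (`is_plausible`, `convert`, `is_coherent`) are hypothetical stand-ins for the filtering step, the MRS-to-RCL conversion, and the spatial-planner check respectively.

```python
from typing import Callable, Iterable, Optional, TypeVar

Analysis = TypeVar("Analysis")  # an MRS analysis, treated as opaque here
Rcl = TypeVar("Rcl")            # an RCL statement, treated as opaque here


def first_coherent_rcl(
    analyses: Iterable[Analysis],
    is_plausible: Callable[[Analysis], bool],
    convert: Callable[[Analysis], Optional[Rcl]],
    is_coherent: Callable[[Rcl], bool],
) -> Optional[Rcl]:
    """Return the first coherent RCL statement, walking analyses in rank order."""
    for analysis in analyses:
        # Step 1: rule out grossly inappropriate analyses.
        if not is_plausible(analysis):
            continue
        # Step 2: attempt conversion; None signals a failed conversion.
        rcl = convert(analysis)
        if rcl is None:
            continue
        # Step 3: check against the world state; stop at the first success.
        if is_coherent(rcl):
            return rcl
    # No analysis yielded a coherent statement: produce no output.
    return None
```

Because `analyses` is consumed lazily, lower-ranked candidates are never converted once a coherent result has been found.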

From MRS to RCL
Given an individual (imperative) MRS structure, the first step in conversion to RCL is to identify the sequence of top-level verbal predications. The INDEX property of the MRS provides an entry point. In a simple command like Example (1), the INDEX will point to a single verbal predication, whereas in a compound command such as Example (2), the INDEX will point to a coordination predication, which itself will have left and right arguments which must be visited recursively. Each verbal predication visited in this manner generates an event: RCL statement whose action: property is determined by looking up the verbal predicate in a short hand-written table (e.g. drop_v_cause maps to action: drop). If the predicate is not found in the table, the most common action move is guessed.
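The recursive visit can be illustrated with a small sketch. The dictionary encoding of predications (`kind`, `name`, `left`, `right`) is an assumption made for this example and is not the MRS format itself; of the table entries, only the drop mapping comes from the text, and `pick_v_up` is a hypothetical predicate name.

```python
# Hand-written action table. Only the first entry is taken from the text;
# the second uses a hypothetical predicate name for illustration.
ACTION_TABLE = {
    "drop_v_cause": "drop",
    "pick_v_up": "take",
}
DEFAULT_ACTION = "move"  # the most common action, guessed for unknown verbs


def collect_events(pred):
    """Visit the predication reached from INDEX and return one event per verb."""
    if pred["kind"] == "coordination":
        # Compound command: visit the left and right conjuncts recursively.
        return collect_events(pred["left"]) + collect_events(pred["right"])
    # Simple verbal predication: table lookup with a default.
    action = ACTION_TABLE.get(pred["name"], DEFAULT_ACTION)
    return [{"action": action, "predicate": pred["name"]}]
```

For a command shaped like Example (2), a coordination node with two verbal conjuncts yields two events in left-to-right order.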
Every RCL event: element must have an entity: subelement, representing the object to be moved by the action. Although in principle MRS makes no guarantees about the generalizability of the semantic interpretation of argument roles across different predicates, in practice the third argument of every verbal predicate relevant to this domain represents the object to be moved; hence, synthesis of an event: proceeds by inspecting the third argument of the MRS predicate which gave rise to it. Some types of event: also involve a destination: subelement, which encodes the location where the entity should come to rest. When present, a verbal predicate's fourth argument almost always identifies a prepositional predication holding this information, although there are exceptions (e.g. for move_v_from-to_rel it is the fifth). When no such resultative role is present, the first prepositional modifier (if any) of the verbal event variable is used for the destination: subelement.
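The argument-selection rules above can be sketched as a small function. The representation is an assumption for illustration: arguments are given as a positional list (so the third argument is at index 2), prepositional modifiers of the event variable are passed separately, and the name `select_roles` is hypothetical.

```python
ENTITY_POSITION = 2               # third argument: the object to be moved
DEFAULT_DESTINATION_POSITION = 3  # fourth argument: usually the destination
# Exceptions to the default; this entry is the one mentioned in the text.
DESTINATION_POSITION = {"move_v_from-to_rel": 4}  # fifth argument


def select_roles(predicate, args, prep_modifiers=()):
    """Pick the entity and destination fillers for one verbal predication."""
    entity = args[ENTITY_POSITION]
    position = DESTINATION_POSITION.get(predicate, DEFAULT_DESTINATION_POSITION)
    # The resultative role may be absent for this predicate.
    destination = args[position] if position < len(args) else None
    # Fall back to the first prepositional modifier of the event, if any.
    if destination is None and prep_modifiers:
        destination = prep_modifiers[0]
    return entity, destination
```

A predication with neither a resultative argument nor a prepositional modifier returns `None` as its destination, which corresponds to an event: without a destination: subelement.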
Synthesis of an entity: element from a referential index like y in Figure 2 or a spatial-relation: element from a prepositional predication proceeds in much the same way: the RCL type: or relation: is determined by a simple table lookup, and subelements are built based on connections indicated in the MRS. One salient difference is the treatment of predicates that are not found in their respective lookup tables. Whereas unknown command predicates default to the most common action move, unknown modifying spatial relations are simply dropped, and unknown entity types cause conversion to fail, on the theory that an incorrect parse is likely. Prudent rejection of suspect parses only rarely eliminates all available analyses, and generally helps to find the most appropriate one. On development data, the first analysis produced by the ERG was convertible for 87% of commands, and the first RCL hypothesis was spatially coherent for 96% of commands. These numbers indicate that the parse ranking component of the ERG works quite well.
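The three different treatments of unknown predicates can be made concrete with a short sketch. The table contents are illustrative: the action and entity-type entries echo examples given elsewhere in the text, while `above_p` and the exception name `ConversionError` are hypothetical.

```python
class ConversionError(Exception):
    """Raised when an analysis cannot be converted and should be rejected."""


ACTIONS = {"drop_v_cause": "drop"}
RELATIONS = {"above_p": "above"}            # hypothetical entry
ENTITY_TYPES = {"center_n_of_rel": "region"}


def lookup_action(predicate):
    # Unknown command predicates default to the most common action.
    return ACTIONS.get(predicate, "move")


def lookup_relation(predicate):
    # Unknown modifying spatial relations are dropped: None means "omit".
    return RELATIONS.get(predicate)


def lookup_entity_type(predicate):
    # Unknown entity types make the whole conversion fail, on the theory
    # that the parse itself is probably wrong.
    try:
        return ENTITY_TYPES[predicate]
    except KeyError:
        raise ConversionError(f"unknown entity predicate: {predicate}") from None
```

A caller that catches `ConversionError` and moves on to the next-ranked analysis gets the "prudent rejection" behaviour described above.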

Polishing the Rules
I split the 2500 task-supplied annotated commands into a randomly-divided training set (2000 commands) and development set (500 commands). Throughout this work, the development set was only used for estimating performance on unseen data and tuning system combination settings; the contents of the development set were never inspected for rule writing or error analysis purposes. Although the conversion architecture outlined above constitutes an effective framework, there were quite a few details to be worked through, such as the construction of the lookup tables, identification of cases requiring special handling, elimination of undesirable parses, modest extension of the ERG, etc. An error-analysis tool which performed a fine-grained comparison of the synthesized RCL statements with the gold-standard ones and agglomerated common error types proved invaluable when writing rules. Polishing the system in this manner took about two weeks of part-time effort; I maintained a log giving a short summary of each tweak (e.g. "map center_n_of_rel to type: region"). These tweaks required varying amounts of time to implement, from a few seconds up to perhaps an hour; system accuracy as a function of the number of such tweaks is shown in Figure 3.

Anaphora and Ellipsis
Some commands use anaphora to evoke the identity or type of previously mentioned entities. Typically, the pronoun "it" refers to a specific entity while the pronoun "one" refers to the type of an entity (e.g. "Put the red cube on the blue one."). Empirically, the antecedent is nearly always the first entity: element in the RCL statement, and this heuristic works well in the system. A small fraction of commands (< 0.5% of the training data) elide the pronoun, in commands like "Take the blue tetrahedron and place in front left corner." In principle these could be detected and accommodated through the addition of a simple mal-rule to the ERG (Bender et al., 2004), but for simplicity my system ignores this problem, leading to errors.
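The first-entity heuristic can be sketched as follows. The entity encoding is an assumption for this example: each entity is a dictionary, an anaphoric entity carries a hypothetical `pronoun` field, and `refers_to` is a stand-in for however the RCL statement records the antecedent.

```python
def resolve_anaphors(entities):
    """Resolve "it" and "one" against the first entity in the statement."""
    if not entities:
        return []
    antecedent = entities[0]
    resolved = [dict(antecedent)]
    for entity in entities[1:]:
        entity = dict(entity)  # copy so the input is left untouched
        pronoun = entity.pop("pronoun", None)
        if pronoun == "it":
            # "it" refers to the specific antecedent entity.
            entity["refers_to"] = antecedent["id"]
        elif pronoun == "one":
            # "one" borrows only the antecedent's type.
            entity["type"] = antecedent["type"]
        resolved.append(entity)
    return resolved
```

For "Put the red cube on the blue one.", the second entity keeps its own color and receives the type cube from the first.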

Robustness Strategies
If none of the analyses produced by the ERG result in coherent RCL statements, the system produces no output. On the one hand this results in quite a high precision: on the training data, 96.75% of the RCL statements produced are exactly correct. On the other hand, in some scenarios a lower precision result may be preferable to no result. The ERG-based system fails to produce any output for 3.1% of the training data inputs, a number that should be expected to increase for unseen data (since conversion can sometimes fail when the MRS contains unrecognized predicates).
In order to produce a best-guess answer for these remaining items, I employed the Berkeley parser (Petrov et al., 2006), a state-of-the-art data-driven system that induces a PCFG from a user-supplied corpus of strings annotated with parse trees. The RCL treebank is not directly suitable as training material for the Berkeley parser, since the yield of an RCL tree is not identical to (or even in 1-to-1 correspondence with) the words of the input utterance. In the interest of keeping things simple, I produced a phrase structure translation of the RCL treebank by simply discarding the elements of the RCL trees that did not correspond to any input, and inserting (X word) nodes for input words that were not aligned to any RCL fragment. The question of where in the tree to insert these X nodes is presumably of considerable importance, but again in the interest of simplicity I simply clustered them together with the first RCL-aligned word appearing after them. Unaligned input tokens at the end of the sentence were added as siblings of the root node. Figure 4 shows the phrase structure tree resulting from the translation of the RCL statement shown in Figure 1. Using this phrase structure treebank, the Berkeley parser tools make it possible to automatically derive a similar phrase structure tree for any input string, and indeed when the input string is a command such as the ones of interest in this work, the resulting tree is quite close to an RCL statement. Deletion of the X nodes yields a robust system that frequently produces the exact correct RCL, at least for those items where only input-aligned RCL leaves are required. The most common type of non-input-aligned RCL fragment is the id: element, identifying the antecedent of an anaphor. As with the ERG-based system, a heuristic selecting the first entity as the antecedent whenever an anaphor is present works quite well.
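The treebank translation and the later deletion of X nodes can be sketched as below. Several details are assumptions made for this illustration rather than facts about the system: trees are nested tuples, each aligned RCL leaf corresponds to exactly one input token, the RCL value is folded into the preterminal label (e.g. `type:cube`), and trailing unaligned tokens are attached under a hypothetical `TOP` wrapper next to the original root.

```python
def to_phrase_structure(rcl, words):
    """Translate an RCL tree into a phrase structure tree over `words`.

    Internal RCL nodes are (label, [children]); RCL leaves are
    (label, value, token_index), with token_index None when unaligned.
    """
    aligned = set()

    def collect(node):
        if len(node) == 3:
            if node[2] is not None:
                aligned.add(node[2])
        else:
            for child in node[1]:
                collect(child)

    def build(node):
        if len(node) == 3:
            label, value, index = node
            if index is None:
                return []  # discard RCL leaves with no corresponding input
            # Cluster the run of unaligned words directly before this word.
            run = []
            j = index - 1
            while j >= 0 and j not in aligned:
                run.append(("X", words[j]))
                j -= 1
            run.reverse()
            return run + [(f"{label}:{value}", words[index])]
        label, children = node
        kids = [k for child in children for k in build(child)]
        return [(label, kids)] if kids else []

    collect(rcl)
    built = build(rcl)
    last = max(aligned, default=-1)
    trailing = [("X", w) for w in words[last + 1:]]
    return ("TOP", built + trailing)


def strip_x(node):
    """Delete all X nodes from a phrase structure tree."""
    label, body = node
    if isinstance(body, str):
        return node  # preterminal over a word
    return (label, [strip_x(child) for child in body if child[0] != "X"])
```

Applying `strip_x` to a tree produced by `to_phrase_structure` recovers the input-aligned part of the RCL statement; non-aligned fragments such as id: elements would still have to be restored separately.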
Improving the output of the statistical system via tweaks of the type used in the ERG-based system was much more challenging, due to the relative impoverishedness of the information made available by the parser. Accurately detecting situations to improve without causing collateral damage proved difficult. However, the base accuracy of the statistical system was quite good, and when used as a back-off it improved overall system scores considerably, as shown in Figure 5.
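The back-off arrangement amounts to consulting the statistical system only when the ERG-based system returns nothing. A minimal sketch, in which `with_backoff` and the two system callables are hypothetical names:

```python
def with_backoff(primary, fallback):
    """Combine two systems: use `fallback` only when `primary` gives no output."""
    def combined(command):
        result = primary(command)
        # None signals that the primary system produced no coherent RCL.
        return result if result is not None else fallback(command)
    return combined
```

As long as the fallback always returns a result, so does the combined system.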

Results and Discussion
The combined system performs best on both portions of the data. Over the development data, the MRS-based system performs considerably better than the statistical system, in part due to the use of spatial planning in the MRS-based system (time did not permit adding spatial planning to the statistical system). The statistical system has a slightly higher recall than the MRS-only system without spatial planning, but the MRS-only system has a higher precision, markedly so on the evaluation data. This is consistent with previous findings combining precision grammars with statistical systems (Packard et al., 2014).
Figure 5: Evaluation results. ±SP indicates whether or not spatial planning was used. The robust and combined systems always returned a result, so P = R.
ERG coverage dropped precipitously from roughly 99% on the development data to 91% on the evaluation data. This is likely the major cause of the 10% absolute drop in the recall of the MRS-only system. The fact that the robust statistical system encounters a comparable drop on the evaluation data suggests that the text is qualitatively different from the (also held-out) development data. One possible explanation is that whereas the development data was randomly selected from the 2500 task-provided training commands, the evaluation data was taken as the sequentially following segment of the treebank, resulting in the same distribution of game-with-a-purpose participants (and hence writing styles) between the training and development sets but a different distribution for the evaluation data.
Dukes (2014) reports an accuracy of 96.53%, which appears to be superior to the present system; however, that system appears to have used more training data than was available for the shared task, and averaged scores over the entire treebank, making direct comparison difficult.
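The remark in the Figure 5 caption that P = R for systems that always return a result follows directly from the definitions, as the sketch below shows. It assumes exact-match scoring with `None` marking "no output"; the function name and the toy data in the usage note are illustrative and are not results from the paper.

```python
def precision_recall(outputs, gold):
    """Exact-match precision and recall; None in `outputs` means no answer."""
    produced = [(o, g) for o, g in zip(outputs, gold) if o is not None]
    correct = sum(1 for o, g in produced if o == g)
    # Precision is measured over the answers actually produced,
    # recall over all items in the gold standard.
    precision = correct / len(produced) if produced else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

With four gold items and outputs `["a", None, "c", "x"]` against gold `["a", "b", "c", "d"]`, two of three produced answers are correct, so precision is 2/3 while recall is 2/4. Once every item receives an answer, both denominators are the number of items and the two scores coincide.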