Broad-Coverage Semantic Dependency Parsing
Motivation
Syntactic dependency parsing has seen great advances in the past decade, owing in part to a relatively broad consensus on target representations, and in part to the successful execution of a series of CoNLL shared tasks. From this very active research area, accurate and efficient syntactic parsers have emerged for a wide range of natural languages. However, the predominant target representation in dependency parsing to date is the tree, in the formal sense that every node in the dependency graph is reachable from a distinguished root node by exactly one directed path. This assumption is an essential prerequisite for both the parsing algorithms and the machine learning methods in state-of-the-art syntactic dependency parsers. Unfortunately, it also means that these parsers are ill-suited for producing meaning representations, i.e. for moving from the analysis of grammatical structure to sentence semantics. Even if syntactic parsing can arguably be limited to tree structures, this is clearly not the case in semantic analysis, where a node will often be the argument of multiple predicates (i.e. have more than one incoming arc), and where it will often be desirable to leave some nodes unattached (with no incoming arcs), for semantically vacuous word classes such as particles, complementizers, or relative pronouns.
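The tree property can be made concrete in a few lines of code. The following is a minimal sketch (not taken from any of the parsers discussed here; the function name and the encoding of graphs as (head, dependent) arc lists are our own, purely for exposition) that tests whether a dependency graph is a tree in the above sense; graphs with re-entrant or unattached nodes fail this test.

```python
from collections import defaultdict

def is_tree(num_nodes, arcs, root=0):
    """arcs: iterable of (head, dependent) pairs over nodes 0..num_nodes-1."""
    in_degree = defaultdict(int)
    children = defaultdict(list)
    for head, dep in arcs:
        in_degree[dep] += 1
        children[head].append(dep)
    # Every non-root node needs exactly one incoming arc, the root none.
    if in_degree[root] != 0:
        return False
    if any(in_degree[n] != 1 for n in range(num_nodes) if n != root):
        return False
    # Every node must also be reachable from the root (this rules out cycles).
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(c for c in children[node] if c not in seen)
    return len(seen) == num_nodes
```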
Thus, this task seeks to stimulate the dependency parsing community to move towards more general graph processing, thereby enabling semantic dependency parsing, i.e. a more direct analysis of ‘who did what to whom’. We attach three sample semantic dependency graphs (as a PDF file), demonstrating the target representations we want to employ, for the WSJ sentence:
A similar technique is almost impossible to apply to other crops, such as cotton, soybeans, and rice.
Here, for example, ‘technique’ is the argument of at least the determiner (as the quantificational locus), the modifier ‘similar’, and the predicate ‘apply’. Conversely, the predicative copula, the infinitival ‘to’, and the particle marking the deep object of ‘apply’ arguably make no semantic contribution of their own. Besides calling for node re-entrancies and partial connectivity, an adequate representation of semantic dependencies will likely also exhibit higher degrees of non-projectivity than typical syntactic dependency trees; the small sketch below illustrates these structural properties.
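To make this concrete, the following fragment encodes part of such a graph for the running example. Note that the edge labels are invented for exposition and do not reproduce the actual annotations of any of the target representations.

```python
from collections import Counter

# Arcs as (head, label, dependent); labels are illustrative only.
arcs = [
    ("a",          "BV",   "technique"),  # determiner: quantificational locus
    ("similar",    "ARG1", "technique"),  # modifier
    ("apply",      "ARG2", "technique"),  # deep object of 'apply'
    ("almost",     "ARG1", "impossible"),
    ("impossible", "ARG1", "apply"),
    ("apply",      "ARG3", "crops"),
]

in_degree = Counter(dep for head, label, dep in arcs)
print(in_degree["technique"])            # 3 -> a re-entrant node
print(in_degree["is"], in_degree["to"])  # 0 0 -> vacuous words, unattached
```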
Besides its relation to syntactic dependency parsing, the proposed task also overlaps with Semantic Role Labeling (SRL). In much previous work, however, target representations draw on resources like PropBank and NomBank, which are limited to the identification and labeling of arguments of verbal and nominal predicates. A wide range of semantic phenomena, e.g. negation and other scopal embedding, comparatives, possessives, various types of modification, and even conjunction, typically remain unanalyzed in SRL. Thus, these target representations are partial to a degree that can prohibit downstream semantic processing, for example inference-based techniques. In this task, we require parsers to identify all semantic dependencies, i.e. to compute a representation that integrates all content words of a sentence into one structure. Nevertheless, we anticipate that relatively straightforward adaptations of existing SRL approaches can yield broad-coverage semantic dependency parsing.
In recent years, initial research on parsing into graph-structured representations has emerged, for example Sagae & Tsujii (2008), Das et al. (2010), Jones et al. (2013), Chiang et al. (2013), and Henderson et al. (2013). However, some of these studies are purely theoretical, and others are limited to small, non-standard data sets. We anticipate growing interest in this line of research, as well as emerging resources that can mature into broadly accepted target representations of semantic dependencies.
For these reasons, we expect that a SemEval 2014 task would be a good vehicle to pull together, better understand, and make more widely accessible candidate target annotations, as well as to energize and synchronize emerging work on algorithms and statistical models for parsing into these types of more semantic representations.
Training and Testing Data
For English, we are aware of three independent annotations over the venerable WSJ text underlying the Penn Treebank (PTB) that have the formal and linguistic properties we are looking for:
- DM: The reduction into bi-lexical dependencies of the Minimal Recursion Semantics analyses available through the HPSG annotation of the WSJ text (Flickinger et al., 2012).
- PAS: Predicate-Argument Structures extracted from another HPSG annotation of the PTB phrase structure trees (Miyao et al., 2004).
- PCEDT: The tectogrammatical analysis layer of the Prague Czech-English Dependency Treebank (Cinková et al., 2009).
These resources constitute parallel semantic annotations over the same common text, but to date they have not been related to each other and have hardly been used for training and testing data-driven analyzers. As is evident in our running example, which shows the DM, PAS, and PCEDT semantic dependency graphs for the sentence above, there are contentful differences among these annotations, and there is of course no single obvious (or even objective) truth. For this task, we will synchronize these resources at the sentence and tokenization levels (making sure they all annotate the exact same text), yielding approximately 750,000 annotated tokens in the WSJ domain. More background on the linguistic characterization of these representations, as well as on the task data format, is available through separate pages.
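As a rough illustration of what this synchronization entails, the sketch below compares the token sequences of the three resources. It assumes, purely hypothetically, that each resource has been exported to a simple staging format with one token per line, the surface form in the first tab-separated column, and blank lines between sentences; the actual task data format is documented on the separate pages mentioned above.

```python
def read_sentences(path):
    """Read a token-per-line file into a list of sentences (token lists)."""
    sentences, tokens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                tokens.append(line.split("\t")[0])  # surface form, column 1
            elif tokens:
                sentences.append(tokens)
                tokens = []
    if tokens:
        sentences.append(tokens)
    return sentences

def synchronized(paths):
    """True if all resources annotate the same sentence and token sequence."""
    all_sents = [read_sentences(p) for p in paths]
    return all(s == all_sents[0] for s in all_sents[1:])

# Hypothetical file names, for illustration only.
print(synchronized(["dm.txt", "pas.txt", "pcedt.txt"]))
```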