Representations

Motivation

As additional background to the task, this page provides a high-level description of each of our three types of semantic dependency graphs, in lexicographic order.  For a first impression, please consult our example graphs for a single sentence.  For further information, please consult the external documents linked in each section.

DM: MRS-Derived Semantic Dependencies

These semantic dependencies come from the annotation of Sections 00–21 of the WSJ Corpus with gold-standard HPSG analyses provided by the LinGO English Resource Grammar (ERG).  Among other layers of linguistic analysis, this resource—dubbed DeepBank by Flickinger et al. (2012)—includes logical-form meaning representations in the framework of Minimal Recursion Semantics (MRS).

DM bi-lexical semantic dependencies, as used in this task, result from a two-stage ‘reduction’ (i.e. simplification) of full MRS analyses.  First, Oepen & Lønning (2006) define a conversion from MRS to variable-free Elementary Dependency Structures (EDS); this step is lossy, in that some scope-related information is discarded.  Second, while EDS typically contains dependency nodes that correspond to non-lexical units (e.g. construction-specific semantics, as in nominal compounding or the formation of bare noun phrases), it can be further reduced into strictly bi-lexical form through the conversion defined by Ivanova et al. (2012).  Although some aspects of construction-specific semantics can be projected onto binary word-to-word dependencies, this step, too, is not information-preserving, i.e. our DM semantic dependency graphs present a proper subset of the information encoded in the full, original MRS.

Well-formed DM graphs have a unique top node, but the structural root(s) of the graph need not be the top node.  DM graphs are predominantly semi-connected (with about ten percent exceptions), i.e. all nodes are either reachable from the top by at least one undirected path, or they form ‘unconnected’ singletons (have no incoming or outgoing edges); such singleton nodes correspond to semantically vacuous tokens, e.g. complementizers or relative pronouns.  There can be re-entrancies in DM graphs, for example in the analysis of control predicates or relative clauses, but there are no cycles.  Role labels in DM (ARG1, ..., ARGn) are semantically ‘bleached’, in the formal sense of allowing unambiguous, per-predicate argument labeling but not aiming to provide a globally consistent role labeling, as underlies for example the notions of proto-agents and proto-patients in PropBank.  Copestake (2009) provides a contrastive discussion of the two points of view.  Some construction semantics, at the DM level, is encoded through additional role labels, e.g. appos, compound, measure, part, or poss, for apposition, compounding, measure phrases, partitives, or possessives, respectively.  Unlike in the underlying ERG (and thus MRS), coordinate structures in DM adopt what is often called a Mel'čukian analysis of coordination (Mel'čuk 1988).
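The graph properties listed above can be checked mechanically.  The following is a minimal sketch, not part of the official task tooling: the function name graph_properties, the (head, dependent, label) edge triples, and the toy graph for a control construction are all assumptions made for illustration and do not reproduce an actual DM annotation.

```python
from collections import defaultdict


def graph_properties(nodes, edges, top):
    """Report structural properties of a bi-lexical dependency graph.

    nodes: list of node ids (e.g. token positions)
    edges: list of (head, dependent, label) triples
    top:   id of the designated top node
    """
    incoming = defaultdict(list)
    outgoing = defaultdict(list)
    neighbours = defaultdict(set)
    for head, dep, _label in edges:
        outgoing[head].append(dep)
        incoming[dep].append(head)
        neighbours[head].add(dep)
        neighbours[dep].add(head)

    # Singletons have neither incoming nor outgoing edges.
    singletons = {n for n in nodes if not incoming[n] and not outgoing[n]}
    # Re-entrant nodes have more than one incoming edge.
    reentrant = {n for n in nodes if len(incoming[n]) > 1}
    # Structural roots have outgoing edges but no incoming ones.
    roots = {n for n in nodes if outgoing[n] and not incoming[n]}

    # Undirected reachability from the top node.
    seen, stack = {top}, [top]
    while stack:
        current = stack.pop()
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                stack.append(other)
    semi_connected = all(n in seen or n in singletons for n in nodes)

    # Acyclicity via Kahn's algorithm: every node must be removable
    # once all of its heads have been removed.
    indegree = {n: len(incoming[n]) for n in nodes}
    queue = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while queue:
        current = queue.pop()
        removed += 1
        for dep in outgoing[current]:
            indegree[dep] -= 1
            if indegree[dep] == 0:
                queue.append(dep)
    acyclic = removed == len(nodes)

    return {
        "singletons": singletons,
        "reentrant": reentrant,
        "roots": roots,
        "semi_connected": semi_connected,
        "acyclic": acyclic,
    }


# Toy graph (hypothetical): 1=Kim 2=wants 3=to 4=sleep 5=.
toy_nodes = [1, 2, 3, 4, 5]
toy_edges = [(2, 1, "ARG1"), (2, 4, "ARG2"), (4, 1, "ARG1")]
props = graph_properties(toy_nodes, toy_edges, top=2)
print(props)
```

In this toy graph, node 1 is re-entrant because it is an argument of both predicates, while nodes 3 and 5 are singletons; the graph is semi-connected and acyclic.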

PAS: Enju Predicate–Argument Structures

This data set is derived from the HPSG-based annotation of the Penn Treebank, which is used for training the wide-coverage HPSG parser Enju.  With a wide-coverage grammar and a probabilistic model obtained from this treebank, Enju can effectively analyze syntactic/semantic structures of English sentences and output phrase structures and predicate-argument structures.  The Enju parser has successfully been applied to various NLP applications, including information extraction and machine translation.

While DM comes from manually annotated HPSG analyses, the HPSG treebank of Enju is automatically converted from the original bracketing annotations of the Penn Treebank, by the method of Miyao, Ninomiya and Tsujii (IJCNLP 2004).  The conversion program has been carefully tuned, but the automatic conversion may still produce erroneous analyses due to mis-application of conversion rules and/or annotation errors in the original treebank.

The data set provided in this shared task is a simplified version of predicate-argument structures of the Enju HPSG treebank.  Semantic dependencies, such as semantic subject and object, are represented as word-to-word dependencies, while other linguistic features and scope information are removed.

For full details, refer to the following documents on the Enju parser:

  • Enju Output Specifications introduces the output format of the Enju parser.  The document contains information about predicate argument structures, which are the main target of this shared task (see Section 3 "Predicate argument structures" of the document).  Dependency labels provided in this shared task are a concatenation of a value of "pred" and an argument label ("arg1", "arg2", etc.).
  • Enju XML Format provides examples of the XML format of the Enju parser, which represents predicate argument structures as well as phrase structures.  The data set for this shared task is derived from this format by removing phrase structure information.
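As a small illustration of the label scheme described in the first item above, the sketch below joins a "pred" value with an argument label.  The function name pas_label, the underscore separator, and the sample "pred" value verb are assumptions made for illustration; the actual separator and spelling should be checked against the released data.

```python
def pas_label(pred: str, arg: str, sep: str = "_") -> str:
    """Concatenate a "pred" value and an argument label into one label.

    The separator is an assumption; it is a parameter so that it can be
    adjusted to match the released data.
    """
    return f"{pred}{sep}{arg}"


label = pas_label("verb", "arg1")
print(label)
```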

PCEDT: Parts of the Tectogrammatical Layer

The Prague Czech-English Dependency Treebank 2.0 (PCEDT 2.0) is a manually parsed Czech-English parallel corpus.  The English part contains the entire Wall Street Journal section of the Penn Treebank (http://catalog.ldc.upenn.edu/LDC99T42); these texts have been translated and 1:1 sentence-aligned to Czech.  Only those PTB documents which have both POS and structural annotation (total of 2312 documents) have been made part of this release.  We use only the English part of the treebank for the present task.

Each language part is enhanced with a comprehensive manual linguistic annotation in the PDT 2.0 style (Prague Dependency Treebank 2.0, http://catalog.ldc.upenn.edu/LDC2006T01).  The main features of this annotation style are:

  • dependency structure of the content words and coordinating and similar structures (function words are attached as their attribute values)
  • semantic labeling of content words and types of coordinating structures
  • argument structure, including a “valency” lexicon for both languages
  • ellipsis and anaphora resolution

This annotation style is called tectogrammatical annotation and it constitutes the tectogrammatical layer in the corpus. The English manual tectogrammatical annotation was built above an automatic transformation of the original phrase-structure annotation of the Penn Treebank into surface dependency (analytical) representations, using the following additional linguistic information from other sources:

Only a subset of the original tectogrammatical annotation is used for the SDP task:

  • The set of graph nodes is equivalent to the set of surface tokens.  PCEDT t-trees contain additional nodes representing elided elements; these nodes are not available in SDP data.  On the other hand, functional and punctuation tokens are visible in the SDP data, although they are normally hidden in PCEDT t-trees.
  • The attachment of function words to content words in PCEDT is ignored.  Most function nodes remain unconnected in SDP graphs (exception: paratactic structures).
  • Coreference links are ignored.
  • The SDP data does not contain grammatemes.

Most dependency labels mark semantic roles of arguments.  The labels are called functors in PCEDT.  Their meaning is detailed in the documentation (http://ufal.ms.mff.cuni.cz/pcedt2.0/publications/TR_En.pdf); see page 107 onwards.

For technical reasons, dependency structures in PCEDT are always rooted trees, even in paratactic constructions where the relations are not true dependencies. In the process of conversion to the SDP file format, true dependency relations were extracted.  For instance, coordinate actors in PCEDT would be attached to the conjunction, and only the conjunction would be attached to the verb.  The former attachments would be labeled ACT (actor) while the latter would be a technical link labeled CONJ.  In contrast, the SDP data directly shows ACT links from the verb to the conjuncts, thus showing the true bi-lexical dependencies.  In addition, there are also links from the conjunction to the conjuncts and they are labeled CONJ.member.  These links preserve the paratactic structure (which can even be nested) and the type of coordination. The unconnected function words apart, paratactic constructions are the only areas where the SDP graphs are not trees.
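The treatment of coordination described above can be sketched roughly as follows.  This is an illustrative reconstruction, not the organizers' conversion code: the function name tree_to_graph, the child -> (parent, label) encoding of the tree, the use of None for the artificial root, and the label of the main predicate in the toy example are assumptions.  The sketch also treats every child of a coordination node as a conjunct, which is a simplification.

```python
def tree_to_graph(tree, coord_labels=("CONJ",)):
    """Extract bi-lexical edges from a rooted tree with coordination nodes.

    tree: dict mapping each node to (parent, label); parent is None for
          children of the artificial root.
    Returns a sorted list of (head, dependent, label) triples.
    """
    def is_coord(node):
        return node in tree and tree[node][1] in coord_labels

    edges = set()
    for node, (parent, label) in tree.items():
        if parent is None:
            continue  # attached to the artificial root: no edge
        if is_coord(parent):
            # Preserve the paratactic structure and coordination type.
            edges.add((parent, node, tree[parent][1] + ".member"))
        if is_coord(node):
            continue  # the technical link to the conjunction is dropped
        # Climb through (possibly nested) coordination nodes to find
        # the true head of the dependency.
        head = parent
        while head is not None and is_coord(head):
            head = tree[head][0]
        if head is not None:
            edges.add((head, node, label))
    return sorted(edges)


# Toy tree (hypothetical) for "Kim and Sandy sleep".
toy_tree = {
    "sleep": (None, "PRED"),
    "and": ("sleep", "CONJ"),
    "Kim": ("and", "ACT"),
    "Sandy": ("and", "ACT"),
}
toy_edges = tree_to_graph(toy_tree)
for edge in toy_edges:
    print(edge)
```

In the resulting graph each conjunct has two incoming edges, one ACT edge from the verb and one CONJ.member edge from the conjunction, which is why the output is no longer a tree.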

Original PCEDT trees always have an artificial root node that does not correspond to any input token.  In the extracted SDP graphs, nodes that were direct children of the artificial root in PCEDT are marked as top nodes.  Typically there is just one top node per graph.  Coordinating conjunctions are not marked as additional top nodes even though they have several outgoing and no incoming edges.

Following is the original PCEDT tectogrammatical tree for our running example:  A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice.

Contact Info

Organizers

  • Dan Flickinger
  • Jan Hajič
  • Marco Kuhlmann
  • Yusuke Miyao
  • Stephan Oepen
  • Yi Zhang
  • Daniel Zeman

sdp-organizers@emmtee.net

Other Info

Announcements

[22-apr-14] Complete results (system submissions and official scores) as well as the gold-standard test data are now available for public download.

[31-mar-14] We have received submissions from nine teams; a draft summary of evaluation results has been emailed to participating teams.

[25-mar-14] We have posted some additional, task-specific instructions for how to submit system results to the SemEval evaluation; please make sure to follow these requirements carefully.

[22-mar-14] The test data (and corresponding ‘companion’ syntactic analyses, for use in the open track) are now available to registered participants; please see the task mailing list for details.

[08-mar-14] We have released a minor update to the companion archive, adding a handful of missing dependencies and fixing a problem in the file format.

[05-feb-14] We have posted the description of a baseline approach and experimental results on the suggested development sub-set of our training data (Section 20) on the evaluation page; on the same page, we have further specified the mechanics of submitting results to the evaluation.

[17-jan-14] Version 1.0 of the ‘companion’ data for the open track is now available, providing syntactic analyses (in phrase structure and bi-lexical dependency form) as overlays to our training data.  Please see the file README.txt in the companion archive for details.

[13-jan-14] We are releasing an update to the training data today, making a number of minor improvements to the DM and PCEDT graphs; also, we are now providing an on-line interface to search and explore visually the target representations for this task.  For details, please see our task-specific mailing list.

[12-dec-13] Some 750,000 tokens of WSJ text, annotated in our three semantic dependency formats, will become available for download tomorrow.  To obtain the data, prospective participants need to enter a no-cost evaluation license with the Linguistic Data Consortium (LDC).  For access to the license form, please subscribe to our spam-protected mailing list.  Next, we are working to prepare our syntactic ‘companion’ data (to serve as optional input in the open track), which we expect to release in early January.

[24-nov-13] Version 1.1 of the trial data is now available, adding missing lemma values and streamlining argument labels in the DM format, removing a handful of items that used to have empty graphs in PAS, and generally aligning all items at the level of individual tokens (leaving 189 sentences in our trial data).  This last move means that all three formats now uniformly use single-character Unicode glyphs for quote marks, dashes, ellipses, and apostrophes (rather than multi-character LaTeX-style approximations, as were used in the original ASCII release of the text).  Furthermore, we encourage all interested parties, including prospective participants, to subscribe to our spam-protected mailing list, where we will post updates a little more frequently than on the general task web site.

[07-nov-13] We have clarified the interpretation of the top column (and renamed it from the earlier root) and elaborated the discussion of graph properties in the various formats.  We will continue to extend and revise the documentation on our three types of dependency graphs, but only announce such incremental changes here when they affect the data format.

[04-nov-13] A 198-sentence subset of what will be the training data has been released as trial data, to exemplify the file format and type of annotations available.  Please do get in touch if you see anything surprising!

[28-oct-13] We are in the process of finalizing the task description, posting some example dependencies, and making available some trial data.  For the time being, please consider these pages very much a work in progress, i.e. contents and form will be subject to refinement over the next few days.