Evaluation

Evaluation Set-Up

Systems participating in the task will be evaluated based on the accuracy with which they can produce semantic dependency graphs for previously unseen text, measured relative to the gold-standard test data.  The key measures for this evaluation will be labeled and unlabeled precision and recall with respect to predicted dependencies (predicate-role-argument triples), and labeled and unlabeled exact match with respect to complete semantic dependency graphs.  In both contexts, identification of the top node(s) of a graph will be treated as the identification of additional, ‘virtual’ triples.  Below and in other task-related contexts, we will abbreviate these metrics as (a) labeled precision, recall, and F1: LP, LR, LF; (b) unlabeled precision, recall, and F1: UP, UR, UF; and (c) labeled and unlabeled exact match: LM, UM.
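To make these definitions concrete, the following Python sketch computes the labeled metrics over sets of triples, counting each top node as an additional ‘virtual’ triple.  The graph representation (a pair of top nodes and predicate-role-argument tuples) and the function name are assumptions chosen for illustration only; the official scorer in the SDP toolkit remains the authoritative implementation.

    def score_graphs(gold_graphs, system_graphs):
        """Labeled precision, recall, F1, and exact match over triples.
        Each graph is assumed to be a pair (tops, triples), where triples is
        a set of (predicate, role, argument) tuples; this representation is
        hypothetical and chosen only for illustration."""
        tp = fp = fn = exact = 0
        for (gold_tops, gold_triples), (sys_tops, sys_triples) in zip(gold_graphs, system_graphs):
            # Top nodes enter the comparison as additional 'virtual' triples.
            gold = gold_triples | {("TOP", "top", t) for t in gold_tops}
            system = sys_triples | {("TOP", "top", t) for t in sys_tops}
            tp += len(gold & system)
            fp += len(system - gold)
            fn += len(gold - system)
            exact += int(gold == system)
        lp = tp / (tp + fp) if tp + fp else 0.0
        lr = tp / (tp + fn) if tp + fn else 0.0
        lf = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
        lm = exact / len(gold_graphs) if gold_graphs else 0.0
        return lp, lr, lf, lm

The unlabeled variants (UP, UR, UF, UM) are obtained in the same way after dropping the role component from every triple.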

More practically speaking, as the task enters the evaluation period in mid-March, we will make available three copies of the test data, one for each target annotation, in the same token-oriented, tab-separated format as the training data, but with only columns (1) id, (2) form, (3) lemma, and (4) pos pre-filled.  Participating teams are expected to fill in the remaining columns (i.e. the actual semantic dependency graphs) and submit the resulting files (one per target format) by the end of March.  Even though our three target formats annotate the exact same text (i.e. are sentence- and token-aligned), we provide three instances of the test data, as there may be variation in lemmatization and PoS assignment (unlike PAS and PCEDT, the DM annotations did not build on gold-standard PoS tags from the PTB).  For the open track (see below), we will further make available, along with the test data, the same range of ‘companion’ syntactic analyses as are provided for the training data.
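For illustration, a minimal reader for the pre-filled test files might look as follows.  It assumes the same conventions as the training data: blank lines separate sentences, lines starting with ‘#’ carry sentence identifiers, and token lines are tab-separated with the columns listed above.  This is a convenience sketch only, not an official reference reader.

    def read_sdp(path):
        """Read a token-oriented, tab-separated SDP file into a list of
        sentences, each a list of token rows such as [id, form, lemma, pos].
        Assumes blank-line sentence separation and '#' comment lines, as in
        the training data; intended as an illustrative sketch only."""
        sentences, current = [], []
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:                 # blank line ends a sentence
                    if current:
                        sentences.append(current)
                        current = []
                elif line.startswith("#"):   # sentence identifier or comment
                    continue
                else:
                    current.append(line.split("\t"))
        if current:
            sentences.append(current)
        return sentences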

Sub-Tasks

For all three target formats, there will be two sub-tasks: a closed track and an open track.  Systems participating in the closed track can only be trained on the gold-standard semantic dependencies distributed for the task.  Systems participating in the open track may use additional resources, such as a syntactic parser.  Test data for our task will draw on Section 21 of the WSJ Corpus, and participants must therefore make sure not to use any tools or resources that encompass knowledge of the gold-standard syntactic or semantic analyses of this section, i.e. that are directly or indirectly trained on or otherwise derived from WSJ Section 21.  Note that this restriction implies that off-the-shelf syntactic parsers may need to be re-trained, as many data-driven parsers for English include this section in their default training data.  To simplify participation in the open track, in mid-January 2014 we made available syntactic analyses from two state-of-the-art parsers (re-trained without use of WSJ Section 21) as optional ‘companion’ data files; please see the data overview page for details.

Multiple Runs

Each participating team will be allowed to submit up to two different runs of their system (for each target format and, where applicable, for both the closed and open tracks).  Separate runs could, for example, reflect different parameter settings or other relatively minor variation in the configuration of the system used to produce the submitted results.  Where genuinely different approaches are pursued within one team, i.e. separate systems that build on different methods, it may be legitimate to split the team, i.e. to register two separate ‘teams’ from one site.  Please contact the organizers (at the email address given under Contact Info below) if you feel your site might want to register as multiple teams.

Final Scoring

The ‘official’ ranking of participating systems, in both the closed and the open tracks, will be determined based on the arithmetic mean of the labeled dependency F1 scores (i.e. the harmonic mean of labeled precision and labeled recall) on the three target representations (DM, PAS, and PCEDT).  Thus, to be considered for the final ranking, a system must submit semantic dependencies for all three target representations.
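To make the computation concrete with purely hypothetical numbers: a system scoring LF = 0.85 on DM, 0.88 on PAS, and 0.78 on PCEDT would receive an official score of (0.85 + 0.88 + 0.78) / 3 ≈ 0.837.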

Software Support

Since late January 2014, the official scorer has been available to participants as part of the emerging SDP toolkit.

Baseline

As a common point of reference, the organizers have prepared a simple baseline system, building on techniques from data-driven syntactic dependency parsing.  In a nutshell, we reduced the SDP graphs to trees.  First, we eliminated re-entrancies in the graph by removing dependencies to nodes with multiple incoming edges, i.e. nodes that are the argument of more than one predicate.  Of these edges, we kept the dependency on the ‘closest’ predicate, as defined in terms of surface distance (with a preference for leftward predicates over rightward ones in case of ties by distance).  Second, we trivially incorporated all singleton nodes into the tree, by attaching nodes with neither incoming nor outgoing edges to the immediately following node, or to a virtual new root node (token ‘0’) in case a sentence-final node was a singleton; we labeled these synthesized dependencies ‘_null_’.  Finally, we integrated all fragments into one tree, by subordinating any remaining node without incoming edges to the root node, using a new dependency type called ‘_root_’.
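As a rough illustration of this reduction (not the organizers' actual implementation), the following sketch assumes each graph is given as a list of (predicate, label, argument) triples over token positions 1..n, with token 0 serving as the virtual root; the function name and data structures are hypothetical.

    def reduce_to_tree(n, edges):
        """Reduce an SDP graph to a dependency tree, roughly following the
        baseline description above.  'edges' is assumed to be a list of
        (predicate, label, argument) triples over token positions 1..n;
        token 0 serves as the virtual root.  Illustrative sketch only."""
        # Step 1: break re-entrancies -- for every node with more than one
        # incoming edge, keep only the dependency on the closest predicate,
        # preferring the leftward predicate in case of a distance tie.
        incoming = {}
        for head, label, dep in edges:
            incoming.setdefault(dep, []).append((head, label))
        head_of = {}
        for dep, candidates in incoming.items():
            head, label = min(candidates, key=lambda hl: (abs(hl[0] - dep), hl[0]))
            head_of[dep] = (head, label)
        # Step 2: attach singletons (no incoming and no outgoing edges) to the
        # immediately following token, or to the virtual root if sentence-final.
        has_outgoing = {head for head, _, _ in edges}
        for node in range(1, n + 1):
            if node not in head_of and node not in has_outgoing:
                head_of[node] = (node + 1 if node < n else 0, "_null_")
        # Step 3: subordinate any remaining node without an incoming edge to
        # the virtual root, so that all fragments form one tree.
        for node in range(1, n + 1):
            if node not in head_of:
                head_of[node] = (0, "_root_")
        return head_of  # maps each token to its (head, label) in the tree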

Following our recommended split of the training data, we then trained the graph-based parser of Bohnet (2010) on Sections 00–19 of the (tree reduction of our) SDP data, and applied the resulting ‘syntactic’ parsing model to Section 20.  The table below indicates parser performance for our three target formats, evaluated both (a) at the level of Labeled and Unlabeled Attachment Scores (LAS and UAS, respectively; as computed by MaltEval), and (b) in terms of our SDP graph metrics, where for the latter the synthesized dependencies and any dependencies on the virtual root node were suppressed (see the sketch after the table).  Note that this baseline makes no attempt at predicting top nodes, but in keeping with our ‘official’ metrics for this task, our figures for LP, LR, and LF include the virtual edges to top nodes.

          LAS    UAS    LP     LR     LF     GF     TP
  DM      83.56  84.71  83.20  40.73  54.68  66.19  94.97
  PAS     84.73  85.52  88.34  35.74  50.89  57.66  97.37
  PCEDT   83.53  91.19  74.82  62.08  67.84  90.70  92.45

To put these results into perspective, the table above also includes two static measures of the ‘tree reductions’ of Section 20 of the SDP data, viz. the labeled F1 for the ‘gold’ trees scored as an SDP graph (GF), and their averaged per-token degree of tree projectivity (again, computed by MaltEval: TP).
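As mentioned above, the synthesized dependencies and all dependencies on the virtual root node were suppressed before graph scoring.  Reusing the head_of representation from the reduction sketch above (again an assumed data structure, not the actual evaluation code), that step might look as follows.

    def tree_to_triples(head_of):
        """Turn a reduced tree back into SDP triples for graph scoring,
        suppressing the synthesized '_null_' and '_root_' dependencies as
        well as any dependency on the virtual root node (token 0)."""
        return {(head, label, dep)
                for dep, (head, label) in head_of.items()
                if head != 0 and label not in ("_null_", "_root_")}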

Acknowledgements

We are grateful to Željko Agić for advice in designing the baseline, and for assistance in configuring MATE Tools and MaltEval.

Contact Info

Organizers

  • Dan Flickinger
  • Jan Hajič
  • Marco Kuhlmann
  • Yusuke Miyao
  • Stephan Oepen
  • Yi Zhang
  • Daniel Zeman

sdp-organizers@emmtee.net

Other Info

Announcements

[22-apr-14] Complete results (system submissions and official scores) as well as the gold-standard test data are now available for public download.

[31-mar-14] We have received submissions from nine teams; a draft summary of evaluation results has been emailed to participating teams.

[25-mar-14] We have posted some additional, task-specific instructions for how to submit system results to the SemEval evaluation; please make sure to follow these requirements carefully.

[22-mar-14] The test data (and corresponding ‘companion’ syntactic analyses, for use in the open track) are now available to registered participants; please see the task mailing list for details.

[08-mar-14] We have released a minor update to the companion archive, adding a handful of missing dependencies and fixing a problem in the file format.

[05-feb-14] We have posted the description of a baseline approach and experimental results on the suggested development sub-set of our training data (Section 20) on the evaluation page; on the same page, we have further specified the mechanics of submitting results to the evaluation.

[17-jan-14] Version 1.0 of the ‘companion’ data for the open track is now available, providing syntactic analyses (in phrase structure and bi-lexical dependency form) as overlays to our training data.  Please see the file README.txt in the companion archive for details.

[13-jan-14] We are releasing an update to the training data today, making a number of minor improvements to the DM and PCEDT graphs; also, we are now providing an on-line interface to search and explore visually the target representations for this task.  For details, please see our task-specific mailing list.

[12-dec-13] Some 750,000 tokens of WSJ text, annotated in our three semantic dependency formats, will become available for download tomorrow.  To obtain the data, prospective participants need to enter into a no-cost evaluation license with the Linguistic Data Consortium (LDC).  For access to the license form, please subscribe to our spam-protected mailing list.  Next, we are working to prepare our syntactic ‘companion’ data (to serve as optional input in the open track), which we expect to release in early January.

[24-nov-13] Version 1.1 of the trial data is now available, adding missing lemma values and streamlining argument labels in the DM format, removing a handful of items that used to have empty graphs in PAS, and generally aligning all items at the level of individual tokens (leaving 189 sentences in our trial data).  This last move means that all three formats now uniformly use single-character Unicode glyphs for quote marks, dashes, ellipses, and apostrophes (rather than multi-character LaTeX-style approximations, as were used in the original ASCII release of the text).  Furthermore, we encourage all interested parties, including prospective participants, to subscribe to our spam-protected mailing list, where we will post updates a little more frequently than on the general task web site.

[07-nov-13] We have clarified the interpretation of the top column (and renamed it from the earlier root) and elaborated the discussion of graph properties in the various formats.  We will continue to extend and revise the documentation on our three types of dependency graphs, but only announce such incremental changes here when they affect the data format.

[04-nov-13] A 198-sentence subset of what will be the training data has been released as trial data, to exemplify the file format and the types of annotations available.  Please do get in touch in case you see anything surprising!

[28-oct-13] We are in the process of finalizing the task description, posting some example dependencies, and making available some trial data.  For the time being, please consider these pages very much a work in progress, i.e. contents and form will be subject to refinement over the next few days.