Evaluation

Evaluation Set-Up

Systems participating in the task will be evaluated based on the accuracy with which they can produce semantic dependency graphs for previously unseen text, measured relative to the gold-standard testing data.  The key measures for this evaluation will be labeled and unlabeled precision and recall with respect to predicted dependencies (predicate–role–argument triples) and labeled and unlabeled exact match with respect to complete semantic dependency graphs.  In both contexts, identification of the top node(s) of a graph will be considered as the identification of additional, ‘virtual’ triples.  Below and in other task-related contexts, we will abbreviate these metrics as (a) labeled precision, recall, and F1: LP, LR, LF; (b) unlabeled precision, recall, and F1: UP, UR, UF; and (c) labeled and unlabeled exact match: LM, UM.
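
To make these definitions concrete, the following minimal sketch (in Python) shows one way the per-dependency and exact-match metrics could be computed over aligned gold and predicted graphs; the triple-set representation of sentences, the ‘TOP’ label, and all function names are illustrative assumptions of ours, not the interface of the official scorer.

    # Minimal metric sketch; each sentence is assumed to be given as a pair of
    # (set of (predicate, role, argument) triples, set of top-node positions).

    def virtual_top_triples(top_nodes):
        """Treat identified top nodes as additional 'virtual' triples from token 0."""
        return {(0, "TOP", node) for node in top_nodes}

    def precision_recall_f1(correct, predicted, gold):
        p = correct / predicted if predicted else 0.0
        r = correct / gold if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f

    def dependency_scores(gold_sentences, predicted_sentences):
        correct_l = pred_l = gold_l = correct_u = pred_u = gold_u = 0
        match_l = match_u = 0
        for (gold, gold_tops), (pred, pred_tops) in zip(gold_sentences, predicted_sentences):
            g = set(gold) | virtual_top_triples(gold_tops)
            p = set(pred) | virtual_top_triples(pred_tops)
            gu = {(head, dep) for head, _, dep in g}   # unlabeled variants
            pu = {(head, dep) for head, _, dep in p}
            correct_l += len(g & p)
            pred_l += len(p)
            gold_l += len(g)
            correct_u += len(gu & pu)
            pred_u += len(pu)
            gold_u += len(gu)
            match_l += int(g == p)                     # labeled exact match
            match_u += int(gu == pu)                   # unlabeled exact match
        n = len(gold_sentences)
        LP, LR, LF = precision_recall_f1(correct_l, pred_l, gold_l)
        UP, UR, UF = precision_recall_f1(correct_u, pred_u, gold_u)
        return dict(LP=LP, LR=LR, LF=LF, UP=UP, UR=UR, UF=UF,
                    LM=match_l / n, UM=match_u / n)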

In addition to these metrics, which have already been implemented in SDP 2014 (the first incarnation of this task), we will define two further metrics that aim to capture fragments of semantics that are ‘larger’ than individual dependencies but ‘smaller’ than the semantic dependency graph for the complete sentence, viz. what we call (a) complete predications and (b) semantic frames.  In the SDP 2015 context, a complete predication comprises the set of all core arguments of one predicate, which for the DM and PAS target representations corresponds to all outgoing dependency edges and, for the PSD target representation, to only those outgoing dependencies marked by an ‘-arg’ suffix on the edge label.

Pushing the units of evaluation one step further towards units of interpretation, a semantic frame combines a complete predication with the sense (or frame) identifier of its predicate.  Both complete-predication and semantic-frame evaluation will be restricted to predicates corresponding to major parts of speech (verbs, probably also nouns and adjectives, and possibly specialized phenomena like possessives), and semantic frames will be further restricted to those target representations and lexical categories for which sense information is available in our data (DM and PSD, with PSD senses limited to verbs).  As with the per-dependency evaluation, we will score precision, recall, and F1, which we abbreviate as PP, PR, and PF for complete predications, and FP, FR, and FF for semantic frames.
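
As an illustration of these larger units, the following sketch extracts complete predications and semantic frames from a single sentence; the triple-set representation, the sense mapping, and the omission of the part-of-speech restrictions are simplifying assumptions of this sketch rather than the official implementation.

    # 'triples' is a set of (predicate, role, argument) dependencies for one sentence;
    # 'senses' maps predicate positions to sense (frame) identifiers where available.

    def complete_predications(triples, representation):
        """Map each predicate to the set of its core-argument dependencies."""
        predications = {}
        for head, role, dependent in triples:
            # For PSD, only '-arg' suffixed edges count as core arguments;
            # for DM and PAS, all outgoing edges do.
            if representation == "PSD" and not role.endswith("-arg"):
                continue
            predications.setdefault(head, set()).add((role, dependent))
        return predications

    def semantic_frames(triples, senses, representation):
        """Pair each complete predication with the sense identifier of its predicate."""
        return {head: (senses.get(head), frozenset(arguments))
                for head, arguments
                in complete_predications(triples, representation).items()}

Precision, recall, and F1 over these units would then be computed by comparing the predicted and gold sets of predications (or frames), analogously to the per-dependency scores above.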

More practically speaking, as the task enters the evaluation period in mid-December, we will make available three copies of the test data, one for each target annotation, in the same token-oriented, tab-separated format as the training data, but with only columns (1) id, (2) form, (3) lemma, and (4) pos pre-filled.  Participating teams are expected to fill in the remaining columns (i.e. actual semantic dependency graphs) and submit the resulting files (one per target format) by December 20, 2014.  Even though our three target representations annotate the exact same text (i.e. are sentence- and token-aligned), we provide three instances of the test data, as there may be variation in lemmatization and PoS assignment (unlike PAS and PSD, the DM annotations did not build on gold-standard PoS tags from the PTB).  For the open and gold tracks (see below), we will further make available with the test data the same range of ‘companion’ syntactic analyses as are provided for the training data.
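
Purely as an illustration, the following sketch reads such a pre-filled file into per-sentence lists of token records; the handling of comment lines and of any columns beyond the four pre-filled fields are assumptions about the release format, so the data documentation remains authoritative.

    # Assumed conventions: one token per line, tab-separated columns, blank lines
    # between sentences, and lines starting with '#' carrying sentence identifiers.

    def read_sdp(path):
        sentences, tokens = [], []
        with open(path, encoding="utf-8") as stream:
            for line in stream:
                line = line.rstrip("\n")
                if not line:                 # blank line ends the current sentence
                    if tokens:
                        sentences.append(tokens)
                        tokens = []
                elif line.startswith("#"):   # sentence identifier (or comment)
                    continue
                else:
                    columns = line.split("\t")
                    tokens.append({"id": columns[0], "form": columns[1],
                                   "lemma": columns[2], "pos": columns[3],
                                   "rest": columns[4:]})  # to be filled by systems
            if tokens:
                sentences.append(tokens)
        return sentences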

Sub-Tasks

For all three target formats, there will be three sub-tasks: a closed track, an open track, and a gold track.  Systems participating in the closed track can only be trained on the gold-standard semantic dependencies distributed for the task.  Systems participating in the open track may use additional resources, such as a syntactic parser.  Test data for our task will draw on Section 21 of the WSJ Corpus, and therefore participants must make sure not to use any tools or resources that encompass knowledge of the gold-standard syntactic or semantic analyses of this section, i.e. that are directly or indirectly trained on, or otherwise derived from, WSJ Section 21.  Note that this restriction implies that off-the-shelf syntactic parsers may need to be re-trained, as many data-driven parsers for English include this section in their default training data.  To simplify participation in the open track, in mid-August 2014 we will make available syntactic analyses from several state-of-the-art parsers (re-trained without use of WSJ Section 21) as optional ‘companion’ data files; please see the data overview page for details.  Finally, the goal of the gold track is to more directly investigate the contribution of syntactic structure to the semantic dependency parsing problem.  For submissions to this track, we will make available (by the end of August 2014) gold-standard syntactic analyses in a variety of formats, including those used natively by the annotation initiatives from which our semantic dependency graphs derive, viz. HPSG derivation trees reduced to bi-lexical dependencies (for DM and PAS) and Prague analytical trees (for PSD).

Multiple Runs

Each participating team will be allowed to submit up to two different runs of their system (for each target format and, where applicable, both the closed and open tracks).  Separate runs could, for example, reflect different parameter settings or other relatively minor variation in the configuration of the system used to produce the submitted results.  Where genuinely different approaches are pursued within one team, i.e. separate systems that build on different methods, it may be legitimate to split the team, i.e. have two separate ‘teams’ from one site.  Please contact the organizers (at the email address given under Contact Info below) if you feel your site might want to register as multiple teams.

Final Scoring

The ‘official’ ranking of participating systems, in both the closed and the open tracks, will be determined based on the arithmetic mean of the labeled dependency F1 scores (i.e. the harmonic mean of labeled precision and labeled recall) on the three target representations (DM, PAS, and PSD).  Thus, to be considered for the final ranking, a system must submit semantic dependencies for all three target representations.
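
For concreteness, the following tiny sketch computes this ranking criterion, assuming one labeled F1 value per target representation; the numbers in the usage comment are purely illustrative.

    def overall_score(labeled_f1):
        """labeled_f1: dict with one LF score per target representation."""
        required = ("DM", "PAS", "PSD")
        if not all(fmt in labeled_f1 for fmt in required):
            raise ValueError("submissions must cover all three target representations")
        return sum(labeled_f1[fmt] for fmt in required) / len(required)

    # e.g. overall_score({"DM": 0.89, "PAS": 0.91, "PSD": 0.76}) -> 0.853...  (made-up values)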

Software Support

Towards the end of August 2014, we will make available to participants the official scorer as part of the emerging SDP toolkit.

Baseline

As a common point of reference, the organizers have prepared a simple baseline system, building on techniques from data-driven syntactic dependency parsing.  In a nutshell, we reduced the SDP graphs to trees.  First, we eliminated re-entrancies in the graph by removing dependencies to nodes with multiple incoming edges, i.e. those that are the argument of more than one predicate.  Of these edges, we kept only the dependency on the ‘closest’ predicate, as defined in terms of surface distance (with a preference for leftward predicates over rightward ones, in case of ties by distance).  Second, we trivially incorporated all singleton nodes into the tree, by attaching nodes with neither incoming nor outgoing edges to the immediately following node, or to a virtual new root node (token ‘0’) in case a sentence-final node was a singleton; we labelled these synthesized dependencies ‘_null_’.  Finally, we integrated all fragments into one tree, by subordinating any remaining node without incoming edges to the root node, using a new dependency type called ‘_root_’.
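
The following sketch outlines this tree reduction under simplifying assumptions: a sentence is given as its token count plus a set of (head, label, dependent) triples over 1-based token positions, and cycle handling is omitted.  It is meant to illustrate the procedure, not to reproduce the exact baseline code.

    def reduce_to_tree(n_tokens, triples):
        incoming = {position: [] for position in range(1, n_tokens + 1)}
        for head, label, dependent in triples:
            incoming[dependent].append((head, label))
        has_outgoing = {head for head, _, _ in triples}
        tree = {}  # dependent -> (head, label); token 0 is the virtual root
        for dependent, edges in incoming.items():
            if len(edges) == 1:
                tree[dependent] = edges[0]
            elif len(edges) > 1:
                # Re-entrancy: keep the dependency on the closest predicate,
                # preferring leftward predicates in case of ties by distance.
                tree[dependent] = min(
                    edges, key=lambda e: (abs(e[0] - dependent), e[0] > dependent))
        for position in range(1, n_tokens + 1):
            if position in tree:
                continue
            if position not in has_outgoing:
                # Singleton: attach to the following token, or to the virtual
                # root if the singleton is sentence-final.
                head = position + 1 if position < n_tokens else 0
                tree[position] = (head, "_null_")
            else:
                # Remaining fragment root: subordinate to the virtual root.
                tree[position] = (0, "_root_")
        return tree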

Following our recommended split of the training data, we then trained the graph-based parser of Bohnet (2010) on Sections 00–19 of the (tree reduction of our) SDP data, and applied the resulting ‘syntactic’ parsing model to Section 20.  The table below indicates parser performance for our three target formats, evaluated both (a) at the level of Labeled and Unlabeled Attachment Scores (LAS and UAS, respectively; as computed by MaltEval), and (b) in terms of our SDP graph metrics, where for the latter the synthesized dependencies and any dependencies on the virtual root node were suppressed.  Note that this baseline makes no attempt at predicting top nodes, but in keeping with our ‘official’ metrics for this task, our figures for LP, LR, and LF include the virtual edges to top nodes.

         LAS    UAS    LP     LR     LF     GF     TP
  DM     83.56  84.71  83.20  40.73  54.68  66.19  94.97
  PAS    84.73  85.52  88.34  35.74  50.89  57.66  97.37
  PSD    83.53  91.19  74.82  62.08  67.84  90.70  92.45

To put these results into perspective, the table above also includes two static measures of the ‘tree reductions’ of Section 20 of the SDP data, viz. the labeled F1 for the ‘gold’ trees scored as an SDP graph (GF), and their average per-token degree of tree projectivity (TP, again computed by MaltEval).
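
For completeness, a small sketch of the reverse mapping used when scoring parser output at the graph level, under the same assumptions as the reduction sketch above: the synthesized ‘_null_’ and ‘_root_’ dependencies and any remaining edges on the virtual root node are simply dropped.

    def tree_to_graph(tree):
        """tree: dict mapping dependent -> (head, label), as produced above."""
        return {(head, label, dependent)
                for dependent, (head, label) in tree.items()
                if head != 0 and label not in ("_null_", "_root_")}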

Acknowledgements

We are grateful to Zeljko Agic and Bernd Bohnet for advice in designing the baseline and ‘companion’ data, and for assistance in configuring MATE Tools and MaltEval, as well as to Milen Kouylekov for the development and support of the interactive search interface.

Contact Info

Organizers

  • Dan Flickinger
  • Jan Hajič
  • Angelina Ivanova
  • Marco Kuhlmann
  • Yusuke Miyao
  • Stephan Oepen
  • Daniel Zeman

sdp-organizers@emmtee.net

Other Info

Announcements

[06-feb-15] Final evaluation results for the task are now available; we are grateful to all (six) participating teams.

[08-jan-15] The evaluation period is nearing completion; we have purged inactive subscribers from the task-specific mailing list and sent out important information on the submission of system outputs for evaluation to the list; if you have not received this email but are actually preparing a system submission, please contact the organizers immediately.

[17-dec-14] We are about to enter the evaluation phase, but recall that the closing date has been extended to Thursday, January 15, 2015. We have sent important instructions on how to participate in the evaluation to the task-specific mailing list; if you plan on submitting system results to this task but have not seen these instructions, please contact the organizers immediately.

[22-nov-14] English ‘companion’ syntactic analyses in various dependency formats are now available, for use in the open and gold tracks.

[20-nov-14] We have completed the production of cross-lingual training data: some 31,000 PAS graphs for Chinese and some 42,000 PSD graphs for Czech. At the same time, we have prepared an update of the English training data, with somewhat better coverage and a few improved analyses in DM, as well as with additional re-entrancies (corresponding to grammatical control relations) in PSD. The data is available for download as Version 1.1 from the LDC. Owing to the delayed availability of the cross-lingual data, we have moved the closing date for the evaluation period to mid-January 2015.

[14-nov-14] An update to the SDP toolkit (now hosted at GitHub) is available, implementing the additional evaluation metrics ‘complete predications’ and ‘semantic frames’.

[05-aug-14] We are (finally) ready to officially ‘launch’ SDP 2015: the training data is now available for distribution through the LDC; please register for SemEval 2015 Task 18, and within a day (or so) we will be in touch about data licensing and access information.

[03-aug-14] Regrettably, we are running late in making available the training data and technical details of the 2015 task setup; please watch this page for updates over the next couple of days!

[01-jun-14] We have started to populate the task web pages, including some speculative information on extensions (compared to the 2014 variant of the task) that we are still discussing. A first sample of trial data is available for public download.