Evaluation
Evaluation Set-Up
Systems participating in the task will be evaluated on the accuracy with which they produce semantic dependency graphs for previously unseen text, measured against the gold-standard testing data. The key measures for this evaluation will be labeled and unlabeled precision, recall, and F1 with respect to predicted dependencies (predicate–role–argument triples), as well as labeled and unlabeled exact match with respect to complete semantic dependency graphs. In both contexts, identification of the top node(s) of a graph will be counted as the identification of additional, ‘virtual’ triples. Below and in other task-related contexts, we will abbreviate these metrics as (a) labeled precision, recall, and F1: LP, LR, LF; (b) unlabeled precision, recall, and F1: UP, UR, UF; and (c) labeled and unlabeled exact match: LM, UM.
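For concreteness, the Python sketch below illustrates how the labeled metrics could be computed over sets of predicate–role–argument triples, with top nodes counted as additional ‘virtual’ triples. This is not the official scorer; the graph representation (a set of triples plus a set of top-node indices) and the function name are assumptions made purely for illustration.

```python
# Minimal sketch (not the official SDP scorer) of labeled precision, recall,
# and F1 over predicate-role-argument triples, with top nodes scored as
# additional 'virtual' triples.  Each graph is assumed to be a pair
# (triples, tops), where triples is a set of (pred_index, label, arg_index)
# and tops is a set of node indices.

def labeled_scores(gold_graphs, system_graphs):
    correct = gold_total = system_total = 0
    for (gold_triples, gold_tops), (sys_triples, sys_tops) in zip(gold_graphs, system_graphs):
        # Top-node identification enters the counts as 'virtual' triples.
        gold = gold_triples | {(0, 'top', t) for t in gold_tops}
        system = sys_triples | {(0, 'top', t) for t in sys_tops}
        correct += len(gold & system)
        gold_total += len(gold)
        system_total += len(system)
    lp = correct / system_total if system_total else 0.0
    lr = correct / gold_total if gold_total else 0.0
    lf = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, lf
```

The unlabeled counterparts (UP, UR, UF) would be obtained in the same way after dropping the label from each triple.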
More practically speaking, when the task enters the evaluation period in mid-March, we will make available three copies of the test data, one for each target annotation, in the same token-oriented, tab-separated format as the training data, but with only columns (1) id, (2) form, (3) lemma, and (4) pos pre-filled. Participating teams are expected to fill in the remaining columns (i.e. the actual semantic dependency graphs) and submit the resulting files (one per target format) by the end of March. Even though our three annotation formats annotate the exact same text (i.e. are sentence- and token-aligned), we provide three instances of the test data because there may be variation in lemmatization and PoS assignment (unlike PAS and PCEDT, the DM annotations did not build on gold-standard PoS tags from the PTB). For the open track (see below), we will further make available with the test data the same range of ‘companion’ syntactic analyses as is provided for the training data.
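As a rough illustration (not part of the official toolkit), the following snippet reads one of these token-oriented, tab-separated files into per-sentence lists of tokens. It assumes the training-data conventions of ‘#’-prefixed sentence identifiers and blank lines between sentences, and relies only on the four pre-filled columns.

```python
# Illustrative reader for the tab-separated test files described above
# (columns: id, form, lemma, pos).  Details may differ from the official
# reader in the SDP toolkit.

def read_sentences(path):
    sentence = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('#'):        # sentence identifier
                continue
            if not line:                    # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            idx, form, lemma, pos = line.split('\t')[:4]
            sentence.append({'id': int(idx), 'form': form,
                             'lemma': lemma, 'pos': pos})
    if sentence:
        yield sentence
```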
Sub-Tasks
For all three target formats, there will be two sub-tasks: a closed track and an open track. Systems participating in the closed track can only be trained on the gold-standard semantic dependencies distributed for the task. Systems participating in the open track may use additional resources, such as a syntactic parser. Test data for our task will draw on Section 21 of the WSJ Corpus, and participants must therefore make sure not to use any tools or resources that encompass knowledge of the gold-standard syntactic or semantic analyses of this section, i.e. that are directly or indirectly trained on or otherwise derived from WSJ Section 21. Note that this restriction implies that off-the-shelf syntactic parsers may need to be re-trained, as many data-driven parsers for English include this section in their default training data. To simplify participation in the open track, in mid-January 2014 we made available syntactic analyses from two state-of-the-art parsers (re-trained without use of WSJ Section 21) as optional ‘companion’ data files; please see the data overview page for details.
Multiple Runs
Each participating team will be allowed to submit up to two different runs of their system (for each target format and, where applicable, both the closed and open tracks). Separate runs could, for example, reflect different parameter settings or other relatively minor variation in the configuration of the system used to produce the submitted results. Where genuinely different approaches are pursued within one team, i.e. separate systems that build on different methods, it may be legitimate to split the team, i.e. have two separate ‘teams’ from one site. Please contact the organizers (at the email address indicated in the right column) if you feel your site might want to register as multiple teams.
Final Scoring
The ‘official’ ranking of participating systems, in both the closed and the open tracks, will be determined by the arithmetic mean of the labeled dependency F1 scores (i.e. the harmonic mean of labeled precision and labeled recall) on the three target representations (DM, PAS, and PCEDT). Thus, to be considered for the final ranking, a system must submit semantic dependencies for all three target representations.
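In other words, assuming the per-representation labeled F1 scores are already at hand, the ranking criterion amounts to a simple average; the function name below is purely illustrative.

```python
def official_score(lf_dm, lf_pas, lf_pcedt):
    """Arithmetic mean of labeled F1 over the three target representations."""
    return (lf_dm + lf_pas + lf_pcedt) / 3.0
```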
Software Support
Since late January 2014, the official scorer has been available to participants as part of the emerging SDP toolkit.
Baseline
As a common point of reference, the organizers have prepared a simple baseline system, building on techniques from data-driven syntactic dependency parsing. In a nutshell, we reduced the SDP graphs to trees. First, we eliminated re-entrancies in the graph by removing dependencies to nodes with multiple incoming edges, i.e. nodes that are the argument of more than one predicate. Of these edges, we kept only the dependency on the ‘closest’ predicate, as defined in terms of surface distance (with a preference for leftward predicates over rightward ones in case of ties by distance). Second, we trivially incorporated all singleton nodes into the tree by attaching nodes with neither incoming nor outgoing edges to the immediately following node, or to a virtual new root node (token ‘0’) in case a sentence-final node was a singleton; we labeled these synthesized dependencies ‘_null_’. Finally, we integrated all fragments into one tree by subordinating any remaining node without incoming edges to the root node, using a new dependency type called ‘_root_’.
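The following sketch mirrors the reduction just described; it is an illustrative reconstruction rather than the organizers’ actual code, and edge cases (e.g. cycles that survive the first step) are not handled.

```python
# Illustrative sketch of the tree reduction described above.  A graph is a
# list of edges (pred, arg, label) over token indices 1..n; token 0 is the
# virtual root.

def reduce_to_tree(edges, n):
    # 1. Break re-entrancies: for each node with several incoming edges,
    #    keep only the edge from the closest predicate, preferring leftward
    #    predicates over rightward ones in case of ties by distance.
    incoming = {}
    for pred, arg, label in edges:
        incoming.setdefault(arg, []).append((pred, label))
    kept = []
    for arg, preds in incoming.items():
        pred, label = min(preds, key=lambda p: (abs(p[0] - arg), p[0] > arg))
        kept.append((pred, arg, label))

    attached = {arg for _, arg, _ in kept}
    has_outgoing = {pred for pred, _, _ in kept}

    # 2. Attach singleton nodes (no incoming and no outgoing edges) to the
    #    immediately following token, or to the virtual root (token 0) if
    #    sentence-final, using the synthesized label '_null_'.
    for i in range(1, n + 1):
        if i not in attached and i not in has_outgoing:
            head = i + 1 if i < n else 0
            kept.append((head, i, '_null_'))
            attached.add(i)

    # 3. Subordinate any remaining node without an incoming edge to the
    #    virtual root with the synthesized label '_root_'.
    for i in range(1, n + 1):
        if i not in attached:
            kept.append((0, i, '_root_'))
    return kept
```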
Following our recommended split of the training data, we then trained the graph-based parser of Bohnet (2010) on Sections 00–19 of the (tree reduction of our) SDP data, and applied the resulting ‘syntactic’ parsing model to Section 20. The table below indicates parser performance for our three target formats, evaluated both (a) at the level of Labeled and Unlabeled Attachment Scores (LAS and UAS, respectively; as computed by MaltEval), and (b) in terms of our SDP graph metrics, where, for the latter, the synthesized dependencies and any dependencies on the virtual root node were suppressed. Note that this baseline makes no attempt at predicting top nodes, but in keeping with our ‘official’ metrics for this task, our figures for LP, LR, and LF include the virtual edges to top nodes.
Format | LAS | UAS | LP | LR | LF | GF | TP |
---|---|---|---|---|---|---|---|
DM | 83.56 | 84.71 | 83.20 | 40.73 | 54.68 | 66.19 | 94.97 |
PAS | 84.73 | 85.52 | 88.34 | 35.74 | 50.89 | 57.66 | 97.37 |
PCEDT | 83.53 | 91.19 | 74.82 | 62.08 | 67.84 | 90.70 | 92.45 |
To put these results into perspective, the table above also includes two static measures of the ‘tree reductions’ of Section 20 of the SDP data, viz. the labeled F1 of the ‘gold’ trees when scored as SDP graphs (GF), and their average per-token degree of tree projectivity (TP; again computed by MaltEval).
Acknowledgements
We are grateful to Željko Agić for advice in designing the baseline, and for assistance in configuring MATE Tools and MaltEval.