Data and Tools

Overview

For this task, there will be three data sets: training, development, and test.  On June 1, 2014, a sample (of some 190 dependency graphs) from the training data was published as trial data, demonstrating key characteristics of the task.  Since August 5, some 750,000 tokens of annotated text have been available as training data; please subscribe to the task mailing list for access information.  In November 2014, this data was re-released with minor improvements and complemented with comparable volumes of Chinese and Czech training data.  Participants are free to use the training and development data in system development as they see fit; our splitting off part of the data as a development set is no more than a suggested best practice.  In particular, it is legitimate to train the final system, for submission to evaluation once the test data is released, on both the training and development parts.

Data Format

All data provided for this task will be in a format similar to the one used at the 2009 Shared Task of the Conference on Computational Natural Language Learning (CoNLL), though with some simplifications.  In a nutshell, our files are pre-tokenized, with one token per line.  All sentences are terminated by an empty line (i.e. two consecutive newlines, including following the last sentence in each file).  Each line comprises at least seven tab-separated fields, i.e. annotations on tokens.

For ease of reference, each sentence is prefixed by a line that is not in tab-separated form and starts with the character # (ASCII number sign; U+0023), followed by a unique eight-digit identifier.  Our sentence identifiers use the scheme 2SSDDIII, with a constant leading 2, two-digit section code, two-digit document code (within each section), and three-digit item number (within each document).  For example, identifier 20200002 denotes the second sentence in the first file of Section 02, the classic Ms. Haag plays Elianti.
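As a minimal illustration of this scheme, the following Python sketch splits an identifier into its components (the function is a hypothetical helper of our own, not part of any task tooling):

  def decode_sentence_id(identifier):
      """Split a 2SSDDIII sentence identifier; for illustration only."""
      assert len(identifier) == 8 and identifier[0] == "2"
      section = identifier[1:3]    # two-digit section code
      document = identifier[3:5]   # two-digit document code within the section
      item = int(identifier[5:8])  # three-digit item number within the document
      return section, document, item

  print(decode_sentence_id("20200002"))  # -> ('02', '00', 2)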

With one exception, our fields (i.e. columns in the tab-separated matrix) are a subset of the CoNLL 2009 inventory: (1) id, (2) form, (3) lemma, and (4) pos characterize the current token, with token identifiers starting from 1 within each sentence.  Besides the lemma and part-of-speech information, there is no explicit analysis of syntax in the closed track of our task.  Across the three annotation formats in the task, fields (1) and (2) are aligned and uniform, i.e. all formats annotate exactly the same sentences.  On the other hand, fields (3) and (4) are format-specific, i.e. there are different conventions for lemmatization, and part-of-speech assignments can vary (but all formats use the same PTB inventory of PoS tags).

The bi-lexical semantic dependency graph over tokens is represented by three or more columns, starting with the obligatory fields (5) top, (6) pred, and (7) frame.  The first two of these are binary-valued, i.e. possible values are ‘+’ (ASCII plus; U+002B) and ‘-’ (ASCII minus; U+002D).  A positive value in the top column indicates that the node corresponding to this token is either a (semantic) head or a (structural) root of the graph; the exact linguistic interpretation of this property differs across our three representations, but note that top nodes can have incoming dependency edges.  The pred column is a simplification of the corresponding field in earlier CoNLL tasks, indicating whether or not this token represents a predicate, i.e. a node with outgoing dependency edges.  Finally, the frame field provides optional frame (or sense) information, for example the distinction between causative and inchoative predicates (like increase).  The three target representations differ in their use and inventory of frame distinctions: DM provides this information for all content words, PAS for none, and PSD only for verbs (using sense rather than frame identifiers).

With these minor departures from the CoNLL tradition, our format can represent general, directed graphs, with designated top nodes and optional predicate senses.  For example, there can be singleton nodes not connected to other parts of the graph (representing semantically vacuous tokens).  In principle, there can be multiple top nodes, or a non-predicate top node, although in our actual task data we anticipate that there will typically be exactly one top.

To designate predicate–argument relations, there are as many additional columns as there are predicates in the graph (i.e. the number of tokens marked ‘+’ in the pred column); we will call these additional columns (8) arg1, (9) arg2, etc.  These columns contain argument roles relative to the i-th predicate, i.e. a non-empty value in column arg1 indicates that the current token is an argument of the (linearly) first predicate in the sentence.  In this format, graph reentrancies will lead to one token receiving argument roles for multiple predicates (i.e. non-empty argi values in the same row).  By convention, empty values are represented as ‘_’ (ASCII underscore; U+005F), which indicates that there is no argument relation between the current token and the i-th predicate represented by this column.  Thus, all tokens of the same sentence must always have all argument columns filled in, even on non-predicate words; in other words, all lines making up one block of tokens will have the same number n of fields, but n can differ across sentences, depending on the number of predicates in the graph.

Following is an example for the sentence Ms. Haag plays Elianti (using the DM target representation); the header row is included for reference only and is not part of the actual file format.

id  form     lemma    pos  top  pred  frame    arg1      arg2

#20200002
1   Ms.      Ms.      NNP  -    +     n:x      _         _
2   Haag     Haag     NNP  -    -     named:x  compound  ARG1
3   plays    play     VBZ  +    +     v:e-i-p  _         _
4   Elianti  Elianti  NNP  -    -     named:x  _         ARG2
5   .        .        .    -    -     _        _         _

In the training and development data, all columns are provided.  In the test data, only columns (1) to (4) are pre-filled; participating systems will be asked to add columns (5) and upwards and submit their results for scoring.
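Putting these conventions together, a minimal Python reader for this format might look as follows; this is our own sketch, assuming well-formed input, and not the reference implementation from the (Java) SDP toolkit described below:

  def read_graphs(stream):
      """Yield one parsed sentence per empty-line-terminated block."""
      block = []
      for line in stream:
          line = line.rstrip("\n")
          if not line:
              if block:
                  yield parse_block(block)
                  block = []
          else:
              block.append(line)
      if block:  # tolerate a missing final empty line
          yield parse_block(block)

  def parse_block(lines):
      sentence_id = lines[0].lstrip("# ")
      rows = [line.split("\t") for line in lines[1:]]
      # Predicates are the tokens marked '+' in the pred column (field 6),
      # in linear order; the i-th argument column refers to the i-th of these.
      predicates = [int(row[0]) for row in rows if row[5] == "+"]
      tokens, tops, edges = [], [], []
      for row in rows:
          ident = int(row[0])
          tokens.append((ident, row[1], row[2], row[3], row[6]))  # form, lemma, pos, frame
          if row[4] == "+":                   # top column (field 5)
              tops.append(ident)
          for i, role in enumerate(row[7:]):  # argument columns (fields 8+)
              if role != "_":
                  edges.append((predicates[i], ident, role))
      return sentence_id, tokens, tops, edges

On the example above, this yields tops [3] and the edges (1, 2, 'compound'), (3, 2, 'ARG1'), and (3, 4, 'ARG2').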

Companion Data

In the open and gold tracks of the task (see the evaluation rules for details), we expect participants to draw on additional tools or resources beyond the training data provided, notably syntactic parsing.  To aid participation in the open track, and for increased comparability of results, we will make available a set of ‘companion’ data files, providing syntactic analyses from state-of-the-art data-driven parsers.  Once the task enters its evaluation phase, the same range and format of syntactic analyses will be provided as companion files for the test data.

We are still discussing exactly how many such syntactic views on our data to prepare, but we plan on providing at least one dependency and one phrase-structure view, i.e. (a) analyses from the parser of Bohnet & Nivre (2012), with bi-lexical syntactic dependencies in the so-called Stanford Basic scheme (de Marneffe et al., 2006), and (b) PTB-style constituent trees as produced, for example, by the parsers of Charniak & Johnson (2005) and Petrov & Klein (2007).  Our companion data will be distributed in a token-oriented, tab-separated form (very similar to formats used at previous CoNLL Shared Tasks on data-driven dependency parsing and semantic role labeling).  It will be aligned at the sentence and token levels to our official training and test data files and can thus be viewed as augmenting these files with additional columns of explicit syntactic information.
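Since the companion files are aligned at the sentence and token levels, attaching their syntactic columns to the official data reduces to a per-line merge.  The following sketch assumes that both files carry matching token identifiers in their first column and the word form in their second; the helper name and the exact companion column layout are ours, for illustration:

  def merge_token_lines(task_line, companion_line):
      """Append a companion line's extra columns to the matching task line.

      Assumes both lines describe the same token; the number and meaning
      of the companion columns is illustrative, not fixed by the task.
      """
      task_fields = task_line.split("\t")
      companion_fields = companion_line.split("\t")
      assert task_fields[0] == companion_fields[0], "token ids must align"
      return "\t".join(task_fields + companion_fields[2:])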

Licensing and Distribution

Large parts of the data prepared for this task are derivative of the PTB and other resources distributed by the Linguistic Data Consortium (LDC).  We have established an agreement with the LDC that makes it possible for all task participants to obtain our training, development, and test data free of charge (for use in connection with SemEval 2015), whether they are LDC members or not.  Participants will need to enter into a license agreement with the LDC (which will be provided to registered participants and must be signed and submitted electronically to the task organizers); they will then be able to download the data.  Please subscribe to the mailing list for this task for further information on obtaining the task data.

Trial Data

We have prepared the first 20 documents from Section 00 of the PTB WSJ Corpus, instantiating the file format described above and the three types of semantic dependencies used in this task.  This trial data has been available for public download since Sunday, June 1, 2014.

Software Support

We are currently putting together a supporting SDP toolkit, essentially a reference implementation (in Java) of reading and writing the task file format, some quantitative analysis of semantic dependency graphs, and the official task scorer.
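For orientation, the core of edge-based evaluation can be sketched as below: labeled precision, recall, and F1 over (predicate, argument, role) triples.  This is a simplified illustration of our own; the scorer shipped with the toolkit remains the authoritative implementation (and also covers further metrics such as ‘complete predicates’ and ‘semantic frames’):

  def labeled_scores(gold_edges, system_edges):
      """Labeled precision, recall, and F1 over dependency triples;
      a simplified sketch, not the official SDP scorer."""
      gold, system = set(gold_edges), set(system_edges)
      correct = len(gold & system)
      precision = correct / len(system) if system else 0.0
      recall = correct / len(gold) if gold else 0.0
      f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
      return precision, recall, f1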

Contact Info

Organizers

  • Dan Flickinger
  • Jan Hajič
  • Angelina Ivanova
  • Marco Kuhlmann
  • Yusuke Miyao
  • Stephan Oepen
  • Daniel Zeman

sdp-organizers@emmtee.net

Other Info

Announcements

[06-feb-15] Final evaluation results for the task are now available; we are grateful to all (six) participating teams.

[08-jan-15] The evaluation period is nearing completion; we have purged inactive subscribers from the task-specific mailing list and sent out important information on the submission of system outputs for evaluation to the list; if you have not received this email but are actually preparing a system submission, please contact the organizers immediately.

[17-dec-14] We are about to enter the evaluation phase, but recall that the closing date has been extended to Thursday, January 15, 2015. We have sent important instructions on how to participate in the evaluation to the task-specific mailing list; if you plan on submitting system results to this task but have not seen these instructions, please contact the organizers immediately.

[22-nov-14] English ‘companion’ syntactic analyses in various dependency formats are now available, for use in the open and gold tracks.

[20-nov-14] We have completed the production of cross-lingual training data: some 31,000 PAS graphs for Chinese and some 42,000 PSD graphs for Czech. At the same time, we have prepared an update of the English training data, with somewhat better coverage and a few improved analyses in DM, as well as with additional re-entrancies (corresponding to grammatical control relations) in PSD. The data is available for download as Version 1.1 from the LDC. Owing to the delayed availability of the cross-lingual data, we have moved the closing date for the evaluation period to mid-January 2015.

[14-nov-14] An update to the SDP toolkit (now hosted at GitHub) is available, implementing the additional evaluation metrics ‘complete predicates’ and ‘semantic frames’.

[05-aug-14] We are (finally) ready to officially ‘launch’ SDP 2015: the training data is now available for distribution through the LDC; please register for SemEval 2015 Task 18, and within a day (or so) we will be in touch about data licensing and access information.

[03-aug-14] Regrettably, we are running late in making available the training data and technical details of the 2015 task setup; please watch this page for updates over the next couple of days!

[01-jun-14] We have started to populate the task web pages, including some speculative information on extensions (compared to the 2014 variant of the task) that we are still discussing. A first sample of trial data is available for public download.