Data and Tools

Overview

For this task, there will be three data sets: training, development, and test.  On June 1, 2014, a sample (of some 190 dependency graphs) from the training data was published as trial data, demonstrating key characteristics of the task.  Since August 5, some 750,000 tokens of annotated text have been available as training data; please subscribe to the task mailing list for access information.  In November 2014, this data was re-released with minor improvements and complemented with comparable volumes of Chinese and Czech training data.  Participants are free to use the training and development data in system development as they see fit; our splitting off part of the data as a development set is no more than a suggested best practice.  In particular, it is legitimate to train the final system, for submission to evaluation once the test data is released, on both the training and development parts.

Data Format

All data provided for this task will be in a format similar to the one used at the 2009 Shared Task of the Conference on Computational Natural Language Learning (CoNLL), though with some simplifications.  In a nutshell, our files are pre-tokenized, with one token per line.  All sentences are terminated by an empty line (i.e. two consecutive newlines, including following the last sentence in each file).  Each line comprises at least seven tab-separated fields, i.e. annotations on tokens.

For ease of reference, each sentence is prefixed by a line that is not in tab-separated form and starts with the character # (ASCII number sign; U+0023), followed by a unique eight-digit identifier.  Our sentence identifiers use the scheme 2SSDDIII, with a constant leading 2, two-digit section code, two-digit document code (within each section), and three-digit item number (within each document).  For example, identifier 20200002 denotes the second sentence in the first file of Section 02, the classic Ms. Haag plays Elianti.
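As a minimal illustration of this scheme, the following Python sketch splits an identifier into its components (the function is a hypothetical helper of our own, not part of any task tooling):

  def decode_sentence_id(identifier):
      """Split a 2SSDDIII sentence identifier; for illustration only."""
      assert len(identifier) == 8 and identifier[0] == "2"
      section = identifier[1:3]    # two-digit section code
      document = identifier[3:5]   # two-digit document code within the section
      item = int(identifier[5:8])  # three-digit item number within the document
      return section, document, item

  print(decode_sentence_id("20200002"))  # -> ('02', '00', 2)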

With one exception, our fields (i.e. columns in the tab-separated matrix) are a subset of the CoNLL 2009 inventory: (1) id, (2) form, (3) lemma, and (4) pos characterize the current token, with token identifiers starting from 1 within each sentence.  Besides the lemma and part-of-speech information, there is no explicit analysis of syntax in the closed track of our task.  Across the three annotation formats in the task, fields (1) and (2) are aligned and uniform, i.e. all formats annotate exactly the same sentences.  On the other hand, fields (3) and (4) are format-specific, i.e. there are different conventions for lemmatization, and part-of-speech assignments can vary (but all formats use the same PTB inventory of PoS tags).

The bi-lexical semantic dependency graph over tokens is represented by three or more columns, starting with the obligatory fields (5) top, (6) pred, and (7) frame.  The first two of these are binary-valued, i.e. possible values are ‘+’ (ASCII plus; U+002B) and ‘-’ (ASCII minus; U+002D).  A positive value in the top column indicates that the node corresponding to this token is either a (semantic) head or a (structural) root of the graph; the exact linguistic interpretation of this property differs across our three representations, but note that top nodes can have incoming dependency edges.  The pred column is a simplification of the corresponding field in earlier CoNLL tasks, indicating whether or not this token represents a predicate, i.e. a node with outgoing dependency edges.  Finally, the frame field provides optional frame (or sense) information, for example the distinction between causative and inchoative predicates (like increase).  The three target representations differ in their use and inventory of frame distinctions: DM provides this information for all content words, PAS for none, and PSD only for verbs (using sense rather than frame identifiers).

With these minor departures from the CoNLL tradition, our format can represent general, directed graphs, with designated top nodes and optional predicate senses.  For example, there can be singleton nodes not connected to other parts of the graph (representing semantically vacuous tokens).  In principle, there can be multiple top nodes, or a non-predicate top node, although in our actual task data we anticipate that there will typically be exactly one top.

To designate predicate–argument relations, there are as many additional columns as there are predicates in the graph (i.e. the number of tokens marked ‘+’ in the pred column); we will call these additional columns (8) arg1, (9) arg2, etc.  These columns contain argument roles relative to the i-th predicate, i.e. a non-empty value in column arg1 indicates that the current token is an argument of the (linearly) first predicate in the sentence.  In this format, graph reentrancies will lead to one token receiving argument roles for multiple predicates (i.e. non-empty argi values in the same row).  By convention, empty values are represented as ‘_’ (ASCII underscore; U+005F), which indicates that there is no argument relation between the current token and the i-th predicate represented by this column.  Thus, all tokens of the same sentence must always have all argument columns filled in, even on non-predicate words; in other words, all lines making up one block of tokens will have the same number n of fields, but n can differ across sentences, depending on the number of predicates in the graph.

Following is an example for the sentence Ms. Haag plays Elianti (using the DM target representation); the header row is included for reference only and is not part of the actual file format.

id  form     lemma    pos  top  pred  frame    arg1      arg2

#20200002
1   Ms.      Ms.      NNP  -    +     n:x      _         _
2   Haag     Haag     NNP  -    -     named:x  compound  ARG1
3   plays    play     VBZ  +    +     v:e-i-p  _         _
4   Elianti  Elianti  NNP  -    -     named:x  _         ARG2
5   .        .        .    -    -     _        _         _

In the training and development data, all columns are provided.  In the test data, only columns (1) to (4) are pre-filled; participating systems will be asked to add columns (5) and upwards and submit their results for scoring.
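Putting these conventions together, a minimal Python reader for this format might look as follows; this is our own sketch, assuming well-formed input, and not the reference implementation from the (Java) SDP toolkit described below:

  def read_graphs(stream):
      """Yield one parsed sentence per empty-line-terminated block."""
      block = []
      for line in stream:
          line = line.rstrip("\n")
          if not line:
              if block:
                  yield parse_block(block)
                  block = []
          else:
              block.append(line)
      if block:  # tolerate a missing final empty line
          yield parse_block(block)

  def parse_block(lines):
      sentence_id = lines[0].lstrip("# ")
      rows = [line.split("\t") for line in lines[1:]]
      # Predicates are the tokens marked '+' in the pred column (field 6),
      # in linear order; the i-th argument column refers to the i-th of these.
      predicates = [int(row[0]) for row in rows if row[5] == "+"]
      tokens, tops, edges = [], [], []
      for row in rows:
          ident = int(row[0])
          tokens.append((ident, row[1], row[2], row[3], row[6]))  # form, lemma, pos, frame
          if row[4] == "+":                   # top column (field 5)
              tops.append(ident)
          for i, role in enumerate(row[7:]):  # argument columns (fields 8+)
              if role != "_":
                  edges.append((predicates[i], ident, role))
      return sentence_id, tokens, tops, edges

On the example above, this yields tops [3] and the edges (1, 2, 'compound'), (3, 2, 'ARG1'), and (3, 4, 'ARG2').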

Companion Data

In the open and gold tracks of the task (see the evaluation rules for details), we expect participants to draw on additional tools or resources beyond the training data provided, notably syntactic parsing.  To aid participation in the open track, and for increased comparability of results, we will make available a set of ‘companion’ data files, providing syntactic analyses from state-of-the-art data-driven parsers.  Once the task enters its evaluation phase, the same range and format of syntactic analyses will be provided as companion files for the test data.

We are still discussing exactly how many such syntactic views on our data to prepare, but we plan on providing at least one dependency and one phrase-structure view, i.e. (a) analyses from the parser of Bohnet & Nivre (2012), with bi-lexical syntactic dependencies in the so-called Stanford Basic scheme (de Marneffe et al., 2006), and (b) PTB-style constituent trees as produced, for example, by the parsers of Charniak & Johnson (2005) and Petrov & Klein (2007).  Our companion data will be distributed in a token-oriented, tab-separated form (very similar to formats used at previous CoNLL Shared Tasks on data-driven dependency parsing and semantic role labeling).  It will be aligned at the sentence and token levels to our official training and test data files and can thus be viewed as augmenting these files with additional columns of explicit syntactic information.
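Since the companion files are aligned at the sentence and token levels, attaching their syntactic columns to the official data reduces to a per-line merge.  The following sketch assumes that both files carry matching token identifiers in their first column and the word form in their second; the helper name and the exact companion column layout are ours, for illustration:

  def merge_token_lines(task_line, companion_line):
      """Append a companion line's extra columns to the matching task line.

      Assumes both lines describe the same token; the number and meaning
      of the companion columns is illustrative, not fixed by the task.
      """
      task_fields = task_line.split("\t")
      companion_fields = companion_line.split("\t")
      assert task_fields[0] == companion_fields[0], "token ids must align"
      return "\t".join(task_fields + companion_fields[2:])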

Licensing and Distribution

Large parts of the data prepared for this task are derivative of the PTB and other resources distributed by the Linguistic Data Consortium (LDC).  We have established an agreement with the LDC that makes it possible for all task participants to obtain our training, development, and test data free of charge (for use in connection with SemEval 2015), whether they are LDC members or not.  Participants will need to enter into a license agreement with the LDC (which will be provided to registered participants and must be signed and submitted electronically to the task organizers); they will then be able to download the data.  Please subscribe to the mailing list for this task for further information on obtaining the task data.

Trial Data

We have prepared the first 20 documents from Section 00 of the PTB WSJ Corpus, instantiating the file format described above and the three types of semantic dependencies used in this task.  This trial data has been available for public download since Sunday, June 1, 2014.

Software Support

We are currently putting together a supporting SDP toolkit, essentially a reference implementation (in Java) of reading and writing the task file format, some quantitative analysis of semantic dependency graphs, and the official task scorer.
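For orientation, the core of edge-based evaluation can be sketched as below: labeled precision, recall, and F1 over (predicate, argument, role) triples.  This is a simplified illustration of our own; the scorer shipped with the toolkit remains the authoritative implementation (and also covers further metrics such as ‘complete predicates’ and ‘semantic frames’):

  def labeled_scores(gold_edges, system_edges):
      """Labeled precision, recall, and F1 over dependency triples;
      a simplified sketch, not the official SDP scorer."""
      gold, system = set(gold_edges), set(system_edges)
      correct = len(gold & system)
      precision = correct / len(system) if system else 0.0
      recall = correct / len(gold) if gold else 0.0
      f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
      return precision, recall, f1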

Contact Info

Organizers

  • Dan Flickinger
  • Jan Hajič
  • Angelina Ivanova
  • Marco Kuhlmann
  • Yusuke Miyao
  • Stephan Oepen
  • Daniel Zeman

sdp-organizers@emmtee.net

Other Info

Announcements

[06-feb-15] Final evaluation results for the task are now available; we are grateful to all (six) participating teams.

[08-jan-15] The evaluation period is nearing completion; we have purged inactive subscribers from the task-specific mailing list and sent out important information on the submission of system outputs for evaluation to the list; if you have not received this email but are actually preparing a system submission, please contact the organizers immediately.

[17-dec-14] We are about to enter the evaluation phase, but recall that the closing date has been extended to Thursday, January 15, 2015. We have sent important instructions on how to participate in the evaluation to the task-specific mailing list; if you plan on submitting system results to this task but have not seen these instructions, please contact the organizers immediately.

[22-nov-14] English ‘companion’ syntactic analyses in various dependency formats are now available, for use in the open and gold tracks.

[20-nov-14] We have completed the production of cross-lingual training data: some 31,000 PAS graphs for Chinese and some 42,000 PSD graphs for Czech. At the same time, we have prepared an update of the English training data, with somewhat better coverage and a few improved analyses in DM, as well as with additional re-entrancies (corresponding to grammatical control relations) in PSD. The data is available for download as Version 1.1 from the LDC. Owing to the delayed availability of the cross-lingual data, we have moved the closing date for the evaluation period to mid-January 2015.

[14-nov-14] An update to the SDP toolkit (now hosted at GitHub) is available, implementing the additional evaluation metrics ‘complete predicates’ and ‘semantic frames’.

[05-aug-14] We are (finally) ready to officially ‘launch’ SDP 2015: the training data is now available for distribution through the LDC; please register for SemEval 2015 Task 18, and within a day (or so) we will be in touch about data licensing and access information.

[03-aug-14] Regrettably, we are running late in making available the training data and technical details of the 2015 task setup; please watch this page for updates over the next couple of days!

[01-jun-14] We have started to populate the task web pages, including some speculative information on extensions (compared to the 2014 variant of the task) that we are still discussing. A first sample of trial data is available for public download.