Interactive Search

Motivation

For ease of interactive exploration and comparison between our three annotation formats, there is an experimental search interface (hosted at the University of Oslo).  Using a query-by-example approach, it is possible to retrieve instances of specific semantic phenomena, across different annotations, and inspect matching semantic dependency graphs graphically.

Query Language

The SDP search interface interprets a simple set of search operators, collectively dubbed the WeSearch Query Language (WQL).  By way of informal introduction, consider the following example query:

  /v*[ARG* x]
  quarterly[ARG1 x]
  x:+result

The query above consists of three predications, conventionally shown as one per line.  In this example, the following characters have operator status: ‘/’ (slash), ‘*’ (asterisk), ‘[’ and ‘]’ (left and right square bracket), ‘:’ (colon), and ‘+’ (plus sign).  This is a near-complete list of operator characters in WQL.  Each predication can be composed of (i) an identifier, followed by a colon, if present; (ii) a form pattern; (iii) a lemma pattern, prefixed by a plus sign, if present; (iv) a part-of-speech (PoS) pattern, prefixed by a slash, if present; and (v) a list of arguments, enclosed in square brackets, if present.  Patterns can make use of Lucene-style wildcards, with the asterisk (‘*’) matching any number of characters and the question mark (‘?’) matching a single character.
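To make the wildcard behaviour concrete, here is a minimal Python sketch that translates a Lucene-style pattern into a regular expression.  The function names are hypothetical and this is not the code used by the search interface; it assumes that the asterisk may also match the empty string, as it does in Lucene.

```python
import re

def wildcard_to_regex(pattern, ignore_case=False):
    """Compile a Lucene-style wildcard pattern into a regular expression.

    '*' stands for zero or more characters, '?' for exactly one character;
    every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile("".join(parts), flags)

def matches(pattern, value, ignore_case=False):
    """Return True if the whole value matches the wildcard pattern."""
    return wildcard_to_regex(pattern, ignore_case).fullmatch(value) is not None

print(matches("v*", "VBD", ignore_case=True))  # True
print(matches("?at", "at"))                    # False: '?' needs one character
```

The ignore_case flag is there because some query fields are matched without regard to case while others may not be; which fields fall into which group is left to the caller.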

Argument specifications in WQL take the form of role–value pairs, where roles draw from a fixed inventory of pre-defined argument labels (specific to each annotation format), and values are predication identifiers defined in other parts of the query.  The role label and value are separated by whitespace, and multiple arguments can be specified within the list by using a comma (‘,’) as the separator.  In role labels, wildcards can be used just like in other query fields.
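The anatomy of a predication can be illustrated with a small Python sketch that splits a single predication into its five optional parts.  The Predication class and both functions are hypothetical names introduced here; the sketch ignores escaping and the boolean and top operators, and it is not the parser used by the search interface.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# One simple predication: [id:][form][+lemma][/pos][[role value, ...]]
PREDICATION = re.compile(
    r"""
    (?:(?P<id>\w+):)?               # (i) identifier, followed by a colon
    (?P<form>[^+/\[\]\s:]+)?        # (ii) form pattern
    (?:\+(?P<lemma>[^/\[\]\s]+))?   # (iii) lemma pattern, after a plus sign
    (?:/(?P<pos>[^\[\]\s]+))?       # (iv) PoS pattern, after a slash
    (?:\[(?P<args>[^\]]*)\])?       # (v) argument list in square brackets
    """,
    re.VERBOSE,
)

@dataclass
class Predication:
    identifier: Optional[str] = None
    form: Optional[str] = None
    lemma: Optional[str] = None
    pos: Optional[str] = None
    args: List[Tuple[str, Optional[str]]] = field(default_factory=list)

def parse_arguments(text):
    """Split 'ROLE value, ROLE value' into (role, value) pairs.

    The value is None when only a role label is given.
    """
    pairs = []
    for item in text.split(","):
        fields = item.split()
        if fields:
            pairs.append((fields[0], fields[1] if len(fields) > 1 else None))
    return pairs

def parse_predication(text):
    match = PREDICATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a simple predication: {text!r}")
    return Predication(
        identifier=match.group("id"),
        form=match.group("form"),
        lemma=match.group("lemma"),
        pos=match.group("pos"),
        args=parse_arguments(match.group("args") or ""),
    )

for line in ["/v*[ARG* x]", "quarterly[ARG1 x]", "x:+result"]:
    print(parse_predication(line))
```

Applied to the three predications of the example query, this yields one record with only a PoS pattern and an argument, one with a form pattern and an argument, and one with an identifier and a lemma pattern.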

Thus, our example query above searches for a verbal predicate (any PoS tag starting with ‘v’) that takes any form of the lemma ‘result’ as its argument, where that same node is also the ARG1 of ‘quarterly’ (this query is designed for the DM annotation format, where regular argument relations take the form ARG1 ... ARGn).  The query processor will ensure a one-to-one correspondence between query elements and matching graph elements, i.e. multiple distinct query components cannot match against the same target (graph component), or vice versa.  Lemma and PoS patterns, as well as role labels, are not case-sensitive.
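The one-to-one requirement can be illustrated with a brute-force Python sketch that matches the example query against a toy graph.  The graph, the dictionary representation of the query, and all function names are invented for illustration; the sketch assumes that form patterns are case-sensitive, that every argument value is defined elsewhere in the query, and it enforces the one-to-one correspondence for nodes only, not for arcs.  It says nothing about how the actual query processor works.

```python
import re
from itertools import permutations

def wild(pattern, value, ignore_case=False):
    """Lucene-style wildcard test: '*' = any substring, '?' = one character."""
    regex = "".join(
        ".*" if c == "*" else "." if c == "?" else re.escape(c) for c in pattern
    )
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(regex, value, flags) is not None

# Toy graph, invented for illustration (not taken from the task data).
NODES = {
    1: {"form": "posted", "lemma": "post", "pos": "VBD"},
    2: {"form": "quarterly", "lemma": "quarterly", "pos": "JJ"},
    3: {"form": "results", "lemma": "result", "pos": "NNS"},
}
EDGES = [(1, "ARG2", 3), (2, "ARG1", 3)]  # (source, role, target)

# The example query, already split into fields (hypothetical representation).
QUERY = [
    {"pos": "v*", "args": [("ARG*", "x")]},
    {"form": "quarterly", "args": [("ARG1", "x")]},
    {"id": "x", "lemma": "result"},
]

def node_ok(pred, node):
    """Check the form, lemma, and PoS patterns of one predication."""
    return (
        wild(pred.get("form", "*"), node["form"])
        and wild(pred.get("lemma", "*"), node["lemma"], ignore_case=True)
        and wild(pred.get("pos", "*"), node["pos"], ignore_case=True)
    )

def find_matches(query, nodes, edges):
    results = []
    # permutations() never repeats a node within one candidate assignment,
    # so distinct predications are always mapped to distinct nodes.
    for chosen in permutations(nodes, len(query)):
        pairs = list(zip(query, chosen))
        if not all(node_ok(pred, nodes[n]) for pred, n in pairs):
            continue
        by_id = {pred["id"]: n for pred, n in pairs if "id" in pred}
        if all(
            any(
                s == n and wild(role, r, ignore_case=True) and t == by_id[value]
                for s, r, t in edges
            )
            for pred, n in pairs
            for role, value in pred.get("args", [])
        ):
            results.append(chosen)
    return results

print(find_matches(QUERY, NODES, EDGES))  # [(1, 2, 3)]
```

Because candidate assignments never reuse a node, a query with two identical predications can only succeed on a graph that contains two distinct matching nodes.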

In addition to the query proper, the search interface provides a set of radio buttons to select which of the three annotation formats to query; this selection can have implications for the matching of format-specific properties (e.g. lemmas) and for the interpretation of underspecified role labels (see below).  It is possible to search multiple formats in parallel (all three are active by default), and independent of the active set of formats for the search, annotations in all formats will always be presented for inspection for the items that matched the query.

The result page uses a tabbed display organization, aiming to make it easy to switch between annotation formats and graph or tabular display of matching items.  Color highlighting is used to indicate which parts of each result structure were matched by corresponding components of the query; as there can be more than one match in a single result, the interface allows ‘cycling through’ individual matches, one by one.

Boolean Connectives

In our example query above, the individual predications are implicitly conjoined, i.e. all three need to be matched against a candidate result graph for the query to be satisfied (formally, one might say that the whitespace separating predications serves as a conjunction operator).  Albeit with somewhat mixed feelings, we further experiment with additional boolean connectives in WQL, viz. negation (‘!’, exclamation point) and disjunction (‘|’, vertical bar); to complement these logical operators, parentheses (‘(’ and ‘)’) can be used to group expressions, to make explicit or override the scoping of logical operators.  By default, negation and conjunction bind more tightly (i.e. scope more narrowly) than disjunction (which scopes widely, i.e. at the top level or within an enclosing logical group).
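The default scoping can be made explicit with a small recursive-descent sketch in Python that groups predications into a nested and/or/not structure.  The tokenizer and the parse function are hypothetical and do not handle the backslash escape; the sketch only illustrates the precedence rules stated above, not the actual WQL implementation.

```python
import re

# A token is a single logical operator or parenthesis, or one whole
# predication (bracketed argument lists may contain whitespace).
TOKEN = re.compile(r"[!|()]|(?:[^\s!|()\[\]]|\[[^\]]*\])+")

def parse(query):
    """Return a nested tuple structure reflecting operator scope."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        return tok

    def disjunction():  # lowest precedence: scopes widely
        parts = [conjunction()]
        while peek() == "|":
            take()
            parts.append(conjunction())
        return parts[0] if len(parts) == 1 else ("or", *parts)

    def conjunction():  # implicit: adjacent items separated by whitespace
        parts = [unary()]
        while peek() not in (None, "|", ")"):
            parts.append(unary())
        return parts[0] if len(parts) == 1 else ("and", *parts)

    def unary():  # negation applies to the next predication or group
        tok = take()
        if tok == "!":
            return ("not", unary())
        if tok == "(":
            inner = disjunction()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return inner
        if tok in ("|", ")"):
            raise ValueError(f"unexpected {tok!r}")
        return tok

    tree = disjunction()
    if pos != len(tokens):
        raise ValueError(f"unexpected {tokens[pos]!r}")
    return tree

print(parse("a b | c d"))    # ('or', ('and', 'a', 'b'), ('and', 'c', 'd'))
print(parse("( a | b ) c"))  # ('and', ('or', 'a', 'b'), 'c')
```

Here a, b, c, and d stand in for arbitrary predications; without the parentheses in the second call, c would be conjoined with b only.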

More Examples

Following is a more complex example, searching for object equi verbs and taking advantage of an underspecified role label:

  [ARG2 x, ARG* e]
  e:/v*[ARG1 x]

A similar effect, requiring the ‘downstairs’ predicate to be any type of argument (within certain assumptions about the applicable range of role labels) to the ‘upstairs’ one, could instead be achieved using a disjunctive statement (note the need for logical grouping of the two disjuncts, in relation to the conjunction):

  ( [ARG2 x, ARG3 e] | [ARG2 x, ARG4 e] )
  e:/v*[ARG1 x]

The following query demonstrates the use of the top operator (‘^’), to retrieve graphs rooted in a coordinate structure, i.e. where the top node has an outgoing dependency matching the pattern ‘_*_c’ (again, assuming the DM format); here, specification of the role value can be omitted, as there is no predication constraining the argument node:

  ^[_*_c]

As an example of the (experimental) use of negation to filter candidate results, the following query will match occurrences of verbal nodes that have no outgoing or incoming argument links:

  x:/v*
  !x:[* y]
  ![* x]

However, as of mid-December 2013, the definition and implementation of boolean operators in WQL are to some degree still work in progress.

Full List of Operators

  • ^ (caret), constrains the node to be a top node (must be predication-initial);
  • : (colon), separates optional node identifier from node content;
  • [ and ] (left and right square brackets), enclose the list of outgoing arcs;
  •   (whitespace), separates role labels and values in list of arcs;
  • , (comma), separates role–value pairs within list of outgoing arcs;
  • + (plus sign), indicates (optional) lemma object property;
  • / (slash), indicates (optional) PoS property;
  • ? (question mark), Lucene-style single-character wildcard;
  • * (asterisk), Lucene-style arbitrary sub-string wildcard;
  • ( and ) (left and right parentheses), group sub-expressions (see above);
  • | (vertical bar), logical disjunction of predications or groups;
  • ! (exclamation mark), reserved for negation (must precede a predication or logical group);
  • \ (backslash), escape character, suppress operator status for any of the above.

Contact Info

Organizers

  • Dan Flickinger
  • Jan Hajič
  • Marco Kuhlmann
  • Yusuke Miyao
  • Stephan Oepen
  • Yi Zhang
  • Daniel Zeman

sdp-organizers@emmtee.net

Other Info

Announcements

[22-apr-14] Complete results (system submissions and official scores) as well as the gold-standard test data are now available for public download.

[31-mar-14] We have received submissions from nine teams; a draft summary of evaluation results has been emailed to participating teams.

[25-mar-14] We have posted some additional, task-specific instructions for how to submit system results to the SemEval evaluation; please make sure to follow these requirements carefully.

[22-mar-14] The test data (and corresponding ‘companion’ syntactic analyses, for use in the open track) are now available to registered participants; please see the task mailing list for details.

[08-mar-14] We have released a minor update to the companion archive, adding a handful of missing dependencies and fixing a problem in the file format.

[05-feb-14] We have posted the description of a baseline approach and experimental results on the suggested development sub-set of our training data (Section 20) on the evaluation page; on the same page, we have further specified the mechanics of submitting results to the evaluation.

[17-jan-14] Version 1.0 of the ‘companion’ data for the open track is now available, providing syntactic analyses (in phrase structure and bi-lexical dependency form) as overlays to our training data.  Please see the file README.txt in the companion archive for details.

[13-jan-14] We are releasing an update to the training data today, making a number of minor improvements to the DM and PCEDT graphs; also, we are now providing an on-line interface to search and explore visually the target representations for this task.  For details, please see our task-specific mailing list.

[12-dec-13] Some 750,000 tokens of WSJ text, annotated in our three semantic dependency formats, will become available for download tomorrow.  To obtain the data, prospective participants need to enter a no-cost evaluation license with the Linguistic Data Consortium (LDC).  For access to the license form, please subscribe to our spam-protected mailing list.  Next, we are working to prepare our syntactic ‘companion’ data (to serve as optional input in the open track), which we expect to release in early January.

[24-nov-13] Version 1.1 of the trial data is now available, adding missing lemma values and streamlining argument labels in the DM format, removing a handful of items that used to have empty graphs in PAS, and generally aligning all items at the level of individual tokens (leaving 189 sentences in our trial data).  This last move means that all three formats now uniformly use single-character Unicode glyphs for quote marks, dashes, ellipses, and apostrophes (rather than multi-character LaTeX-style approximations, as were used in the original ASCII release of the text).  Furthermore, we encourage all interested parties, including prospective participants, to subscribe to our spam-protected mailing list, where we will post updates a little more frequently than on the general task web site.

[07-nov-13] We have clarified the interpretation of the top column (and renamed it from the earlier root) and elaborated the discussion of graph properties in the various formats.  We will continue to extend and revise the documentation on our three types of dependency graphs, but only announce such incremental changes here when they affect the data format.

[04-nov-13] A 198-sentence subset of what will be the training data has been released as trial data, to exemplify the file format and type of annotations available.  Please do get in touch, in case you see anything surprising!

[28-oct-13] We are in the process of finalizing the task description, posting some example dependencies, and making available some trial data.  For the time being, please consider these pages very much a work in progress, i.e. contents and form will be subject to refinement over the next few days.