SemEval-2015 Task 15: CPA

Overview

Corpus Pattern Analysis (CPA, [Hanks13]) is a new technique of language analysis that identifies the main patterns in which words are used in text.

This task focuses on the current output of CPA (work in progress): the Pattern Dictionary of English Verbs (PDEV), a lexical resource which can be browsed online. Contrary to most semantic resources, PDEV starts by analysing corpus data, rather than by speculating about possible meanings; as a general rule, only patterns found in the text samples are listed. Each pattern specifies a contextual environment in which the verb is used. In a second step, each pattern is mapped onto a "primary implicature" (which is similar to a "definition" in a traditional dictionary). The pattern description includes (among other things) the argument structure (subject, object, complement, adverbial) and the semantic type shared by a set of lexical items in each argument slot, taken from a corpus-based shallow semantic ontology (for more details, see [Hanks13]).

Recent work on PDEV ([El-Maarouf14], [El-Maarouf13], [Popescu14], [Popescu12]) suggests that this resource can be a valuable asset for the NLP community. The goal of this task is to break down the different levels of analysis required to build a dictionary, and to propose each of them as steps that NLP systems can tackle separately. Three main sub-tasks have been identified:

  • CPA parsing: all sentences in the dataset must be syntactically and semantically analysed.
  • CPA clustering: all sentences in the dataset must be compared and grouped according to their similarities.
  • CPA lexicography: all verbs in the dataset must be described with a list of patterns.

Each sub-task can be evaluated separately, and participants are encouraged to design systems which can successfully tackle all three sub-tasks. For this reason, all sub-tasks will be evaluated on the same verbs.

Subtask 1 — CPA parsing

Task description

Systems participating in the first sub-task will identify the main arguments of the verb and tag them with semantic types (Human, Process, Property, etc.). The task is similar to Semantic Role Labelling ([Carreras04]) except that arguments will be identified in the dependency parsing paradigm ([Buchholz06]).

The tagset used is minimal for the syntax layer, and based on the CPA Semantic Ontology for the semantic layer.

This task includes two setups: the first subset of verbs is provided without the identification of the number of patterns in the gold standard, and the second provides this information.

Subtask 2 — CPA Clustering

Task description

In the first sub-task, participating systems must identify syntactic arguments and their semantic type. By doing so, they can discover patterns and regularities shared by verb instances. The aim of the CPA Clustering task is to evaluate the ability of systems to discover similarities and to cluster the most similar sentences together. Similarity here means that two sentences belong to the same CPA pattern.
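As a rough illustration of the idea, the sketch below groups verb instances whose argument annotations match exactly. It is a toy baseline and not the task's method: the instance records, their field names (`id`, `subject`, `object`, `preposition`) and the example values are all hypothetical, and real CPA patterns allow alternations between semantic types that exact matching would miss.

```python
from collections import defaultdict


def cluster_by_signature(instances):
    """Group instance ids that share the same argument signature.

    Each instance is a dict with hypothetical keys: "id", "subject",
    "object" and "preposition" (None when the slot is empty).
    """
    clusters = defaultdict(list)
    for inst in instances:
        # The signature is the tuple of semantic types and preposition.
        signature = (inst.get("subject"), inst.get("object"), inst.get("preposition"))
        clusters[signature].append(inst["id"])
    return dict(clusters)


# Invented annotations for three instances of one verb (illustration only).
instances = [
    {"id": 1, "subject": "Human", "object": None, "preposition": "against"},
    {"id": 2, "subject": "Human", "object": "Human", "preposition": None},
    {"id": 3, "subject": "Human", "object": None, "preposition": "against"},
]
clusters = cluster_by_signature(instances)
```

Here instances 1 and 3 end up in the same cluster because they share a subject type, lack a direct object and take the same preposition.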

The task includes two setups: the first subset of verbs is provided without the identification of the number of patterns in the gold standard, and the second provides this information.

Subtask 3 — CPA Automatic Lexicography (AutLex)

Task description

This task aims to evaluate how closely systems can approach the design of a lexicographical entry using the CPA framework. The data has been simplified to a form which is more tractable for systems while still being a relevant representation from the lexicographical point of view.

CPA patterns indicate the collocational, syntactic and semantic preferences of major uses of a word. A pattern like

[[Human | Institution]] battle [NO OBJ] {against [[Anything = Problem]]}

means that:

  • the subject can be of the semantic type Human or Institution (semantic alternation),
  • no direct object should occur,
  • the verb should be followed by a prepositional adverbial made up of the preposition 'against' and a complement of any semantic type (Anything), which has the contextual role of "Problem".

This task simplifies this pattern to a set of major pattern elements: the syntactic and semantic structures. The previous pattern is simplified to:

Human|Institution battle against Anything
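The simplification shown above can be sketched in a few lines of Python. This is a minimal sketch inferred from the single example given here, not the organisers' conversion procedure: the function name `simplify_pattern` is hypothetical, and the rules (drop `[NO OBJ]`, drop contextual roles after `=`, drop the braces around adverbials) are assumptions that may not cover every notation used in PDEV.

```python
import re


def simplify_pattern(pattern: str) -> str:
    """Reduce a full CPA pattern to its main syntactic and semantic elements.

    The rules are inferred from one example and are assumptions.
    """
    # Drop the explicit "no direct object" marker.
    text = re.sub(r"\[NO OBJ\]", " ", pattern)

    def simplify_slot(match):
        # Split semantic alternations on "|" and drop any contextual role
        # given after "=", e.g. "Anything = Problem" becomes "Anything".
        alternatives = match.group(1).split("|")
        types = [alt.split("=")[0].strip() for alt in alternatives]
        return "|".join(types)

    # Rewrite each [[...]] slot as a bare list of semantic types.
    text = re.sub(r"\[\[(.*?)\]\]", simplify_slot, text)
    # Remove the braces around adverbials, keeping their content.
    text = text.replace("{", " ").replace("}", " ")
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())


full = "[[Human | Institution]] battle [NO OBJ] {against [[Anything = Problem]]}"
simplified = simplify_pattern(full)
```

Applied to the pattern for "battle", the function returns the simplified form given above.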

This task includes two setups: the first subset of verbs is provided without the identification of the number of patterns in the gold standard, and the second provides this information.

 

References

  • [Bradbury13] Bradbury, Jane and El Maarouf, Ismail. 2013. "An empirical classification of verbs based on Semantic Types: the case of the 'poison' verbs". In Proceedings of JSSP.
  • [Buchholz06] Buchholz, Sabine and Marsi, Erwin. 2006. "CoNLL-X shared task on multilingual dependency parsing". In Proceedings of CoNLL, New York.
  • [Carreras04] Carreras, Xavier and Marquez, Lluis. 2004. "Introduction to the CoNLL-2004 shared task: Semantic role labeling". In Proceedings of CoNLL, Boston.
  • [Hanks13] Hanks, Patrick. 2013. Lexical Analysis: Norms and Exploitations. Cambridge, MA: MIT Press.
  • [Hanks05] Hanks, Patrick and Pustejovsky, James. 2005. "A Pattern Dictionary for Natural Language Processing". In Revue Française de linguistique appliquée, 10:2.
  • [El-Maarouf13] El Maarouf, Ismail and Baisa, Vít. 2013. "Automatic classification of semantic patterns from the Pattern Dictionary of English Verbs". In Proceedings of JSSP.
  • [El-Maarouf14] El Maarouf, Ismail and Bradbury, Jane and Baisa, Vít and Hanks, Patrick. 2014. "Disambiguating Verbs by Collocation: Corpus Lexicography meets Natural Language Processing". In Proceedings of LREC, Reykjavik.
  • [Popescu12] Popescu, Octavian. 2012. "Building a Resource of Patterns Using Semantic Types". In Proceedings of LREC, Istanbul.
  • [Popescu14] Popescu, Octavian and Palmer, Martha and Hanks, Patrick. 2014. "Mapping CPA onto OntoNotes Senses". In Proceedings of LREC, Reykjavik.
  • [Pustejovsky04] Pustejovsky, James and Hanks, Patrick and Rumshisky, Anna. 2004. "Automated Induction of Sense in Context". In Proceedings of COLING 2004, Geneva, Switzerland.

Contact Info

  • Vít Baisa (Masaryk University, Brno, CZ),
  • Jane Bradbury (University of Wolverhampton, UK),
  • Ismaïl El Maarouf (University of Wolverhampton, UK),
  • Patrick Hanks (University of Wolverhampton, UK),
  • Adam Kilgarriff (Lexical Computing Ltd, UK),
  • Octavian Popescu (FBK, Trento, IT)

Announcements

  • September 29th: Training data has been updated!
  • August 19th: Training data has been released!
  • June 3rd: Trial data has been released!
  • June 5th: A Google group for discussion has been created; you can email us at semeval2015task15@googlegroups.com