Data and Resources

Our first data set, containing English homographic puns, is the one described by Miller & Turković (2016) and Miller (2016). It contains punning and non-punning jokes, aphorisms, and other short texts sourced from professional humorists and online collections, and is licensed under the Creative Commons Attribution-Noncommercial (CC BY-NC) licence. The second data set, containing heterographic puns, is currently under construction and will be similar in size and scope. Both data sets will be provided in an XML format similar to that used in previous Senseval/SemEval WSD tasks.

For both data sets, the instances have the following characteristics:

  • Each text contains a maximum of one pun.
  • Each pun (and its latent target) contains exactly one content word (noun, verb, adjective, adverb) and zero or more non-content words such as prepositions.  Here "word" is defined as a sequence of letters delimited by space or punctuation.  This means that puns and targets do not include hyphenated words, and they do not consist of multi-word expressions containing more than one content word, such as "get off the ground" or "state of the art".  Puns and targets may be multi-word expressions containing only one content word – this includes phrasal verbs such as "take off" or "put up with".
  • Each pun (and its latent target) has a lexical entry in WordNet 3.1.  However, the particular sense of the pun or of the target may or may not exist in WordNet 3.1.  (Instances where the pun or target meaning is missing from WordNet are removed from the data set for Subtask 3.)
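
The word definition and the one-content-word constraint above can be illustrated with a minimal Python sketch. It is not part of the task tooling: the function names and the tiny NON_CONTENT list are hypothetical stand-ins, and a real system would use a part-of-speech tagger or a fuller function-word list to decide what counts as a content word.

```python
import re

# Hypothetical, deliberately tiny list of non-content words, for illustration
# only; it is NOT an official resource of the task.
NON_CONTENT = {"off", "up", "with", "of", "the", "to", "in", "on", "out"}

# A "word" is a maximal run of letters, so spaces, hyphens and other
# punctuation all act as delimiters.
WORD_RE = re.compile(r"[^\W\d_]+")


def words(text):
    """Split text into words using the letters-only definition."""
    return WORD_RE.findall(text)


def content_words(text):
    """Return the words that are not in the (assumed) non-content list."""
    return [w for w in words(text) if w.lower() not in NON_CONTENT]


def is_admissible(expression):
    """Check the data set constraint: no hyphen, exactly one content word."""
    return "-" not in expression and len(content_words(expression)) == 1


if __name__ == "__main__":
    for expr in ["take off", "put up with", "get off the ground",
                 "state of the art", "well-known"]:
        print(expr, "->", is_admissible(expr))
```

Under these assumptions, "take off" and "put up with" pass the check, while "get off the ground", "state of the art", and any hyphenated form do not.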

For Subtask 3, participants must apply senses from version 3.1 of WordNet, an electronic semantic network. However, they are not limited to the use of WordNet for this subtask, or for any other subtask. For all subtasks involving the second data set, participants may wish to make additional use of lexical-semantic resources that include pronunciation information, such as Wiktionary or the CMU Pronouncing Dictionary. This is because heterographic puns present an additional challenge to the interpretation process: in these puns the target (second meaning) has a different spelling and, usually, a different yet similar-sounding pronunciation.

For such puns, recovering the target lexeme therefore requires not only computational lexical semantics but also some computational model of sound similarity. Implementations of general-purpose sound similarity models such as Soundex (Knuth, 1973:391–392) and Metaphone (Philips, 1990) are already widely available. Alternatively, participants may wish to use more sophisticated models developed for the computational detection of cognates, surveys of which can be found in Kondrak (2002) and Kondrak & Sherif (2006), or for puns (Hempelmann, 2003). Machine-readable data for implementing Hempelmann's pun-based model of sound similarity will be made freely available to participants.
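
To give a sense of what a general-purpose sound similarity model looks like, here is a minimal Python sketch of the standard American Soundex coding. It is only an illustration, not a recommended or official baseline for the task; the function name soundex and the example word pairs are our own choices.

```python
# Letter-to-digit table for American Soundex.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"],
                                start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word):
    """Return the four-character Soundex code of an ASCII word."""
    letters = [c for c in word.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    # The first letter's own code takes part in the adjacency rule.
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code is kept
    return result.ljust(4, "0")  # pad short codes with zeros


if __name__ == "__main__":
    for pun, target in [("prophet", "profit"), ("bare", "bear"),
                        ("knight", "night")]:
        print(pun, soundex(pun), target, soundex(target))
```

Because Soundex always keeps the first letter, it assigns the same code to "prophet" and "profit" but different codes to "knight" and "night", even though the latter pair are pronounced identically. This is one reason the pronunciation-based resources and the more specialised models mentioned above may be preferable.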


Note that, because of the difficulty of amassing a large number of pun examples per word or per sense, there is no training data for this task.  We expect the task to be most amenable to approaches that are unsupervised or knowledge-based rather than supervised.

