Data and Resources
Our first data set, containing English homographic puns, is the one described by Miller & Turković (2016) and Miller (2016). It contains punning and non-punning jokes, aphorisms, and other short texts sourced from professional humorists and online collections, and is licensed under the Creative Commons Attribution-Noncommercial (CC BY-NC) licence. The second data set, containing heterographic puns, is currently under construction and will be similar in size and scope. Both data sets will be provided in an XML format similar to that used in previous Senseval/SemEval WSD tasks.
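As a rough illustration of how such a file might be read, the following sketch parses a small XML fragment with Python's standard library. The element and attribute names (`corpus`, `text`, `word`, `id`) and the sample sentence are assumptions made for this example only; the released files define the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the tag and attribute names are assumed for this
# sketch and may differ from those in the distributed data.
SAMPLE = """<corpus lang="en">
  <text id="t_1">
    <word id="t_1_1">I</word>
    <word id="t_1_2">lost</word>
    <word id="t_1_3">interest</word>
    <word id="t_1_4">.</word>
  </text>
</corpus>"""


def load_texts(xml_string):
    """Return {text id: [(word id, token), ...]} for every text element."""
    root = ET.fromstring(xml_string)
    return {
        text.get("id"): [(w.get("id"), w.text) for w in text.iter("word")]
        for text in root.iter("text")
    }


texts = load_texts(SAMPLE)
for text_id, words in texts.items():
    print(text_id, " ".join(token for _, token in words))
```

Keeping the word identifiers alongside the tokens makes it straightforward to report a pun location as an identifier rather than as a character offset.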
For both data sets, the instances have the following characteristics:
- Each text contains a maximum of one pun.
- Each pun (and its latent target) contains exactly one content word (noun, verb, adjective, adverb) and zero or more non-content words such as prepositions. Here "word" is defined as a sequence of letters delimited by space or punctuation. This means that puns and targets do not include hyphenated words, and they do not consist of multi-word expressions containing more than one content word, such as "get off the ground" or "state of the art". Puns and targets may be multi-word expressions containing only one content word – this includes phrasal verbs such as "take off" or "put up with".
- Each pun (and its target) has a lexical entry in WordNet 3.1. However, the sense of the pun or the target may or may not exist in WordNet 3.1. (Instances where the pun or target meaning is missing from WordNet are removed from the data set for Subtask 3.)
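The definition of "word" above, and the limit of one content word per pun, can be sketched as a simple check. This is only an approximation: the small `NON_CONTENT_WORDS` set is a hypothetical stand-in for real part-of-speech information, and the function names are invented for this example.

```python
import re

# A "word" is a maximal run of letters; spaces, hyphens and other
# punctuation all act as delimiters.
WORD_PATTERN = re.compile(r"[A-Za-z]+")

# Hypothetical stand-in for a part-of-speech tagger: a tiny hand-picked
# list of non-content words, sufficient only for the examples below.
NON_CONTENT_WORDS = {"a", "an", "the", "of", "off", "up", "with", "to", "in", "on"}


def split_words(expression):
    """Split an expression into words under the letter-sequence definition."""
    return WORD_PATTERN.findall(expression)


def has_single_content_word(expression):
    """True if exactly one word falls outside the non-content word list."""
    content = [w for w in split_words(expression) if w.lower() not in NON_CONTENT_WORDS]
    return len(content) == 1


for candidate in ["take off", "put up with", "get off the ground", "state of the art"]:
    print(candidate, has_single_content_word(candidate))
```

Under this definition a hyphenated form such as "well-known" splits into two words, both of them content words, so it fails the check in the same way as "state of the art".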
For Subtask 3, participants must apply senses from version 3.1 of WordNet, an electronic semantic network. However, they are not limited to the use of WordNet for this subtask, nor for any other subtasks. For all subtasks involving the second data set, participants may wish to make additional use of lexical-semantic resources that include pronunciation information, such as Wiktionary or the CMU Pronouncing Dictionary. This is because heterographic puns present an additional challenge to the interpretation process; in these puns the target (second meaning) has a different spelling and, usually, a different yet similar-sounding pronunciation.
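One way to use pronunciation information is to compare words as phoneme sequences rather than as spellings. The sketch below assumes a line format in the style of the CMU Pronouncing Dictionary (a headword followed by ARPAbet symbols with stress digits). The sample entries are illustrative and should be checked against the real dictionary file, and the function names are hypothetical.

```python
import re

# Illustrative entries written in the style of CMU Pronouncing Dictionary
# lines; they are assumptions for this sketch, not copied from the resource.
SAMPLE_LINES = """\
;;; comment lines start with three semicolons
PROFIT  P R AA1 F AH0 T
PROPHET  P R AA1 F AH0 T
BOARD  B AO1 R D
BIRD  B ER1 D
"""


def parse_pronunciations(text):
    """Map each lower-cased word to a list of stress-free phoneme lists."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head).lower()  # drop variant markers such as "(1)"
        entries.setdefault(word, []).append([p.rstrip("012") for p in phones])
    return entries


def edit_distance(a, b):
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def phoneme_distance(entries, first, second):
    """Smallest edit distance over all pronunciation pairs of two words."""
    return min(edit_distance(p, q) for p in entries[first] for q in entries[second])


entries = parse_pronunciations(SAMPLE_LINES)
print(phoneme_distance(entries, "profit", "prophet"))
print(phoneme_distance(entries, "board", "bird"))
```

Unit-cost edit distance treats every phoneme substitution as equally bad, which is a deliberate simplification: a model tuned for puns would presumably weight substitutions by how similar the two sounds are.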
Computational recovery of the target lexeme must be achieved not only via computational lexical semantics but also through the application of some computational model of sound similarity. Implementations of general-purpose sound similarity models such as Soundex (Knuth, 1973:391–392) and Metaphone (Philips, 1990) are already widely available. Participants may alternatively wish to avail themselves of more sophisticated models developed for use with computational detection of cognates, surveys of which can be found in Kondrak (2002) and Kondrak & Sherif (2006), or with puns (Hempelmann, 2003). Machine-readable data for implementing Hempelmann's pun-based model of sound similarity will be made freely available to participants.
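For readers unfamiliar with Soundex, a minimal sketch of the commonly used four-character variant follows. It is written from the standard description of the algorithm rather than taken from any particular library, so details may differ from other implementations.

```python
# Consonant classes of the common Soundex variant; vowels and H, W, Y
# receive no code.
_CODES = {}
for _digit, _letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(word):
    """Return the four-character Soundex key of a word ("" if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


print(soundex("bare"), soundex("bear"))
print(soundex("knight"), soundex("night"))
```

The second pair shows a limitation relevant to heterographic puns: because Soundex keeps the first letter verbatim, words that sound alike but begin with different letters receive different keys.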
Download
- Trial and test data, scoring software, and results for all subtasks
- Data for Christian F. Hempelmann's pun-based model of sound similarity
Note that, owing to the difficulty of amassing a large number of pun examples per word or per sense, there is no training data for this task. It is expected that this task will be most amenable to approaches that are unsupervised or knowledge-based rather than supervised.
References
- Hempelmann, Christian F. (2003). Paronomasic Puns: Target Recoverability Towards Automatic Generation. Ph.D. thesis. West Lafayette, IN: Purdue University, Aug. 2003.
- Knuth, Donald E. (1973). The Art of Computer Programming. Vol. 3. Addison-Wesley. ISBN: 978-0-201-03803-3.
- Kondrak, Grzegorz (2002). Algorithms for Language Reconstruction. Ph.D. thesis. University of Toronto.
- Kondrak, Grzegorz and Tarek Sherif (2006). “Evaluation of Several Phonetic Similarity Algorithms on the Task of Cognate Identification”. In: Proceedings of the COLING-ACL Workshop on Linguistic Distances. July 2006, pp. 43–50.
- Miller, Tristan (2016). Adjusting Sense Representations for Word Sense Disambiguation and Automatic Pun Interpretation. Dr.-Ing. thesis. Department of Computer Science, Technische Universität Darmstadt.
- Miller, Tristan and Iryna Gurevych (2015). “Automatic Disambiguation of English Puns”. In: The 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: Proceedings of the Conference (ACL–IJCNLP). Vol. 1. Stroudsburg, PA: Association for Computational Linguistics, July 2015, pp. 719–729. ISBN: 978-1-941643-72-3.
- Miller, Tristan and Mladen Turković (2016). “Towards the Automatic Detection and Identification of English Puns”. In: European Journal of Humour Research 4(1) (Jan. 2016), pp. 59–75. ISSN: 2307-700X.
- Philips, Lawrence (1990). “Hanging on the Metaphone”. In: Computer Language 7(12) (Dec. 1990).