SemEval-2015 Task 13: Multilingual All-Words Sense Disambiguation and Entity Linking
Introduction
The automatic understanding of the meaning of text has been a major goal of research in computational linguistics and related areas for several decades, with ambitious challenges such as Machine Reading (Etzioni et al., 2006) and the quest for knowledge (Schubert, 2006). Two key Natural Language Processing tasks that need to be tackled as steps towards this goal are Word Sense Disambiguation (WSD) and Entity Linking (EL). WSD (Navigli, 2009) is a long-standing task aimed at explicitly assigning meanings to single-word and multi-word occurrences within text, and it is today more alive than ever in the research community. EL (Erbs et al., 2011; Cornolti et al., 2013; Rao et al., 2013) is a more recent task which aims at discovering mentions of entities within a text and linking them to the most suitable entry in a knowledge base. The two main differences between WSD and EL lie in the kind of inventory used (a dictionary for WSD, an encyclopedia for EL) and in the assumption made about the mention (complete in WSD, potentially partial in EL). For instance, a named entity such as “European Medicines Agency” may be referred to within a text as simply “Medicines Agency”, the meaning of which can nevertheless be inferred from the context. Notwithstanding these differences, the two tasks are similar in nature, in that they both involve the disambiguation of textual fragments according to a reference inventory. However, the research community has hitherto tended to tackle the two tasks separately, often duplicating efforts and solutions.
In contrast to this trend, research in knowledge acquisition is heading towards the seamless integration of encyclopedic and lexicographic knowledge within structured language resources (Hovy et al., 2013), and the main representative of this new direction is BabelNet (http://babelnet.org) (Navigli and Ponzetto, 2012). Such resources therefore provide a common ground for the two tasks of WSD and EL. Only very recently has a joint approach, called Babelfy (http://babelfy.org), been proposed for both WSD and EL (Moro et al., 2014).
Task description
In this task, our goal is to promote research in the direction of joint word sense and named entity disambiguation, so as to focus research efforts on the aspects that differentiate these two tasks without duplicating research on the problems they share. However, we will also allow systems that perform only one of the two tasks to participate, and even systems which tackle one particular setting of WSD, such as all-words sense disambiguation or disambiguation restricted to any subset of part-of-speech tags. Moreover, given the recent upsurge of interest in multilingual approaches, we will release our dataset in three different languages (English, Italian and Spanish) on parallel corpora which will be independently and manually annotated by different native/fluent speakers. In contrast to SemEval-2013 Task 12, Multilingual Word Sense Disambiguation (Navigli et al., 2013), our aim in this task is to present a dataset that covers both kinds of inventories (i.e., named entities and word senses) in the specific domain of biomedicine, in an attempt to further narrow the gap between research efforts regarding the dichotomy EL vs. WSD and those regarding the dichotomy open domain vs. closed domain (i.e., biomedical Information Extraction). For this reason we encourage submissions from all these lines of research, so that we can evaluate the distance between approaches that exploit both kinds of knowledge (i.e., lexicographic and encyclopedic) and approaches that work on both kinds of domain granularity (i.e., open and closed).
Word Senses and Named Entities inventory
The evaluation will use BabelNet 2.5, available at http://babelnet.org/, which contains Wikipedia pages (2012/10 dump), WordNet 3.0 synsets, OmegaWiki senses (2013/09 dump) and Open Multilingual WordNet synsets (2013/08 dump).
Input
Participating systems will be provided with a single file per language containing the considered documents in the following format:
<?xml version="1.0" encoding="UTF-8" ?>
<corpus lang="en">
<text id="d001">
<sentence id="d001.s001">
.
.
</sentence>
.
.
<sentence id="d001.s010">
<wf id="d001.s010.t001" pos="X">The</wf>
<wf id="d001.s010.t002" lemma="european" pos="J">European</wf>
<wf id="d001.s010.t003" lemma="medicine" pos="N">Medicines</wf>
<wf id="d001.s010.t004" lemma="agency" pos="N">Agency</wf>
<wf id="d001.s010.t005" pos="X">(</wf>
<wf id="d001.s010.t006" lemma="ema" pos="N">EMA</wf>
<wf id="d001.s010.t007" pos="X">)</wf>
<wf id="d001.s010.t008" lemma="be" pos="V">is</wf>
<wf id="d001.s010.t009" pos="X">a</wf>
<wf id="d001.s010.t010" lemma="european" pos="J">European</wf>
<wf id="d001.s010.t011" lemma="union" pos="N">Union</wf>
<wf id="d001.s010.t012" lemma="agency" pos="N">agency</wf>
<wf id="d001.s010.t013" pos="X">for</wf>
<wf id="d001.s010.t014" pos="X">the</wf>
<wf id="d001.s010.t015" lemma="evaluation" pos="N">evaluation</wf>
<wf id="d001.s010.t016" pos="X">of</wf>
<wf id="d001.s010.t017" lemma="medicinal" pos="J">medicinal</wf>
<wf id="d001.s010.t018" lemma="product" pos="N">products</wf>
<wf id="d001.s010.t019" lemma="." pos="X">.</wf>
</sentence>
.
.
</text>
<text id="d002">
.
.
</text>
</corpus>
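As an illustration of how a participating system might read this format, the following is a minimal Python sketch based on the standard library's xml.etree.ElementTree. The function name read_tokens, the dictionary keys and the shortened SAMPLE document are assumptions made for this example and are not part of the task distribution.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample that follows the structure shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<corpus lang="en">
<text id="d001">
<sentence id="d001.s010">
<wf id="d001.s010.t001" pos="X">The</wf>
<wf id="d001.s010.t002" lemma="european" pos="J">European</wf>
<wf id="d001.s010.t003" lemma="medicine" pos="N">Medicines</wf>
<wf id="d001.s010.t004" lemma="agency" pos="N">Agency</wf>
</sentence>
</text>
</corpus>"""


def read_tokens(xml_bytes):
    """Return the corpus language and a list of (sentence_id, tokens) pairs.

    Each token is a dict with the keys id, lemma, pos and form.
    The lemma is None for tokens that carry no lemma attribute.
    """
    root = ET.fromstring(xml_bytes)
    lang = root.get("lang")
    sentences = []
    for sentence in root.iter("sentence"):
        tokens = [
            {
                "id": wf.get("id"),
                "lemma": wf.get("lemma"),
                "pos": wf.get("pos"),
                "form": wf.text,
            }
            for wf in sentence.iter("wf")
        ]
        sentences.append((sentence.get("id"), tokens))
    return lang, sentences


# Parse from bytes so that the encoding declaration is honoured.
lang, sentences = read_tokens(SAMPLE.encode("utf-8"))
for sentence_id, tokens in sentences:
    print(lang, sentence_id, [t["form"] for t in tokens])
```

Keeping the token ids alongside the surface forms is convenient because the answer file refers to fragments by their first and last token id.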
Output
The participating systems will have to output a tab-separated values file formatted as follows (both ids are inclusive!):
start_id TAB end_id TAB babelnet_synset_id|Wikipedia_page_title|Wordnet_sensekey
For instance, for the above example an answer file can contain the following lines:
d001.s010.t002 d001.s010.t004 wiki:European_Medicines_Agency
d001.s010.t003 d001.s010.t003 bn:00054128n
d001.s010.t008 d001.s010.t008 wn:be%2:42:03::
For each resource, the appropriate prefix must be used: wiki: for a Wikipedia annotation, bn: for a BabelNet annotation and wn: for a WordNet annotation, precisely as shown in the example above. The content of this file must be all lowercase and in UTF-8 format, and all spaces within the third field must be replaced with underscores.
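The rules above can be sketched as a small Python helper that builds one answer line. The names PREFIXES and format_answer are hypothetical. The lowercase flag is also an assumption: the text asks for all-lowercase content, while the example answer lines keep the capitalization of the Wikipedia title, so this sketch lets the caller choose.

```python
# Prefix required for each resource, as listed in the task description.
PREFIXES = {"wikipedia": "wiki:", "babelnet": "bn:", "wordnet": "wn:"}


def format_answer(start_id, end_id, resource, label, lowercase=True):
    """Build one tab-separated answer line.

    start_id and end_id are the ids of the first and last token of the
    annotated fragment (both inclusive). label is a Wikipedia page title,
    a BabelNet synset id or a WordNet sense key.
    """
    prefix = PREFIXES[resource]
    # BabelNet ids are often already written with their prefix (bn:...).
    if not label.startswith(prefix):
        label = prefix + label
    # Spaces within the third field must be replaced with underscores.
    label = label.replace(" ", "_")
    line = "\t".join((start_id, end_id, label))
    return line.lower() if lowercase else line


# Write a small answer file in UTF-8 (the file name is illustrative).
answers = [
    ("d001.s010.t002", "d001.s010.t004", "wikipedia", "European Medicines Agency"),
    ("d001.s010.t003", "d001.s010.t003", "babelnet", "bn:00054128n"),
    ("d001.s010.t008", "d001.s010.t008", "wordnet", "be%2:42:03::"),
]
with open("answers.tsv", "w", encoding="utf-8") as out:
    for answer in answers:
        out.write(format_answer(*answer) + "\n")
```

A single-token annotation simply repeats the same token id in the first two fields.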
References
Bharath Dandala, Rada Mihalcea, and Razvan Bunescu. 2013. Multilingual Word Sense Disambiguation Using Wikipedia. In Proc. of IJCNLP, pages 498–506.
Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita. 2013. A framework for benchmarking entity-annotation systems. In Proc. of WWW, pages 249–260.
Nicolai Erbs, Torsten Zesch, and Iryna Gurevych. 2011. Link discovery: A comprehensive analysis. In Proc. of ICSC, pages 83–86.
Oren Etzioni, Michele Banko, and Michael J. Cafarella. 2006. Machine Reading. In Proc. of AAAI, pages 1517–1519.
Eduard H. Hovy, Roberto Navigli, and Simone P. Ponzetto. 2013. Collaboratively built semi-structured content and Artificial Intelligence: The story so far. Artificial Intelligence, 194:2–27.
Andrea Moro, Alessandro Raganato, and Roberto Navigli. 2014. Entity Linking meets Word Sense Disambiguation: a Unified Approach. Transactions of the Association for Computational Linguistics.
Roberto Navigli. 2009. Word Sense Disambiguation: A survey. ACM Computing Surveys, 41(2):1–69.
Roberto Navigli, David Jurgens, and Daniele Vannella. 2013. SemEval-2013 Task 12: Multilingual Word Sense Disambiguation. In Proc. of SemEval-2013, pages 222–231.
Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
Delip Rao, Paul McNamee, and Mark Dredze. 2013. Entity Linking: Finding Extracted Entities in a Knowledge Base. In Multi-source, Multilingual Information Extraction and Summarization, Theory and Applications of Natural Language Processing, pages 93–115. Springer Berlin Heidelberg.
Lenhart K. Schubert. 2006. Turing’s dream and the knowledge challenge. In Proc. of NCAI, pages 1534–1538.