Task Description < SemEval-2016 Task 14

Task Description

Task 14 aims to enrich the WordNet taxonomy with new words and word senses. For a word sense that is not already defined in the WordNet sense inventory, a system in this task has to identify either:

the WordNet synset that is a generalization of the new word sense (i.e., its hypernym), or
the WordNet synset whose word senses are synonyms to the new word sense.

To participate in the task, a system is provided with a specific word sense, i.e., a word together with its definition. The system's task is to identify the WordNet synset into which the new word sense should be merged (i.e., the term is synonymous with those in the synset) or to which it should be added as a hyponym (i.e., the new word sense is a specialization of an existing word sense).

Additionally, each team may submit two kinds of systems:

Resource-aware: the system can use any dictionary, including the one from which the target word sense has been obtained, e.g., Wiktionary.
Constrained: the system may use any resource other than dictionaries.

We allow up to three submissions per system type so that teams can explore different configurations, features, or parameter settings in the official rankings. If a team has multiple different systems of the same type (e.g., different software, very different resources, etc.), it may submit up to three submissions for each of these systems.

The following table gives examples of word senses that are not defined in WordNet and their corresponding operations, illustrating the type of data that might be seen in the task.

OOV word	Target synset	Operation
geoscience#n - Any of several sciences that deal with the Earth	earth_science -- (any of the sciences that deal with the earth or its parts)	MERGE
mudslide#n - a mixed drink consisting of vodka, Kahlua and Bailey's.	cocktail -- a short mixed drink	ATTACH
unilingual#a - knowing, or using a single language	monolingual -- (using or knowing only one language)	MERGE
euthanize#v - To submit a person or animal to euthanasia.	destroy, put down -- put (an animal) to death	MERGE
changing_room#n - A room, especially in a gym, designed for people to change their clothes.	dressing_room -- a room in which you can change clothes	MERGE
tensible#a - Capable of being extended or drawn out; ductile; tensible.	ductile, malleable, pliable, pliant, tensile, tractile -- (capable of being shaped or bent or drawn out)	MERGE
Apple#n - an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics, computer software, online services, and personal computers.	corporation, corp -- (a business firm whose articles of incorporation have been approved in some state)	ATTACH
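
The rows above can be read as pairs of an input word sense and an expected placement. The following Python sketch shows one way to represent them; it is an illustration only, not the official data format. The names NewSense, Placement, Operation and parse_new_sense are hypothetical, and the "lemma#pos - definition" layout is assumed from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    MERGE = "MERGE"    # new sense is a synonym of the target synset
    ATTACH = "ATTACH"  # new sense is a hyponym of the target synset


@dataclass(frozen=True)
class NewSense:
    lemma: str       # e.g. "geoscience"
    pos: str         # part-of-speech tag after '#', e.g. "n", "v", "a"
    definition: str  # the gloss given to the system


@dataclass(frozen=True)
class Placement:
    target_synset: str  # free-form synset label, as in the table
    operation: Operation


def parse_new_sense(cell: str) -> NewSense:
    """Parse a cell laid out as 'lemma#pos - definition' (assumed layout)."""
    key, _, definition = cell.partition(" - ")
    lemma, _, pos = key.strip().rpartition("#")
    if not lemma or not pos:
        raise ValueError(f"expected 'lemma#pos - definition', got: {cell!r}")
    return NewSense(lemma=lemma, pos=pos, definition=definition.strip())


sense = parse_new_sense("mudslide#n - a mixed drink consisting of vodka, Kahlua and Bailey's.")
gold = Placement(target_synset="cocktail", operation=Operation("ATTACH"))
```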

Evaluation

Systems will be evaluated according to two criteria:

their ability to correctly identify the attachment/merge point in the WordNet hierarchy for a new sense,
the percentage of items that they attach or merge.

For the first criterion, a system’s automatically-made attachment to the WordNet hierarchy is expected to be as close as possible to the correct attachment point given by the gold-standard data. The participating systems are evaluated in an in-vitro framework in which performance is measured as a function of the distance between the correct attachment point and the one output by the system.

We will use Wu and Palmer’s (WuP) similarity measure, defined as WuP = 2 * DepthLCS / (Depth1 + Depth2), where Depth1 and Depth2 are the depths of the two concepts in WordNet’s subsumption hierarchy (hypernymy/hyponymy relations) and DepthLCS is the depth of their least common subsumer, i.e., the most specific concept that is an ancestor of both concepts. For each instance in the test set, we compute the WuP similarity between the attachment point output by the system and the corresponding correct point. A well-performing attachment system is expected to have a high overall similarity score, aggregated over all instances in the test set.
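
As an illustration of the WuP formula, the following Python sketch computes it on a small hand-made hierarchy. It is not the official scorer. The HYPERNYM table is hypothetical toy data rather than real WordNet content, each node is assumed to have a single hypernym (WordNet allows several), and the root is assumed to have depth 1.

```python
# Toy hypernym hierarchy (child -> parent); hypothetical, not real WordNet data.
HYPERNYM = {
    "entity": None,
    "substance": "entity",
    "beverage": "substance",
    "alcohol": "beverage",
    "cocktail": "alcohol",
    "wine": "alcohol",
}


def path_to_root(node):
    """Return [node, parent, ..., root] by following hypernym links."""
    path = []
    while node is not None:
        path.append(node)
        node = HYPERNYM[node]
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1 (assumed convention)."""
    return len(path_to_root(node))


def least_common_subsumer(a, b):
    """Most specific node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest one.
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node
    raise ValueError("the two nodes share no ancestor")


def wup(a, b):
    """Wu and Palmer similarity: 2 * DepthLCS / (Depth1 + Depth2)."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("cocktail", "wine"))      # siblings under "alcohol"
print(wup("cocktail", "beverage"))  # "beverage" is an ancestor of "cocktail"
```

In this toy hierarchy, an answer that lands on a sibling of the gold synset scores 2 * 4 / (5 + 5) = 0.8, while an exact match scores 1.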

The second criterion recognizes that some items may be more difficult to place in the WordNet hierarchy than others, for a variety of reasons. Therefore, we allow a system to decline to place such senses in order to avoid reporting placements that it believes would be inaccurate. As an evaluation metric, we report recall as the percentage of items that were placed.

Using the two metrics proposed above, WuP and Recall, systems will also be evaluated on their performance with different categories of terms (e.g., technical, slang, named entities) and on their ability to distinguish between MERGE and ATTACH operations for a new sense. A final ranking of all teams’ systems will be computed from the F1 score, i.e., the harmonic mean, of WuP and Recall.
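
The way the two metrics combine can be sketched as follows. This is a minimal illustration and not the official scorer: the function name score is hypothetical, and averaging WuP only over the items a system chose to place is an assumption, since the description above does not state how unplaced items enter the aggregate.

```python
def score(wup_by_item, total_items):
    """Return (mean WuP, recall, F1) for one system run.

    wup_by_item: dict mapping each placed item to its WuP score against gold.
    total_items: number of items in the test set, placed or not.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    placed = len(wup_by_item)
    # Recall as described in the text: fraction of items that were placed.
    recall = placed / total_items
    # Assumption: WuP is averaged over placed items only.
    mean_wup = sum(wup_by_item.values()) / placed if placed else 0.0
    # F1 is the harmonic mean of the two quantities.
    denom = mean_wup + recall
    f1 = 2 * mean_wup * recall / denom if denom else 0.0
    return mean_wup, recall, f1


# Hypothetical run: three of four items placed, one declined.
mean_wup, recall, f1 = score({"geoscience#n": 1.0, "mudslide#n": 0.5, "Apple#n": 0.6}, 4)
print(f"WuP={mean_wup:.3f} Recall={recall:.3f} F1={f1:.3f}")
```

Under this reading, declining a hard item lowers Recall but can raise the mean WuP, and the harmonic mean rewards systems that balance the two.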

We will provide an official scorer for the task. The official evaluation and ranking of the systems will be for the "resource-aware" system type.
