Data and Tools < SemEval-2017 Task 2

Data and Tools

Download both trial and test data (including gold keys, uploaded Feb 9, 2017)

NEW! Following several requests, you can additionally download an unofficial subset of all datasets that excludes multiwords (i.e. includes only unigrams): Download this version here. Please note that all results in the task description paper refer to the official, complete dataset.

Download test data (Updated Jan 10, 2017 with some minor fixes): Test data for all languages (subtasks 1 and 2).

Download trial data (Updated Dec 23, 2016 with a minor scorer bug fixed for negative correlations): Trial data for all the languages, including gold keys and an evaluation script.
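The trial package ships its own evaluation script, and that script should be used for any reported result. As a rough illustration of how system scores can be compared against gold keys, the sketch below computes Pearson and Spearman correlations in plain Python. This page does not state which measure the official scorer uses, so the choice of these two correlations is an assumption (they are common in word similarity evaluation), and the function names and the sample scores are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank values from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical gold ratings and system scores for five word pairs.
gold = [4.0, 3.0, 2.0, 1.0, 0.0]
system = [0.9, 0.7, 0.4, 0.5, 0.1]
print("Pearson:", pearson(gold, system))
print("Spearman:", spearman(gold, system))
```

System scores do not need to be on the [0-4] scale for this comparison, since both correlations are unaffected by positive linear rescaling of the scores.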

This task does not provide any annotated training data. However, we encourage all corpus-based systems to use a common unlabeled corpus for training (see below for more details). The evaluation data consists of the following datasets:

Subtask 1: Five monolingual word similarity datasets of 500 word pairs each.
Subtask 2: Ten cross-lingual word similarity datasets in the range of 750-1000 word pairs each.

All the datasets are tab-separated, each line corresponding to a nominal pair: word1<tab>word2
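A minimal sketch of reading this format in Python, assuming UTF-8 text with exactly two tab-separated fields per line. The function name `read_word_pairs` and the inline sample are illustrative only and are not part of the released tools.

```python
import csv
import io

def read_word_pairs(handle):
    """Yield (word1, word2) tuples from a tab-separated text stream."""
    # QUOTE_NONE keeps any quote characters inside words untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated fields, got {len(row)}: {row!r}")
        yield row[0], row[1]

# Made-up sample; a multiword item keeps its internal space because only
# tabs separate the two fields.
sample = "midday\tnoon\nlion\tzebra\noperating system\tsoftware\n"
pairs = list(read_word_pairs(io.StringIO(sample)))
print(pairs)
```

For a real dataset file, the same function can be applied to a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` points to one of the downloaded files.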

Rating Scale

Word pairs will be rated according to their similarity (not to be confused with relatedness) on a [0-4] rating scale, which was used in the SemEval 2014 task on Cross-Level Semantic Similarity (Jurgens et al., 2014) and adapted to the word similarity task in Camacho-Collados et al. (2015). The scale is designed to systematically order a broad range of semantic relations: synonymy, similarity, relatedness, topical association, and unrelatedness. The rating scale is summarized by the following guidelines:

4: Very similar -- The two words are synonyms (e.g., midday-noon).
3: Similar -- The two words share many of the important ideas of their meaning but include slightly different details. They refer to similar but not identical concepts (e.g., lion-zebra).
2: Slightly similar -- The two words do not have a very similar meaning, but share a common topic/domain/function and ideas or concepts that are related (e.g., house-window).
1: Dissimilar -- The two items describe clearly dissimilar concepts, but may share some small details, a distant relationship, or a common domain, and are likely to be found together in a longer document on the same topic (e.g., software-keyboard).
0: Totally dissimilar and unrelated -- The two items do not mean the same thing and are not on the same topic (e.g., pencil-frog).
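The guidelines above can be held in a small lookup table, which is convenient when inspecting scores by hand. The sketch below is only an illustration: it assumes that a score may be fractional (for example, an average over several annotators, which this page does not state), and the half-up rounding and the name `nearest_label` are arbitrary choices made for the example.

```python
# Labels copied from the rating guidelines above.
RATING_SCALE = {
    4: "Very similar",
    3: "Similar",
    2: "Slightly similar",
    1: "Dissimilar",
    0: "Totally dissimilar and unrelated",
}

def nearest_label(score):
    """Return the guideline label closest to a score in the range [0, 4]."""
    if not 0 <= score <= 4:
        raise ValueError(f"score {score!r} is outside the [0-4] scale")
    level = int(score + 0.5)  # round half up; an illustrative choice
    return RATING_SCALE[level]

print(nearest_label(3.6))
print(nearest_label(0.2))
```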

Training Corpus

No new training data will be released for this task. However, given the growing importance of corpus-based techniques (e.g. monolingual and bilingual word embeddings) for semantic representation and similarity tasks in the NLP community, we provide an additional, separate ranking for these approaches. To mitigate the effect that the underlying training corpus can have on the quality of the learned representations (and hence on final performance), we propose a fair comparison among corpus-based models trained on the same corpus. To be considered in this category, participants must therefore train their models on the benchmark corpus (any preprocessing of the corpus is allowed):

The common corpus for subtask 1 will be the Wikipedia corpus corresponding to the given language. Tokenized Wikipedia dumps in text format for all the languages considered in the evaluation are available at https://sites.google.com/site/rmyeid/projects/polyglot

The common corpus for subtask 2 will be the Europarl parallel corpus, which is available for all languages except Farsi. You can download it from http://opus.lingfil.uu.se/Europarl.php. For Farsi and any other language you can use the OpenSubtitles2016 parallel corpora: http://opus.lingfil.uu.se/OpenSubtitles2016.php. Note that these corpora may not fully cover the test items, so you may additionally consider exploiting other types of corpora (e.g. Wikipedia as a comparable corpus).

Other corpora may also be used for training, but in that case the system will be ranked in the general category only and will not be considered for the shared-corpus ranking. Please indicate in your submission whether you will participate in the shared-corpus category (participants in this category will also be considered for the general ranking).
