Task Details

Given a pair of words, the task is to automatically measure their semantic similarity. All pairs in our datasets are scored on a [0-4] similarity scale, where 4 denotes that the two words are synonymous and 0 indicates that they are completely dissimilar. Below are three sample pairs with their human-assigned scores:

  • sunset - string: 0.05
  • computer science - mathematics: 3.1
  • automobile - car: 3.82
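
As a rough illustration of the scoring scheme, the sample pairs above could be held in a small Python structure that rejects scores outside the [0-4] scale. This is only a sketch: the names ScoredPair, MIN_SCORE, MAX_SCORE and SAMPLE_PAIRS are hypothetical and are not part of the task's data format.

```python
from dataclasses import dataclass

# Bounds of the similarity scale described above (0 = dissimilar, 4 = synonymous).
MIN_SCORE, MAX_SCORE = 0.0, 4.0


@dataclass(frozen=True)
class ScoredPair:
    """Hypothetical container for one word pair and its similarity score."""
    first: str
    second: str
    score: float

    def __post_init__(self):
        # Reject any score that falls outside the [0-4] scale.
        if not MIN_SCORE <= self.score <= MAX_SCORE:
            raise ValueError(
                f"score {self.score} is outside [{MIN_SCORE}, {MAX_SCORE}]"
            )


# The three sample pairs listed above, with their human-assigned scores.
SAMPLE_PAIRS = [
    ScoredPair("sunset", "string", 0.05),
    ScoredPair("computer science", "mathematics", 3.1),
    ScoredPair("automobile", "car", 3.82),
]
```

A system's output could be checked the same way before submission, since any score above 4 or below 0 raises a ValueError.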

The two words in a pair may belong to the same language (subtask 1, i.e., monolingual) or to two different languages (subtask 2, i.e., cross-lingual). All the datasets created for this task will follow the framework for the construction of monolingual and cross-lingual word similarity datasets proposed by Camacho-Collados et al. (ACL 2015).


Subtask 1: Multilingual word similarity


This subtask provides five monolingual word similarity datasets in English, German, Italian, Spanish and Farsi. The subtask is intended to test not only monolingual approaches but also multilingual and language-independent techniques. While monolingual approaches will be evaluated on the datasets for their corresponding languages, multilingual and language-independent techniques will also be given a global score (see Evaluation for more details). We included Farsi as an under-resourced language from a different family in order to provide a framework for models that do not rely on many external tools and can be effectively applied to less-resourced languages.



Subtask 2: Cross-lingual word similarity


In the cross-lingual word similarity subtask, each word pair is made up of words in two different languages (e.g., building-habitación). This subtask comprises ten cross-lingual word similarity datasets: EN-DE, EN-ES, EN-FA, EN-IT, DE-ES, DE-FA, DE-IT, ES-FA, ES-IT, and FA-IT. The subtask is intended to test bilingual and multilingual semantic representation techniques.


(EN: English, DE: German, IT: Italian, FA: Farsi, ES: Spanish)
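
The ten cross-lingual datasets correspond to every unordered pairing of the five task languages. The following Python sketch shows one way to enumerate them; the names LANGUAGES and CROSS_LINGUAL_PAIRS are hypothetical, and the language ordering is chosen only so that the output matches the list given above.

```python
from itertools import combinations

# Language codes and names as given in the key above.
# The insertion order (EN, DE, ES, FA, IT) is an assumption made so that
# the generated names come out in the same order as the dataset list.
LANGUAGES = {
    "EN": "English",
    "DE": "German",
    "ES": "Spanish",
    "FA": "Farsi",
    "IT": "Italian",
}

# Every unordered pair of distinct languages: C(5, 2) = 10 datasets.
CROSS_LINGUAL_PAIRS = [f"{a}-{b}" for a, b in combinations(LANGUAGES, 2)]

for name in CROSS_LINGUAL_PAIRS:
    print(name)
```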





José Camacho-Collados, Mohammad Taher Pilehvar and Roberto Navigli (2015) A Framework for the Construction of Monolingual and Cross-lingual Word Similarity Datasets. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), short papers, Beijing, China, July 27-29, pp. 1-7.

Contact Info

[*Contact persons]

collados [at] di.uniroma1.it
mp792 [at] cam.ac.uk

Join our Google Group:

Other Info