Evaluation
The latest version of the test data can be downloaded from HERE (updated January 10 with some minor fixes).
Updated evaluation datasets (Jan 10)
Although the fix affects only a very small portion of the datasets, please make sure to test on the latest version, as the ordering of the pairs in the cross-lingual datasets may have changed since the previous version.
The evaluation period started on Monday, January 9 and will end on Monday, January 30 (23:59 GMT-9).
To submit results, you must register on the evaluation platform CodaLab and upload your results there. To submit for either of the two subtasks, go to the CodaLab competition page of that subtask, click on “Participate->Submit”, fill out some basic information about the submission, and upload a .zip file containing the result file(s) (no intermediate folder required). Please read the submission instructions below carefully and check the CodaLab Frequently Asked Questions if needed.
Subtask 1: Monolingual word similarity
CodaLab competition page for Subtask 1: https://competitions.codalab.org/competitions/15961
- The .zip file must contain at least 1 and at most 5 .txt files, one for each of the monolingual datasets (languages) on which you plan to evaluate your system.
- The output files should be named “[language].output.txt”, where [language] is the two-letter code of the corresponding language (en: English, de: German, it: Italian, fa: Farsi, es: Spanish). For example, if you plan to evaluate your system on the English and German datasets only, you should compress two files named “en.output.txt” and “de.output.txt” into a single .zip file.
- The output files should contain 500 lines, each containing a similarity score for the word pair in the test dataset of the corresponding language. The format is the same as the gold-standard, as found in the keys included in the trial data.
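The packaging step above can be sketched as follows. This is a minimal illustration only, not an official tool: the function name `package_monolingual` and the directory layout are assumptions, and only the Python standard library is used. Passing `arcname` places each file at the root of the archive, which matches the "no intermediate folder" requirement.

```python
import zipfile
from pathlib import Path

# Two-letter language codes listed for Subtask 1.
LANGUAGES = ("en", "de", "it", "fa", "es")


def package_monolingual(output_dir, zip_path):
    """Zip every '[language].output.txt' found in output_dir (hypothetical helper).

    Returns the list of file names that were added to the archive.
    """
    output_dir = Path(output_dir)
    candidates = [output_dir / f"{lang}.output.txt" for lang in LANGUAGES]
    files = [f for f in candidates if f.is_file()]
    if not files:
        raise ValueError("no '[language].output.txt' files found")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for f in files:
            # arcname=f.name stores the file at the archive root,
            # so the .zip contains no intermediate folder.
            archive.write(f, arcname=f.name)
    return [f.name for f in files]
```

For instance, `package_monolingual("outputs", "submission.zip")` would pick up only the language files that exist in the (hypothetical) `outputs` directory.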
Subtask 2: Cross-lingual word similarity
CodaLab competition page for Subtask 2: https://competitions.codalab.org/competitions/15962
- The .zip file must contain at least 1 and at most 10 .txt files, one for each of the cross-lingual datasets (language pairs) on which you plan to evaluate your system.
- The output files should be named “[pair].output.txt”, where [pair] is the pair of two-letter codes of the corresponding languages:
    de-es: German-Spanish (956 pairs)
    de-fa: German-Farsi (888 pairs)
    de-it: German-Italian (912 pairs)
    en-de: English-German (914 pairs)
    en-es: English-Spanish (978 pairs)
    en-fa: English-Farsi (952 pairs)
    en-it: English-Italian (970 pairs)
    es-fa: Spanish-Farsi (914 pairs)
    es-it: Spanish-Italian (967 pairs)
    it-fa: Italian-Farsi (916 pairs)
- For example, if you plan to evaluate your system on the German-Farsi, Spanish-Italian, and English-German language pairs, you should compress three files named “de-fa.output.txt”, “es-it.output.txt”, and “en-de.output.txt” into a single .zip file.
- For each language pair, the output file must contain exactly the same number of lines as the corresponding test dataset. Each line should contain a similarity score for the word pair in the same line in the test dataset. The format is the same as the gold-standard, as found in the keys included in the trial data.
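Before uploading, the naming and line-count requirements above can be checked locally. The sketch below is only an illustration: the helper `check_crosslingual_file` is hypothetical, the pair counts are copied from the list above, and it assumes that each line of an output file holds a single numeric score (please confirm the exact format against the keys in the trial data).

```python
from pathlib import Path

# Number of word pairs per cross-lingual dataset, as listed above.
EXPECTED_PAIRS = {
    "de-es": 956, "de-fa": 888, "de-it": 912, "en-de": 914, "en-es": 978,
    "en-fa": 952, "en-it": 970, "es-fa": 914, "es-it": 967, "it-fa": 916,
}
SUFFIX = ".output.txt"


def check_crosslingual_file(path):
    """Return a list of problems found in one '[pair].output.txt' file."""
    path = Path(path)
    if not path.name.endswith(SUFFIX):
        return [f"{path.name}: name must end with {SUFFIX}"]
    pair = path.name[: -len(SUFFIX)]
    if pair not in EXPECTED_PAIRS:
        return [f"{path.name}: unknown language pair '{pair}'"]
    problems = []
    lines = path.read_text(encoding="utf-8").splitlines()
    if len(lines) != EXPECTED_PAIRS[pair]:
        problems.append(
            f"{path.name}: expected {EXPECTED_PAIRS[pair]} lines, found {len(lines)}"
        )
    for number, line in enumerate(lines, start=1):
        try:
            float(line)  # Assumption: one numeric score per line.
        except ValueError:
            # Empty lines also end up here, since float("") fails.
            problems.append(f"{path.name}: line {number} is not a number")
    return problems
```

An empty list means the file passed these basic checks; it says nothing about the quality of the scores themselves.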
 
Additional information
The output format is the same for both Subtask-1 and Subtask-2. Each answer file corresponds to an input dataset and should have the same number of lines as the input dataset. Each line corresponds to the similarity score of the pair appearing in the same line of the input dataset (sample answer files available in the trial data). Important points to take into account:
- You do not need to run tests on all the datasets. Please provide answer files only for the datasets on which you plan to evaluate your model.
- Note that while the monolingual datasets will have a fixed length of 500 word pairs each, the cross-lingual datasets vary in size.
- Since Pearson and Spearman correlation measures are not sensitive to the similarity scale, any consistent similarity scale can be used (e.g., [0, 4], [-1, 1], [0, 1]).
- There should be no empty lines in the system output files. If your system does not cover a certain word in a pair, we recommend setting the score to the midpoint of your similarity scale (for example, 0.5 on the [0, 1] scale).
- Each team will only be allowed to submit a maximum of two systems/runs.
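The points above about line alignment and uncovered words can be sketched as follows. This is a minimal illustration under stated assumptions: `write_scores` is a hypothetical helper, `similarity` stands for your own model and is assumed to return `None` when a word is not covered, and one score is written per line.

```python
def write_scores(pairs, similarity, path, scale=(0.0, 1.0)):
    """Write one score per input pair, in input order, with no empty lines.

    pairs:      list of (word1, word2) tuples, in the order of the test dataset
    similarity: callable returning a float, or None for an uncovered pair
                (assumed interface of your own system)
    scale:      (low, high) bounds of the similarity scale in use
    """
    midpoint = (scale[0] + scale[1]) / 2
    with open(path, "w", encoding="utf-8") as out:
        for word1, word2 in pairs:
            score = similarity(word1, word2)
            if score is None:
                # Recommended fallback: midpoint of the similarity scale.
                score = midpoint
            out.write(f"{score}\n")
```

Because every input pair produces exactly one output line, the answer file keeps the same number of lines as the input dataset.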
Rankings
Participating systems will be evaluated according to standard Pearson and Spearman correlation measures on all word similarity datasets, with the final score being calculated as the harmonic mean of Pearson and Spearman correlations. Systems will be allowed to participate in Subtask-1, Subtask-2, or both.
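To make the scoring concrete, here is a rough sketch of the harmonic mean of Pearson and Spearman correlations in plain Python. It is not the official scorer: the function names are illustrative, tied values receive average ranks (a common convention, assumed here), constant score lists are not handled, and returning 0.0 when a correlation is not positive is an assumption of this sketch, since the harmonic mean is not meaningful in that case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


def individual_score(gold, system):
    """Harmonic mean of Pearson and Spearman on one dataset."""
    r, rho = pearson(gold, system), spearman(gold, system)
    if r <= 0 or rho <= 0:
        return 0.0  # Assumption of this sketch, not a documented rule.
    return 2 * r * rho / (r + rho)
```

Rescaling the system scores linearly (for example from [0, 4] to [0, 1]) leaves both correlations, and therefore the score, unchanged.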
Subtask 1: Monolingual word similarity
For subtask 1, systems may be monolingual (only applicable to a single language) or multilingual (applicable to different languages). Both monolingual and multilingual systems will be ranked in their respective language datasets. The individual score is calculated as the harmonic mean of Pearson and Spearman correlations on the corresponding dataset.
In addition to the individual score, multilingual and language-independent approaches will be given a global ranking if they provide outputs for at least four languages. The final global score for a system will be calculated by averaging the final individual scores on the four languages for which the system performed best.
Subtask 2: Cross-lingual word similarity
For the cross-lingual word similarity subtask, participating systems need only provide scores for a single cross-lingual dataset to be considered in the corresponding dataset ranking. The score of a system on an individual cross-lingual dataset is calculated as the harmonic mean of Pearson and Spearman correlations.
Additionally, to be considered for the global ranking, multilingual approaches must provide results for at least six cross-lingual word similarity datasets. For each system, the global score will be calculated as the average of the individual scores on the six cross-lingual datasets on which it performs best.
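The global score described for both subtasks follows the same pattern: average the best individual scores, four for Subtask 1 and six for Subtask 2. A minimal sketch, assuming a hypothetical helper `global_score` that receives a mapping from dataset name to individual score:

```python
def global_score(individual_scores, required):
    """Average of the `required` best individual scores.

    individual_scores: dict mapping dataset name -> individual score
    required:          4 for Subtask 1, 6 for Subtask 2
    Returns None when too few datasets were submitted for a global ranking.
    """
    if len(individual_scores) < required:
        return None
    best = sorted(individual_scores.values(), reverse=True)[:required]
    return sum(best) / required
```

For example, a Subtask 1 system with individual scores of 0.8, 0.7, 0.6, 0.5, and 0.4 on five languages would be averaged over its four best scores only, and a system covering three languages would receive no global score.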