About the Data:

Participants will be presented with a training and a test set. The training set will be composed of 2,237 instances, and the test set of 88,221. The data was collected through a survey, in which 400 annotators were presented with several sentences and asked to select the words whose meaning they did not understand. The training set is composed of the judgments of 20 distinct annotators over a set of 200 sentences, while the test set is composed of the judgments made over 9,000 sentences by only one annotator. In the training set, a word is considered to be complex if at least one of the 20 annotators judged it so. With this setup, we create a scenario that replicates one of the biggest challenges in Lexical Simplification: predicting the vocabulary limitations of individuals based on the overall limitations of a group to which they belong. Information about the data's format is given in the README.txt included in the training and test sets.
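The training-set labeling rule described above (a word is complex if at least one of the 20 annotators marked it) can be sketched as follows. This is only an illustration: the function name and the in-memory lists are hypothetical, and the actual file layout is the one documented in README.txt, not the one assumed here.

```python
from typing import Sequence


def aggregate_label(judgments: Sequence[int]) -> int:
    """Return 1 if at least one annotator marked the word as complex, else 0.

    `judgments` is assumed to hold one 0/1 value per annotator
    (hypothetical in-memory form, not the README.txt file format).
    """
    return int(any(j == 1 for j in judgments))


# Hypothetical judgments from 20 annotators for two words.
word_a = [0] * 19 + [1]  # a single annotator found the word complex
word_b = [0] * 20        # nobody found the word complex

print(aggregate_label(word_a))
print(aggregate_label(word_b))
```

Under this rule a single judgment out of 20 is enough to make a training word complex, whereas each test label reflects one annotator only.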



For each system, participants must submit a plain text file containing the same number of lines as the test set. Each line of the submission file must be a binary label: 1 if the word in that instance is complex, or 0 otherwise.
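A minimal sketch of producing and checking a file in this format is shown below. The helper names are hypothetical and are not part of any official tooling; the only assumptions taken from the description are one label per line, labels restricted to 0 and 1, and a line count equal to the number of test instances.

```python
from typing import Iterable


def format_submission(labels: Iterable[int]) -> str:
    """Render predictions as submission text: one 0/1 label per line."""
    lines = []
    for label in labels:
        if label not in (0, 1):
            raise ValueError(f"labels must be 0 or 1, got {label!r}")
        lines.append(f"{int(label)}\n")
    return "".join(lines)


def check_submission(text: str, n_test_instances: int) -> bool:
    """Return True if the text has exactly one 0/1 label per test instance."""
    lines = text.splitlines()
    return len(lines) == n_test_instances and all(
        line in ("0", "1") for line in lines
    )


# Hypothetical predictions for a three-instance test set.
submission = format_submission([1, 0, 1])
print(check_submission(submission, 3))
```

Running such a check before uploading is a cheap way to avoid a submission being rejected for its format.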

To submit your system, please visit the submission page.

Once you have uploaded your system(s), please send an e-mail to using the following template:

Title: System Description - CWI SemEval 2016
Team Name:
System 1 Name:
System 2 Name:

The descriptions can have a maximum of 150 words each.
You must send your system and system descriptions by January 31 in order to qualify.



  • You may use any external resources that you want.
  • Each participant can submit a maximum of 2 systems.
  • Participants who submit more than 2 systems will be disqualified.
  • The format of the submission must conform to the standards specified.
  • Submissions in the wrong format will be disqualified.



The evaluation metric used to rank the submitted systems will be the G-score, which is the harmonic mean of Accuracy and Recall. By using Accuracy instead of the traditional Precision, we reward systems that identify as many complex words as possible while still correctly predicting the complexity of all words, not only the complex ones. We will also report the results for Precision and the traditional F-score.
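The metrics above can be computed as in the following sketch. It is not the official scorer: the function names are hypothetical, the "traditional F-score" is assumed to be the harmonic mean of Precision and Recall, and scores with a zero denominator are assumed to be 0.

```python
from typing import Dict, Sequence


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two values; defined as 0 when both are 0."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def evaluate(gold: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Compute Accuracy, Recall, Precision, G-score and F-score for 0/1 labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0

    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "g_score": harmonic_mean(accuracy, recall),    # ranking metric
        "f_score": harmonic_mean(precision, recall),   # reported only
    }


# Hypothetical gold labels and predictions for four instances.
scores = evaluate(gold=[1, 1, 0, 0], predicted=[1, 0, 0, 0])
print(scores)
```

In this toy case Accuracy is 0.75 and Recall is 0.5, so the G-score is 2 × 0.75 × 0.5 / 1.25 = 0.6. A system that labels every word as complex reaches a Recall of 1 but is held back by its low Accuracy, which is the trade-off the G-score is meant to capture.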





Published on February 6th, 2016