Data and Tools

I. CodaLab *NEW*

The following development sets and instructions let you practice submitting your output to CodaLab. We encourage you to start testing the upload process right away. As a first step, you may want to upload the baseline provided in the zipped files.

II. English Training Data

III. Arabic Training Data

IV. Download Scripts

V. Test input:

VI. Arabic+English training data:


1. For English, we provide a default split of the data from previous years into training, development, and development-time testing datasets. Participants are free to use this data in any way they find useful when training and tuning their systems, e.g., use a different split, perform cross-validation, or train on all datasets.

2. For English, unlike in previous years, SemEval-2017 Task 4 had no progress testing, so all the provided data could be used for training and development.
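
As an illustration of the "different split" and cross-validation options mentioned above, the following is a minimal sketch in Python, not part of the official task tooling. It assumes the training, development and development-time testing examples have already been loaded and pooled into one list. The function name k_fold_splits and the (id, label) placeholder records are hypothetical and do not reflect the actual file format of the released data.

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, held_out) pairs; each example is held out exactly once."""
    if not 2 <= k <= len(examples):
        raise ValueError("k must be between 2 and the number of examples")
    shuffled = list(examples)
    # A fixed seed keeps the folds reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th example starting at position i.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, held_out


# Hypothetical placeholder records standing in for the pooled data.
pooled = [(str(n), "positive" if n % 2 else "negative") for n in range(10)]

for train, held_out in k_fold_splits(pooled, k=5):
    # Train on `train` and tune on `held_out` here.
    print(len(train), len(held_out))
```

Shuffling before splitting avoids folds that simply mirror the order in which the yearly datasets were concatenated.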



  • All training data can be found here.
  • The test data can be found here.
  • The gold labels, submissions and scores for all teams can be found here.
  • The task paper can be found here.

  author    = {Sara Rosenthal and Noura Farra and Preslav Nakov},
  title     = {{SemEval}-2017 Task 4: Sentiment Analysis in {T}witter},
  booktitle = {Proceedings of the 11th International Workshop on Semantic Evaluation},
  series    = {SemEval '17},
  month     = {August},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},

Contact Info

  • Sara Rosenthal, IBM Research
  • Noura Farra, Columbia University
  • Preslav Nakov, Qatar Computing Research Institute, HBKU

Other Info


  • Results and gold labels are released
  • Arabic and English TEST INPUT v1.0 for phase 2 (subtasks B, D) released
  • Arabic and English TEST INPUT v3.0 for phase 1 (subtasks A, C, E) released
  • Arabic and English training data released
  • CodaLab development sets on Data and Tools page