This page will be regularly updated with information on the evaluation.


Evaluation Measures

The training dataset comes with a scorer that implements a different measure for each subtask.

All tasks will be evaluated on average F-score.


Task 1

The final score for Task 1 is obtained through several successive averages. The F-score is first computed for each category C:

F_C = (2*Prec_C*Rec_C)/(Prec_C+Rec_C)

           where Prec_C = Correct_C / Retrieved_C
           and Rec_C = Correct_C / Reference_C


Each category belongs to either the Syntax or the Semantic layer. The score of a layer, S_layer, is the average F-score of the categories in that layer:


S_layer_l = Sum(F_C_l) / n_C_l


and the score for each verb is the average of the layers:


S_verb = Sum(S_layer) / n_layer


and the final score for the task is the average of all verb scores:


Score_Task1 = Sum(S_verb) / n_verb
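
As an illustration, the chain of averages above can be sketched in Python. This is a minimal sketch, not the official scorer: the function names, the nested dictionary layout, the verb and category labels, and the choice to return 0 when a denominator is 0 are all assumptions made for the example.

```python
def f_score(correct, retrieved, reference):
    """F_C from the three counts of one category.
    Assumption: an empty denominator yields 0 rather than an error."""
    prec = correct / retrieved if retrieved else 0.0
    rec = correct / reference if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def mean(values):
    """Arithmetic mean of a sequence; 0 for an empty sequence (assumption)."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def task1_score(counts):
    """counts: {verb: {layer: {category: (correct, retrieved, reference)}}}
    (hypothetical layout chosen for this sketch)."""
    verb_scores = []
    for layers in counts.values():
        # S_layer: average of F_C over the categories of the layer
        layer_scores = [
            mean(f_score(*c) for c in categories.values())
            for categories in layers.values()
        ]
        # S_verb: average of the layer scores
        verb_scores.append(mean(layer_scores))
    # Score_Task1: average of the verb scores
    return mean(verb_scores)


# Illustrative counts only; these are not taken from the task data.
example_counts = {
    "verb_a": {
        "syntax": {"subject": (8, 10, 10), "object": (5, 10, 5)},
        "semantics": {"Human": (4, 8, 8)},
    },
    "verb_b": {
        "syntax": {"subject": (10, 10, 10)},
        "semantics": {"Food": (0, 5, 5)},
    },
}
print(round(task1_score(example_counts), 4))
```

With these made-up counts, verb_a scores 37/60 and verb_b scores 0.5, so the task score is 67/120 (about 0.5583).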


Task 2

Task 2 uses the B-cubed F-score (see Bagga and Baldwin, 1998; Amigó et al., 2009), which differs from the standard F-score in the way it computes Precision and Recall. Both are obtained with the same B^3 measure; they differ only in that the gold reference and the candidate run are interchanged. Thus,


Prec = B^3_gold,

Rec = B^3_run,

F-score = (2*Prec*Rec) / (Prec+Rec)


B^3 is a measure of cluster concordance. For each data point, it counts the pairs of data points that the two clusterings have in common and normalizes this count by the size of the cluster that contains the data point. In other words, it provides a precision and a recall value for each data point i. The global Precision and Recall are obtained by averaging over all data points.


B^3_gold_i = number of pairs in run cluster found in gold / total number of pairs in run cluster

B^3_run_i = number of pairs in gold cluster found in run / total number of pairs in gold cluster

Note: the number of pairs also includes the case where a pair is made of the same data point.


Prec = average(B^3_gold_i)

Rec = average(B^3_run_i)

F_C_verb = (2*Prec*Rec)/(Prec+Rec)


and the final score is


Score_Task2 = Sum(F_C_verb) / n_verb
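
The per-verb B-cubed computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the official scorer: the function name, the input format (one dictionary per clustering, mapping each data point to a cluster label) and the requirement that both clusterings cover the same data points are choices made for the example.

```python
from collections import defaultdict


def bcubed(gold, run):
    """Return (Prec, Rec, F) for one verb.
    gold, run: {data_point: cluster_label}, assumed to share the same keys."""

    def clusters(assignment):
        # Group data points by cluster label.
        grouped = defaultdict(set)
        for item, label in assignment.items():
            grouped[label].add(item)
        return grouped

    gold_clusters = clusters(gold)
    run_clusters = clusters(run)

    prec_sum = rec_sum = 0.0
    for item in gold:
        g = gold_clusters[gold[item]]
        r = run_clusters[run[item]]
        # Pairs (item, other) found in both clusterings; the pair
        # (item, item) is included, as stated in the note above.
        shared = len(g & r)
        prec_sum += shared / len(r)  # B^3_gold_i
        rec_sum += shared / len(g)   # B^3_run_i

    n = len(gold)
    prec = prec_sum / n
    rec = rec_sum / n
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f


# Illustrative clusterings only; labels are arbitrary.
gold = {"a": 1, "b": 1, "c": 2, "d": 2}
run = {"a": "x", "b": "x", "c": "x", "d": "y"}
print(bcubed(gold, run))
```

For this toy input, Prec is 2/3, Rec is 3/4 and the F-score is 12/17 (about 0.706). Score_Task2 would then be the mean of these per-verb F-scores.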


Task 3

Task 3 enumerates the errors made by the system to output an F-score for each run based on the following error rate:


Error rate = Total number of slot errors / Total number of slots in the reference
                  = (Substitution + Deletion + Insertion) / (Correct + Substitution + Deletion)


In order to derive an F-score, we compute the following Precision and Recall measures:


Precision = Correct / (Correct + Substitution + Insertion)
Recall = Correct / (Correct + Substitution + Deletion)
F1_verb = (2*Prec*Rec) / (Rec+Prec)


This measure is known to "deweight" the effect of Deletion (D) and Insertion (I) errors (see Makhoul et al., 1999).
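
For concreteness, the error rate, Precision, Recall and F1 defined above can be computed from the four counts as in the sketch below. The function name and the handling of empty denominators (returning 0) are assumptions made for the example; the counts are floats because partial credit can produce fractional values.

```python
def slot_measures(correct, substitution, deletion, insertion):
    """Return (error_rate, precision, recall, f1) for one verb.
    Assumption: any measure with an empty denominator is set to 0."""
    reference_slots = correct + substitution + deletion
    retrieved_slots = correct + substitution + insertion

    errors = substitution + deletion + insertion
    error_rate = errors / reference_slots if reference_slots else 0.0

    precision = correct / retrieved_slots if retrieved_slots else 0.0
    recall = correct / reference_slots if reference_slots else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return error_rate, precision, recall, f1


# Illustrative counts only: 6 correct, 2 substitutions, 2 deletions, 1 insertion.
print(slot_measures(6, 2, 2, 1))
```

With these made-up counts the error rate is 0.5, Precision is 2/3, Recall is 0.6 and F1 is 12/19 (about 0.632).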


There are 9 slots ('subject','object','indirect_object','noun_adjective_complement','verb_complement','preposition_1',


Slots are aligned automatically as follows. If a slot is filled in both the gold and the candidate run, this is a "match". If a slot is filled in the gold but not in the run, it is counted as a deletion. Similarly, if a slot is filled in the run but not in the gold, it is counted as an insertion. If a match (aligned slots) also features a semantic type match, it is considered correct (1 point); if not, it is considered a substitution. Thus only slots that match both syntactically and semantically count for 1 point.
The current scoring also gives credit for partial semantic matches. When the gold slot contains a hypernym of the semantic type in the candidate run slot, the match is given 0.5 points. When the candidate run slot contains a hypernym of the semantic type in the gold slot, it is given 0.25 points. When such partial credit C is given, the remainder (S = 1 - C) is added to the substitution count (e.g. a 0.5 credit means the slot was half correct and half substituted).
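
The alignment and partial-credit rules above can be sketched as follows. This is only an illustration, not the official scorer: the function name, the representation of a pattern as a dictionary from slot name to semantic type, the `hypernyms` lookup table and the semantic type labels are all hypothetical, and the sketch only considers the hypernyms listed in that table.

```python
def count_slot_outcomes(gold, run, hypernyms):
    """Return (correct, substitution, deletion, insertion) for one pattern.
    gold, run: {slot_name: semantic_type}, filled slots only (assumed format).
    hypernyms: {semantic_type: set of its hypernym types} (hypothetical table)."""
    correct = substitution = deletion = insertion = 0.0
    for slot in set(gold) | set(run):
        if slot not in run:
            deletion += 1      # filled in the gold only
        elif slot not in gold:
            insertion += 1     # filled in the run only
        else:
            g, r = gold[slot], run[slot]
            if g == r:
                credit = 1.0   # syntactic and semantic match
            elif g in hypernyms.get(r, ()):
                credit = 0.5   # gold type is a hypernym of the run type
            elif r in hypernyms.get(g, ()):
                credit = 0.25  # run type is a hypernym of the gold type
            else:
                credit = 0.0   # plain substitution
            correct += credit
            substitution += 1.0 - credit  # S = 1 - C
    return correct, substitution, deletion, insertion


# Illustrative data only; the semantic type names are made up for the example.
hypernyms = {"Food": {"Entity"}, "Human": {"Entity"}}
gold = {"subject": "Human", "object": "Entity", "indirect_object": "Human"}
run = {"subject": "Human", "object": "Food", "verb_complement": "Activity"}
print(count_slot_outcomes(gold, run, hypernyms))
```

In this toy case the subject is fully correct, the object earns 0.5 points, the indirect object is a deletion and the verb complement is an insertion, giving 1.5 correct, 0.5 substitution, 1 deletion and 1 insertion.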


The final score is

Score_Task3 = Sum(F1_verb) / n_verb



  • [Amigó&al09] Amigó, Enrique and Gonzalo, Julio and Artiles, Javier and Verdejo, Felisa. 2009. "A comparison of extrinsic clustering evaluation metrics based on formal constraints". Information Retrieval 12, 4. 461–486.
  • [Bagga&Baldwin98] Bagga, Amit and Baldwin, Breck. 1998. "Entity-Based Cross-Document Coreferencing Using the Vector Space Model". In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL’98). 79–85.
  • [Makhoul&al99] Makhoul, John and Kubala, Francis and Schwartz, Richard and Weischedel, Ralph. 1999. "Performance measures for information extraction". In Proceedings of DARPA Broadcast News Workshop. 249–252.

Contact Info

  • Vít Baisa (Masaryk University, Brno, CZ),
  • Jane Bradbury (University of Wolverhampton, UK),
  • Ismaïl El Maarouf (University of Wolverhampton, UK),
  • Patrick Hanks (University of Wolverhampton, UK),
  • Adam Kilgarriff (Lexical Computing Ltd, UK),
  • Octavian Popescu (FBK, Trento, IT)


  • September, 29th: Training data has been updated!
  • August, 19th: Training data has been released!
  • June, 3rd: Trial data has been released!
  • June, 5th: A Google group for discussion has been created. You can email us at: semeval2015task15@googlegroups.com