Evaluation
This page will be regularly updated with information on the evaluation.
Evaluation Measures
The training dataset comes with a scorer which implements a different measure for each subtask.
All tasks will be evaluated with an average F-score.
Task 1
The final score for Task 1 is obtained through several levels of averaging. The F-score is first computed for each category C:
F_C = (2*Prec_C*Rec_C)/(Prec_C+Rec_C)
where Prec_C = Correct_C / Retrieved_C
and Rec_C = Correct_C / Reference_C
Each category belongs either to the Syntax layer or to the Semantic layer. S_layer_l is the average F-score of the categories in layer l:
S_layer_l = Sum(F_C_l) / n_C_l
and the score for each verb is the average of the layers:
S_verb = Sum(S_layer) / n_layer
and the final score for the task is the average over all verb scores:
Score_Task1 = Sum(S_verb) / n_verb
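As an illustration, here is a minimal Python sketch of this aggregation. It assumes a hypothetical data layout in which the per-category counts (Correct, Retrieved, Reference) are grouped by verb and layer; the scorer distributed with the training data may organise its input differently.

```python
# Hypothetical layout: counts[verb][layer][category] = (correct, retrieved, reference)

def f_score(correct, retrieved, reference):
    """Per-category F-score from raw counts (0.0 when undefined)."""
    prec = correct / retrieved if retrieved else 0.0
    rec = correct / reference if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def task1_score(counts):
    """Average category F-scores per layer, then layers per verb, then verbs."""
    verb_scores = []
    for layers in counts.values():
        layer_scores = []
        for categories in layers.values():
            cat_f = [f_score(*c) for c in categories.values()]
            layer_scores.append(sum(cat_f) / len(cat_f))            # S_layer_l
        verb_scores.append(sum(layer_scores) / len(layer_scores))   # S_verb
    return sum(verb_scores) / len(verb_scores)                      # Score_Task1

# Toy example with a single verb and two layers:
counts = {"verb_1": {"syntax":   {"subject": (8, 10, 12), "object": (5, 6, 9)},
                     "semantic": {"Agent": (7, 9, 11)}}}
print(task1_score(counts))
```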
Task 2
Task 2 uses the B-cubed F-score, which differs from the standard F-score in the way Precision and Recall are computed. The same principle yields both scores: they differ only by interchanging the roles of the gold reference and the candidate run. Thus,
Prec = B^3_gold,
Rec = B^3_run,
F-score = (2*Prec*Rec) / (Prec+Rec)
B^3 is a measure of cluster concordance which compares the pairs of data points shared between two clusterings and weights this value by the size of each cluster, from the perspective of each data point. In other words, it provides a precision and a recall measure for each data point; the global Precision and Recall are obtained by averaging over all data points.
B^3_gold_i = number of pairs in the run cluster also found in the gold cluster / total number of pairs in the run cluster
B^3_run_i = number of pairs in the gold cluster also found in the run cluster / total number of pairs in the gold cluster
NB: the number of pairs also includes the case where a pair is made of the same data point (the self-pair).
Prec = average(B^3_gold_i)
Rec = average(B^3_run_i)
F_verb = (2*Prec*Rec)/(Prec+Rec)
and the final score is
Score_Task2 = Sum(F_verb) / n_verb
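The following Python sketch shows one way to compute these quantities. It assumes, hypothetically, that each verb's gold and candidate clusterings are given as mappings from occurrence identifiers to cluster labels (over the same set of occurrences); the item-wise formulation used here is equivalent to counting pairs with the self-pair included.

```python
def bcubed_prf(gold, run):
    """B-cubed Precision, Recall and F-score for one verb.

    `gold` and `run` map each data point (occurrence id) to a cluster label.
    For each point, precision is the fraction of points sharing its run
    cluster that also share its gold cluster (the point itself included),
    and recall is the same quantity with gold and run interchanged.
    """
    points = list(gold)
    prec_i, rec_i = [], []
    for p in points:
        same_run = [q for q in points if run[q] == run[p]]
        same_gold = [q for q in points if gold[q] == gold[p]]
        both = [q for q in same_run if gold[q] == gold[p]]
        prec_i.append(len(both) / len(same_run))
        rec_i.append(len(both) / len(same_gold))
    prec = sum(prec_i) / len(prec_i)
    rec = sum(rec_i) / len(rec_i)
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

def task2_score(gold_by_verb, run_by_verb):
    """Average the per-verb B-cubed F-scores."""
    fs = [bcubed_prf(gold_by_verb[v], run_by_verb[v])[2] for v in gold_by_verb]
    return sum(fs) / len(fs)
```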
Task 3
Task 3 counts the errors made by the system and outputs an F-score for each run, based on the following error rate:
Error rate = Total number of slot errors / Total number of slots in the reference
= (Substitution + Deletion + Insertion) / (Correct + Substitution + Deletion)
In order to derive an F-score, we compute the following Precision and Recall measures:
Precision = Correct / (Correct + Substitution + Insertion)
Recall = Correct / (Correct + Substitution + Deletion)
F1_verb = (2*Prec*Rec) / (Rec+Prec)
This measure is known to "deweight" the effect of deletion and insertion errors (see Makhoul et al., 1999).
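Under these definitions, the conversion from error counts to a per-verb F-score is straightforward; a small Python sketch (with hypothetical argument names) is given below.

```python
def slot_prf(correct, substitution, deletion, insertion):
    """Precision, Recall and F1 derived from slot error counts."""
    retrieved = correct + substitution + insertion   # slots produced by the run
    reference = correct + substitution + deletion    # slots in the gold reference
    prec = correct / retrieved if retrieved else 0.0
    rec = correct / reference if reference else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```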
There are 9 slots: 'subject', 'object', 'indirect_object', 'noun_adjective_complement', 'verb_complement', 'preposition_1', 'adverbial_complement_1', 'preposition_2', 'adverbial_complement_2'.
Slots are aligned automatically as follows: if a slot is filled both in the gold and in the candidate run, this is a "match". If a slot is filled in the gold but not in the run, it is counted as a deletion. Conversely, if a slot is filled in the run but not in the gold, it is counted as an insertion. If a match (aligned slots) also features a semantic type match, it is considered correct (1 point); if not, it is considered a substitution. Thus only joint syntactic and semantic matches count for 1 point.
The current scorer also offers credit for partial semantic matches. When the gold slot contains a hypernym of the semantic type in the candidate run slot, it is given 0.5 points. When the candidate run slot contains a hypernym of the semantic type in the gold slot, it is given 0.25 points. When such partial scores are credited, their complement to 1 point (S = 1 - C) is added to the substitution count (e.g. a 0.5 match counts as half correct and half substituted).
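The alignment and partial-credit rules can be summarised by the following Python sketch. Here `is_hypernym` is a hypothetical helper standing in for whatever semantic resource the official scorer actually queries, and the slot representation (a semantic type, or None for an empty slot) is an assumption.

```python
SLOTS = ['subject', 'object', 'indirect_object', 'noun_adjective_complement',
         'verb_complement', 'preposition_1', 'adverbial_complement_1',
         'preposition_2', 'adverbial_complement_2']

def align_and_count(gold_slots, run_slots, is_hypernym):
    """Count Correct / Substitution / Deletion / Insertion over the 9 slots.

    `gold_slots` and `run_slots` map slot names to semantic types, or None
    when a slot is empty; `is_hypernym(a, b)` should return True when type
    `a` is a hypernym of type `b` (hypothetical helper).
    """
    correct = substitution = deletion = insertion = 0.0
    for slot in SLOTS:
        g, r = gold_slots.get(slot), run_slots.get(slot)
        if g is None and r is None:
            continue                      # slot empty on both sides
        if r is None:
            deletion += 1                 # filled in gold only
        elif g is None:
            insertion += 1                # filled in run only
        elif g == r:
            correct += 1                  # full syntactic and semantic match
        elif is_hypernym(g, r):
            correct += 0.5                # gold type is a hypernym of the run type
            substitution += 0.5           # S = 1 - C
        elif is_hypernym(r, g):
            correct += 0.25               # run type is a hypernym of the gold type
            substitution += 0.75
        else:
            substitution += 1
    return correct, substitution, deletion, insertion
```

These counts can then be passed to the slot_prf sketch above to obtain F1_verb for each verb.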
The final score is
Score_Task3 = Sum(F1_verb) / n_verb
References
- [Amigó&al.09] Amigó, Enrique, Gonzalo, Julio, Artiles, Javier, and Verdejo, Felisa. 2009. "A comparison of extrinsic clustering evaluation metrics based on formal constraints". Information Retrieval 12(4). 461–486.
- [Bagga&Baldwin98] Bagga, Amit and Baldwin, Breck. 1998. "Entity-Based Cross-Document Coreferencing Using the Vector Space Model". In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL'98). 79–85.
- [Makhoul&al.99] Makhoul, John, Kubala, Francis, Schwartz, Richard, and Weischedel, Ralph. 1999. "Performance measures for information extraction". In Proceedings of the DARPA Broadcast News Workshop. 249–252.