Evaluation < SemEval-2015 Task 15

Evaluation

This page will be regularly updated with information on the evaluation.

Evaluation Measures

The training dataset comes with a scorer that implements a different measure for each subtask.

All tasks will be evaluated using an averaged F-score.

Task 1

The final score for Task 1 is obtained through several successive averages. The F-score is first computed for each category C:

F_C = (2*Prec_C*Rec_C)/(Prec_C+Rec_C)

where Prec_C = Correct_C / Retrieved_C
and Rec_C = Correct_C / Reference_C

Each category belongs to either the Syntax or the Semantic layer. The layer score S_layer is the average F-score of the categories in that layer:

S_layer_l = Sum(F_C_l) / n_C_l

and the score for each verb is the average of the layers:

S_verb = Sum(S_layer) / n_layer

and the final score for the task is the average of all verb scores:

Score_Task1 = Sum(S_verb) / n_verb
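
The chain of averages above can be sketched in Python. This is a minimal illustration and not the official scorer: the function names, the nested-dictionary layout of the counts, the category labels in the example, and the convention of returning 0 when a denominator is zero are all assumptions made for this sketch.

```python
def f_score(correct, retrieved, reference):
    """F-score for one category from raw counts (0.0 when undefined)."""
    prec = correct / retrieved if retrieved else 0.0
    rec = correct / reference if reference else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0


def mean(values):
    """Arithmetic mean of a non-empty iterable."""
    values = list(values)
    return sum(values) / len(values)


def task1_score(counts):
    """counts: {verb: {layer: {category: (correct, retrieved, reference)}}}"""
    verb_scores = []
    for layers in counts.values():
        # S_layer: average F-score over the categories of each layer
        layer_scores = [
            mean(f_score(*c) for c in categories.values())
            for categories in layers.values()
        ]
        # S_verb: average over the layers of this verb
        verb_scores.append(mean(layer_scores))
    # Score_Task1: average over all verbs
    return mean(verb_scores)


# Hypothetical counts for two verbs (illustrative only).
example = {
    "verb_a": {
        "syntax": {"subj": (8, 10, 10), "obj": (5, 10, 5)},
        "semantics": {"Human": (4, 4, 8)},
    },
    "verb_b": {
        "syntax": {"subj": (10, 10, 10)},
        "semantics": {"Human": (0, 5, 5)},
    },
}
print(round(task1_score(example), 4))
```

With these made-up counts, verb_a scores about 0.7 and verb_b scores 0.5, so the task score is about 0.6. Because each level is averaged separately, a layer with few categories weighs as much as a layer with many.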

Task 2

Task 2 uses the B-cubed F-score (see Bagga and Baldwin, 1998; Amigó et al., 2009), which differs from the standard F-score in the way it computes Precision and Recall. The same B^3 measure yields both values: the two scores differ only in that the roles of the gold reference and the candidate run are interchanged. Thus,

Prec = B^3_gold,

Rec = B^3_run,

F-score = (2*Prec*Rec) / (Prec+Rec)

B^3 is a measure of agreement between two clusterings. For each data point, it counts the pairs of data points that are grouped together in both clusterings and normalizes this count by the size of the data point's cluster. In other words, it provides a precision and a recall value for each data point. The global Precision and Recall are obtained by averaging over all data points.

B^3_gold_i = Nb of pairs in run cluster found in gold / total nb of pairs in run_cluster

B^3_run_i = Nb of pairs in gold cluster found in run / total nb of pairs in gold_cluster

Note: the number of pairs also includes the pair formed by a data point with itself.

Prec = average(B^3_gold_i)

Rec = average(B^3_run_i)

F_verb = (2*Prec*Rec) / (Prec+Rec)

and the final score is

Score_Task2 = Sum(F_verb) / n_verb
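
The B-cubed computation for one verb can be sketched as follows. This is a minimal illustration under stated assumptions, not the official scorer: it takes "pairs in a cluster" to mean the pairs formed by a data point with every member of its cluster (itself included), and the function names and the dictionary input format are hypothetical.

```python
def bcubed(gold, run):
    """gold, run: dicts mapping each data point to a cluster label.

    Returns (precision, recall, f_score) for one verb.
    """
    items = list(gold)
    prec_sum = rec_sum = 0.0
    for i in items:
        run_cluster = {j for j in items if run[j] == run[i]}
        gold_cluster = {j for j in items if gold[j] == gold[i]}
        # Points grouped with i in both clusterings; always includes i itself
        shared = len(run_cluster & gold_cluster)
        prec_sum += shared / len(run_cluster)   # B^3_gold_i
        rec_sum += shared / len(gold_cluster)   # B^3_run_i
    prec = prec_sum / len(items)
    rec = rec_sum / len(items)
    return prec, rec, 2 * prec * rec / (prec + rec)


def task2_score(verbs):
    """verbs: iterable of (gold, run) pairs, one per verb."""
    f_scores = [bcubed(gold, run)[2] for gold, run in verbs]
    return sum(f_scores) / len(f_scores)


# Hypothetical clusterings of four data points for one verb.
example_gold = {"a": 1, "b": 1, "c": 2, "d": 2}
example_run = {"a": "x", "b": "x", "c": "x", "d": "y"}
prec, rec, f = bcubed(example_gold, example_run)
print(round(prec, 4), round(rec, 4), round(f, 4))
```

In this made-up example the run merges "c" into the wrong cluster, which lowers precision to 2/3 and recall to 3/4, giving a per-verb F-score of 12/17 (about 0.706). Since every data point shares at least itself with its own cluster, precision and recall are never zero and the F-score is always defined.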

Task 3

Task 3 enumerates the errors made by the system to output an F-score for each run based on the following error rate:

Error rate = Total number of slot errors / Total number of slots in the reference
= (Substitution + Deletion + Insertion) / (Correct + Substitution + Deletion)

In order to derive an F-score, we compute the following Precision and Recall measures:

Precision = Correct / (Correct + Substitution + Insertion)
Recall = Correct / (Correct + Substitution + Deletion)
F1_verb = (2*Prec*Rec) / (Rec+Prec)

This measure is known to "deweight" the effect of Deletion (D) and Insertion (I) errors (see Makhoul et al., 1999).
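
As a worked illustration of these formulas, the sketch below computes the error rate, Precision, Recall and F1 from the four counts. The function name and the choice to return 0 for an empty denominator are assumptions made for this example, and the counts are invented.

```python
def slot_metrics(correct, substitution, deletion, insertion):
    """Return (error_rate, precision, recall, f1) from slot counts."""
    reference = correct + substitution + deletion    # slots in the gold
    hypothesis = correct + substitution + insertion  # slots in the run
    errors = substitution + deletion + insertion
    error_rate = errors / reference if reference else 0.0
    precision = correct / hypothesis if hypothesis else 0.0
    recall = correct / reference if reference else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return error_rate, precision, recall, f1


# Hypothetical counts: 6 correct, 2 substitutions, 2 deletions, 1 insertion.
error_rate, precision, recall, f1 = slot_metrics(6, 2, 2, 1)
print(error_rate, round(precision, 4), recall, round(f1, 4))
```

With these counts the reference holds 10 slots, so the error rate is 5/10 = 0.5, Precision is 6/9, Recall is 6/10, and F1 is 12/19 (about 0.632). The counts may be fractional, so the same function also applies when partial credit is given.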

There are 9 slots: 'subject', 'object', 'indirect_object', 'noun_adjective_complement', 'verb_complement', 'preposition_1', 'adverbial_complement_1', 'preposition_2', 'adverbial_complement_2'.

Slots are aligned automatically as follows: if a slot is filled in both the gold and the candidate run, this is a "match". If a slot is filled in the gold but not in the run, it is counted as a deletion. Similarly, if a slot is filled in the run but not in the gold, it is counted as an insertion. If a match (aligned slots) also has the same semantic type, it is considered correct (1 point). If not, it is considered a substitution. Thus only slots that match both syntactically and semantically count for 1 point.
The current scorer also gives credit for partial semantic matches. When the gold slot contains a hypernym of the semantic type in the candidate run slot, the slot is given 0.5 points. When the candidate run slot contains a hypernym of the semantic type in the gold slot, it is given 0.25 points. When such partial credit C is given, the remainder (S = 1 - C) is added to the substitution count (e.g. a slot with 0.5 points counts as half correct and half substituted).
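
The alignment and partial-credit rules can be sketched as below. This is only an illustration of the rules as described here, not the official scorer: the function names, the dictionary input format, and the toy type hierarchy PARENT are invented for this example, and the real hypernym relation comes from the task's own ontology.

```python
SLOTS = [
    "subject", "object", "indirect_object", "noun_adjective_complement",
    "verb_complement", "preposition_1", "adverbial_complement_1",
    "preposition_2", "adverbial_complement_2",
]

# Toy child -> parent hierarchy, invented for this example only.
PARENT = {"Human": "Animate", "Animate": "Entity"}


def is_hypernym(ancestor, descendant):
    """True if `ancestor` lies strictly above `descendant` in PARENT."""
    node = PARENT.get(descendant)
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)
    return False


def count_slots(gold, run):
    """gold, run: dicts mapping filled slot names to a semantic type.

    Returns (correct, substitution, deletion, insertion).
    """
    c = s = d = i = 0.0
    for slot in SLOTS:
        g, r = gold.get(slot), run.get(slot)
        if g is None and r is None:
            continue              # slot empty on both sides
        if r is None:
            d += 1                # filled in gold only
        elif g is None:
            i += 1                # filled in run only
        else:
            if g == r:
                credit = 1.0      # full semantic match
            elif is_hypernym(g, r):
                credit = 0.5      # gold type is more general
            elif is_hypernym(r, g):
                credit = 0.25     # run type is more general
            else:
                credit = 0.0
            c += credit
            s += 1.0 - credit     # remainder counts as substitution
    return c, s, d, i


# Hypothetical gold and run patterns for one verb.
example_gold_slots = {"subject": "Human", "object": "Entity",
                      "adverbial_complement_1": "Animate"}
example_run_slots = {"subject": "Human", "object": "Human",
                     "indirect_object": "Human"}
print(count_slots(example_gold_slots, example_run_slots))
```

In this made-up case the subject is fully correct, the object earns 0.5 points because the gold type is a hypernym of the run type, the indirect object is an insertion and the adverbial complement is a deletion, giving counts of (1.5, 0.5, 1.0, 1.0).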

The final score is

Score_Task3 = Sum(F1_verb) / n_verb

References

[Amigó&al09] Amigó, Enrique and Gonzalo, Julio and Artiles, Javier and Verdejo, Felisa. 2009. "A comparison of extrinsic clustering evaluation metrics based on formal constraints". Information Retrieval 12, 4. 461–486.
[Bagga&Baldwin98] Bagga, Amit and Baldwin, Breck. 1998. "Entity-Based Cross-Document Coreferencing Using the Vector Space Model". In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL’98). 79–85.
[Makhoul&al99] Makhoul, John and Kubala, Francis and Schwartz, Richard and Weischedel, Ralph. 1999. "Performance measures for information extraction". In Proceedings of DARPA Broadcast News Workshop. 249–252.
