Evaluation
Task Evaluation
In Task 3, systems will be evaluated both (1) within each comparison type and (2) across all comparison types. Systems that participate in only a single comparison type will be excluded from the all-comparison system ranking. However, their inclusion in the single-comparison-type setting will enable us to identify any performance gap between more general systems and specialized ones.
The system outputs will be compared against the gold standard ratings in two ways, using Pearson correlation and Spearman's rank correlation (rho). Pearson correlation measures how closely the system's similarity ratings track the gold standard ratings as a linear relationship. Spearman's rho measures how closely the system's ranking of the items by similarity matches the ranking given by the gold standard.
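The difference between the two measures can be seen in a small sketch. This is not the official scoring program; it is a minimal, standard-library illustration, and the function names (pearson, average_ranks, spearman) and the example ratings are made up for this purpose. It assumes that tied values receive the average of their ranks, which is the usual convention for Spearman's rho.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Illustrative (made-up) ratings: the system orders every pair correctly,
# but its scores are not a linear function of the gold ratings.
gold = [0.0, 1.0, 2.0, 3.0, 4.0]
system = [0.1, 0.2, 0.4, 0.8, 1.6]
rho = spearman(gold, system)
r = pearson(gold, system)
```

In this example the rankings agree exactly, so Spearman's rho is 1, while the Pearson correlation is below 1 because the system's scores grow nonlinearly relative to the gold ratings.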
Task 3 requires that systems report a similarity score for every pair; submissions with missing ratings will be rejected. However, we recognize that some pairs may be difficult for systems to answer. Therefore, we allow teams to report an additional value with each similarity score to indicate the system's confidence in the score. Confidence values must be in [0,1], where 0 indicates the system is least confident in the score and 1 indicates the system is most confident. Confidence values are reported in a second column of the tab-delimited system output:
score<tab>confidence
Lines without confidence values are assumed to have a confidence of 1 and are always included. The official scoring program provides a way to consider only similarity scores whose confidence values are at or above a threshold value. In that setting, scores whose confidence falls below the threshold are treated as missing values when computing the correlations. While these confidence-based scores will not be used in the official Task 3 evaluation, they can give teams an idea of how their system performs when its least confident similarity scores are omitted.
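The confidence handling described above can be sketched as follows. This is an illustration only, not the official scoring program: the function names (parse_line, retained_pairs), the example lines, and the threshold of 0.5 are hypothetical. The sketch assumes one system output line per pair, in the same order as the gold standard ratings.

```python
def parse_line(line):
    """Parse one tab-delimited output line into (score, confidence).

    A line with no second column is given a confidence of 1.0.
    """
    fields = line.rstrip("\n").split("\t")
    score = float(fields[0])
    if len(fields) > 1 and fields[1].strip():
        confidence = float(fields[1])
    else:
        confidence = 1.0
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0,1]: %r" % line)
    return score, confidence

def retained_pairs(system_lines, gold_ratings, threshold):
    """Keep only the (system, gold) ratings whose confidence >= threshold.

    Dropped items play the role of missing values in the confidence-based
    scores; the returned lists stay aligned with each other.
    """
    if len(system_lines) != len(gold_ratings):
        raise ValueError("every pair needs a similarity score")
    kept_system, kept_gold = [], []
    for line, gold in zip(system_lines, gold_ratings):
        score, confidence = parse_line(line)
        if confidence >= threshold:
            kept_system.append(score)
            kept_gold.append(gold)
    return kept_system, kept_gold

# Made-up example: the second score has low confidence, and the third
# line has no confidence column, so it defaults to 1.0.
lines = ["3.5\t0.9", "1.0\t0.2", "2.0"]
gold = [4.0, 1.0, 2.0]
kept_system, kept_gold = retained_pairs(lines, gold, threshold=0.5)
```

With a threshold of 0.5, only the first and third scores are retained, and a correlation would then be computed over those two aligned lists. With a threshold of 0, every score is retained, which corresponds to the official setting in which all scores count.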