Subtasks

English Subtasks

There are three reranking subtasks associated with the English dataset. Subtask A is the same as subtask A at SemEval-2015 Task 3, but with slightly different annotation and a different evaluation measure.

 

Subtask A: Question-Comment Similarity

Given

  • a question and
  • its first 10 comments in the question thread,

rerank these 10 comments according to their relevance with respect to the question.

We want the "Good" comments to be ranked above the "PotentiallyUseful" or "Bad" comments; the latter two will not be distinguished and will be considered "Bad" in terms of evaluation. The gold labels for this subtask are contained in the RELC_RELEVANCE2RELQ field of the XML files. See the datasets README file for a detailed description of the XML file format. Although we target the first 10 comments in terms of their time of posting (rather than in terms of their relevance), the output of the system will be a ranked list of comments (in terms of probability of relevance). Therefore, this is a ranking task, not a classification task.


Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, is mean average precision (MAP) over the 10 ranked comments.
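For concreteness, the sketch below shows one common way to compute average precision truncated at the top k positions. It is only an illustration and may differ from the official scorer in edge cases (e.g., questions with no "Good" comments), so the scorer from the tools page remains authoritative.

```python
# Unofficial sketch of MAP over the top-k ranked items; the official scorer
# from the tools page is authoritative and may treat edge cases differently.
def average_precision(ranked_labels, k=10):
    """ranked_labels: binary gold labels in the system's ranked order
    (the full candidate list); only the top-k positions contribute."""
    num_relevant = sum(ranked_labels)
    if num_relevant == 0:
        return 0.0                      # assumption: such queries score zero
    hits, ap = 0, 0.0
    for i, rel in enumerate(ranked_labels[:k], start=1):
        if rel:
            hits += 1
            ap += hits / i              # precision at this relevant position
    return ap / min(num_relevant, k)

def mean_average_precision(all_ranked_labels, k=10):
    aps = [average_precision(labels, k) for labels in all_ranked_labels]
    return sum(aps) / len(aps) if aps else 0.0
```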

Note: the dataset is already formatted for training on this subtask. The format required for the output of your systems will be detailed in the scorer and format-checker README files.

 

Subtask B: Question-Question Similarity

Given

  • a new question (aka original question) and
  • the set of the first 10 related questions (retrieved by a search engine),

rerank the related questions according to their similarity with respect to the original question. In this case, both "PerfectMatch" and "Relevant" questions are considered good (i.e., we will not distinguish between them and will treat them both as "Relevant"), and they should be ranked above the "Irrelevant" questions. The gold labels for this subtask are contained in the RELQ_RELEVANCE2ORGQ field of the XML file; see the README file for a detailed explanation of their meaning. Again, this is a ranking task, not a classification task.
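As a purely illustrative way of producing such a ranking (not the task baseline), one could score each related question against the original question with a generic text-similarity measure, for example TF-IDF cosine similarity. The sketch below assumes scikit-learn is available and that each related question is represented by a single text string.

```python
# Naive, unofficial ranking sketch for Subtask B: order the 10 related
# questions by TF-IDF cosine similarity to the original question.
# This is only an illustration of producing a ranked list; real systems
# and the official baselines may work very differently.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_related_questions(original_question, related_questions):
    """related_questions: list of (question_id, text) pairs."""
    texts = [original_question] + [text for _, text in related_questions]
    tfidf = TfidfVectorizer().fit_transform(texts)
    scores = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    ranked = sorted(zip(related_questions, scores),
                    key=lambda pair: pair[1], reverse=True)
    return [(qid, float(score)) for (qid, _), score in ranked]
```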

 

Evaluation: As in Subtask A, the official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, is MAP over the 10 ranked questions.
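In terms of evaluation, this amounts to mapping the RELQ_RELEVANCE2ORGQ labels to binary relevance before computing MAP, roughly as in the unofficial sketch below (reusing average_precision from the Subtask A sketch above).

```python
# Unofficial sketch: evaluate one original question in Subtask B by treating
# "PerfectMatch" and "Relevant" as relevant and everything else as not.
# average_precision() is the illustrative function from the Subtask A sketch.
def ap_for_original_question(ranked_question_ids, gold_labels, k=10):
    """gold_labels: dict mapping related-question ID -> RELQ_RELEVANCE2ORGQ label."""
    binary = [1 if gold_labels[qid] in {"PerfectMatch", "Relevant"} else 0
              for qid in ranked_question_ids]
    return average_precision(binary, k=k)
```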

Note: in this case too, the dataset is already formatted for training on this subtask. For each original question, you have to consider the 10 related questions associated with it; they appear consecutively in the dataset. The format required for the output of your systems will be detailed in the scorer and in the format-checker README files.

 

Subtask C: Question-External Comment Similarity -- this is the main English subtask.

Given:

  • a new question (aka the original question),
  • the set of the first 10 related questions (retrieved by a search engine), each associated with its first 10 comments appearing in its thread,

rerank the 100 comments (10 questions x 10 comments each) according to their relevance with respect to the original question. We want the "Good" comments to be ranked above the "PotentiallyUseful" and "Bad" comments, which will both be considered "Bad" for evaluation purposes (the gold labels are contained in the RELC_RELEVANCE2ORGQ field of the related XML file). We will evaluate the position of the good comments in the ranking; thus, this is again a ranking task.

Although the systems are supposed to rank all 100 comments, we take an application-oriented view in the evaluation: we assume that potential users are presented with a relatively short list of candidate answers (e.g., 10, as in common search engines today). Thus, users would like the good comments to be concentrated in the first 10 positions (ideally, with all good comments ranked before any non-good comment). We believe the user cares much less about what happens at lower positions in the ranking (e.g., after the 10th), as they typically do not ask for the next page of results. This is reflected in our primary evaluation score, which only considers the top 10 results.
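The sketch below illustrates how a system might assemble and rerank the 100 candidates for one original question. Here score_comment is a hypothetical placeholder for whatever scoring model a participant uses, and only the top 10 positions of the resulting list matter for the official measure.

```python
# Unofficial sketch of Subtask C reranking for one original question.
# score_comment() is a hypothetical placeholder for any model that scores a
# comment against the original question (optionally also using the related
# question it came from).
def rerank_external_comments(original_question, related_threads, score_comment):
    """related_threads: list of (related_question, comments) pairs, where
    comments is a list of (comment_id, comment_text) pairs
    (10 x 10 = 100 candidates in total)."""
    candidates = []
    for related_question, comments in related_threads:
        for comment_id, comment_text in comments:
            score = score_comment(original_question, related_question,
                                  comment_text)
            candidates.append((comment_id, score))
    # One global ranking over all 100 comments; the official evaluation only
    # considers the first 10 positions of this list.
    return sorted(candidates, key=lambda item: item[1], reverse=True)
```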

 

Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, is MAP over the first 10 ranked comments only.

 

Note: The datasets are already provided in a form appropriate for this subtask. For each original question there is a list of 10 related questions with 10 comments each (see the README file that comes with the data distribution). The test set will follow the same format. The format required for the output of your systems will be detailed in the scorer and in the format-checker README files.

 

 

Arabic Subtask

Subtask D: Rerank the correct answers for a new question.

Given the extra challenges that the Arabic language entails (e.g., it is not spoken by most NLP researchers, and fewer resources and toolkits are available), we target only one subtask, which is a simplified version of English Subtask C.

 

Given

  • a new question (aka the original question),
  • the set of the first 30 related questions (retrieved by a search engine), each associated with one correct answer (typically one or two paragraphs long),

 

rerank the 30 question-answer pairs according to their relevance with respect to the original question. We want the "Direct" (D) and "Relevant" (R) answers to be ranked above the "Irrelevant" (I) answers; the former two will both be considered "Relevant" for evaluation purposes (the gold labels are contained in the QArel field of the XML file). We will evaluate the position of the "Relevant" answers in the ranking; therefore, this is again a ranking task.

Unlike the English subtasks, here we use 30 answers because the retrieval task is much more difficult: recall is low, and correct answers are much less frequent.
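Concretely, the evaluation again reduces to a binary mapping of the gold labels followed by truncated MAP, roughly as sketched below. Whether the QArel field stores full labels or the letters D/R/I is an assumption here, and average_precision is the unofficial sketch from the Subtask A section.

```python
# Unofficial sketch for one Arabic original question: "Direct" and "Relevant"
# count as relevant, "Irrelevant" does not. Whether QArel stores full labels
# or the letters D/R/I is an assumption; average_precision() is the
# illustrative function from the Subtask A sketch.
RELEVANT_ARABIC = {"D", "R", "Direct", "Relevant"}

def ap_for_arabic_question(ranked_pair_ids, gold_labels, k=10):
    """gold_labels: dict mapping question-answer pair ID -> QArel label."""
    binary = [1 if gold_labels[pid] in RELEVANT_ARABIC else 0
              for pid in ranked_pair_ids]
    return average_precision(binary, k=k)
```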

 

Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, is MAP over the top 10 ranked question-answer pairs.

 

Note: The datasets are already provided in a form appropriate for this subtask. For each original question there is a list of 30 related question-answer pairs (see the README file that comes with the data distribution). The test set will follow the same format. The format required for the output of your systems will be detailed in the scorer and in the format-checker README files. We use different terminology to better characterize the Arabic subtask, which is much closer to a traditional QA task. That said, "Direct", "Relevant" and "Irrelevant" may be roughly mapped to "PerfectMatch", "Relevant" and "Bad", respectively, in the English Subtask B. Note, however, that for the Arabic Subtask D we only evaluate positively the "Direct" and the "Relevant" answers, while the "Irrelevant" ones have to be pushed below them.

 

Contact Info

Organizers


  • Preslav Nakov, Qatar Computing Research Institute, HBKU
  • Lluís Màrquez, Qatar Computing Research Institute, HBKU
  • Alessandro Moschitti, Qatar Computing Research Institute, HBKU
  • Walid Magdy, Qatar Computing Research Institute, HBKU
  • James Glass, CSAIL-MIT
  • Bilal Randeree, Qatar Living

Email: semeval-cqa@googlegroups.com

Other Info

Announcements


  • Task description paper is now released!
  • EVALUATION results are now released!
  • Test format checker has been released!
  • Test data has been released!
  • Arabic TRAIN+DEV data v1.3 released!
  • English TRAIN+DEV data v3.2 released!
  • Scorer and baselines v2.2 released!
  • Register to participate here