Subtasks
There are five subtasks:
- Subtask A (English): Question-Comment Similarity
- Subtask B (English): Question-Question Similarity
- Subtask C (English): Question-External Comment Similarity
- Subtask D (Arabic): Rerank correct answers for a new Question
- Subtask E (English): Multi-Domain Duplicate Question Detection
Note: For instructions on how to submit system results to any of these tasks, please check this page.
English Subtasks
There are three reranking subtasks associated with the English dataset. Subtask A is the same as subtask A at SemEval-2015 Task 3, but with slightly different annotation and a different evaluation measure.
Subtask A: Question-Comment Similarity
Given
- a question and
- its first 10 comments in the question thread,
rerank these 10 comments according to their relevance with respect to the question.
We want the "Good" comments to be ranked above the "PotentiallyUseful" and "Bad" comments; the latter two will not be distinguished and will both be considered "Bad" for evaluation purposes. The gold labels for this subtask are contained in the RELC_RELEVANCE2RELQ field of the XML files. See the datasets README file for a detailed description of the XML file format. Although we target the first 10 comments in terms of their time of posting (rather than their relevance), the output of a system should be a list of these comments ranked by relevance (e.g., by the estimated probability of being "Good"). Therefore, this is a ranking task, not a classification task.
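To make the setup concrete, here is a minimal sketch of what a system for this subtask produces, using a toy token-overlap scorer. Only the RELC_RELEVANCE2RELQ attribute name comes from the description above; the other element names and the input file name are illustrative placeholders (check the datasets README for the authoritative format).

```python
# Minimal sketch (not the official baseline): read one question thread from
# the task XML and rerank its comments by a placeholder relevance score.
# Assumption: element names ("Thread", "RelQuestion", "RelQBody", "RelComment",
# "RelCText") and the file name are illustrative; only RELC_RELEVANCE2RELQ
# comes from the task description.
import xml.etree.ElementTree as ET

def token_overlap(question_text, comment_text):
    """Toy relevance score: fraction of question tokens that appear in the comment."""
    q_tokens = set(question_text.lower().split())
    c_tokens = set(comment_text.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)

def rerank_thread(thread):
    question_text = thread.find("RelQuestion/RelQBody").text or ""
    scored = []
    for comment in thread.findall("RelComment"):
        text = comment.find("RelCText").text or ""
        gold = comment.get("RELC_RELEVANCE2RELQ")  # "Good" / "PotentiallyUseful" / "Bad"
        scored.append((comment.get("RELC_ID"), token_overlap(question_text, text), gold))
    # Higher score first: this ranked list is what the scorer evaluates with MAP.
    return sorted(scored, key=lambda x: x[1], reverse=True)

tree = ET.parse("SemEval-Task3-CQA-QL-train.xml")  # hypothetical file name
for thread in tree.getroot().iter("Thread"):
    for comment_id, score, gold in rerank_thread(thread):
        print(comment_id, score, gold)
```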
Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, will be mean average precision (MAP) computed over the 10 ranked comments.
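For illustration only (the official scorer remains authoritative), MAP over the 10 ranked comments can be computed as below, treating "Good" as relevant and everything else as non-relevant; details such as tie handling and questions without any "Good" comment may be handled differently by the official scorer.

```python
# Illustration of the evaluation measure, not the official scorer: average
# precision for one ranked comment list ("Good" = relevant), and MAP as the
# mean over all questions.
def average_precision(ranked_gold_labels):
    """ranked_gold_labels: gold labels of the 10 comments in system rank order."""
    hits, precision_sum = 0, 0.0
    for i, label in enumerate(ranked_gold_labels, start=1):
        if label == "Good":
            hits += 1
            precision_sum += hits / i
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_question_rankings):
    aps = [average_precision(r) for r in per_question_rankings]
    return sum(aps) / len(aps)

# Example: one question where the 2nd and 4th ranked comments are "Good".
print(average_precision(["Bad", "Good", "Bad", "Good"] + ["Bad"] * 6))  # 0.5
```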
Note: the dataset is provided in a format suitable for training systems for this subtask. The format required for the output of your systems will be detailed in the scorer and format-checker README files.
Subtask B: Question-Question Similarity
Given
- a new question (aka original question) and
- the set of the first 10 related questions (retrieved by a search engine),
rerank the related questions according to their similarity with respect to the original question. In this case, we will consider the "PerfectMatch" and "Relevant" questions both as good (i.e., we will not distinguish between them and we will consider them both "Relevant"), and they should be ranked above the "Irrelevant" questions. The gold labels for this subtask are contained in the RELQ_RELEVANCE2ORGQ field of the XML file. See the README file for a detailed explanation of their meaning. Again, this is not a classification task; it is a ranking task.
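As a concrete, deliberately naive illustration of such a reranker (not the official baseline), one could score each related question by bag-of-words cosine similarity with the original question; the tokenization here is crude on purpose.

```python
# Naive question-question similarity baseline sketch: rerank the 10 related
# questions by bag-of-words cosine similarity with the original question.
# For evaluation, "PerfectMatch" and "Relevant" (field RELQ_RELEVANCE2ORGQ)
# both count as relevant.
from collections import Counter
from math import sqrt

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def rerank_related(original_question, related_questions):
    """related_questions: list of (question_id, question_text) in search-engine order."""
    return sorted(related_questions,
                  key=lambda q: cosine(original_question, q[1]),
                  reverse=True)
```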
Evaluation: As in Subtask A, the official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, is MAP computed over the 10 ranked questions.
Note: as in Subtask A, the dataset is provided in a format suitable for training systems for this subtask. For each original question, you have to consider the 10 related questions associated with it; they appear consecutively in the dataset. The format required for the output of your systems will be detailed in the scorer and format-checker README files.
Subtask C: Question-External Comment Similarity -- this is the main English subtask.
Given:
- a new question (aka the original question),
- the set of the first 10 related questions (retrieved by a search engine), each associated with its first 10 comments appearing in its thread,
rerank the 100 comments (10 questions x 10 comments) according to their relevance with respect to the original question. We want the "Good" comments to be ranked above the "PotentiallyUseful" and "Bad" comments, both of which will be considered "Bad" in terms of evaluation (the gold labels are contained in the RELC_RELEVANCE2ORGQ field of the related XML file). We will evaluate the position of the "Good" comments in the ranking; thus, this is again a ranking task.
Although the systems are supposed to rank all 100 comments, we take an application-oriented view in the evaluation: we assume that potential users are presented with a relatively short list of candidate answers (e.g., 10, as in common search engines today). Thus, users would like the good comments to be concentrated in the first 10 positions (i.e., all good comments ranked above any non-good comment). We believe the user cares much less about what happens in lower positions of the ranking (e.g., after the 10th), as they typically do not ask for the next page of results. This is reflected in our primary evaluation score, which only considers the top-10 results.
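One possible strategy, shown here only as a hedged sketch and not as the official method, is to combine a question-question score (as in Subtask B) with a within-thread comment score (as in Subtask A) and keep the top 10 of the 100 comments; question_score and comment_score are hypothetical placeholders for whatever models a participant trains.

```python
# Hedged sketch for Subtask C: rank all 100 external comments by the product
# of a question-question score and a question-comment score, then keep only
# the top positions, since the primary measure is MAP over the first 10.
# question_score() and comment_score() are hypothetical placeholders.
def rank_external_comments(original_question, related_threads,
                           question_score, comment_score, cutoff=10):
    """related_threads: list of (related_question_text, [(comment_id, comment_text), ...])."""
    scored = []
    for rel_question, comments in related_threads:
        q_sim = question_score(original_question, rel_question)
        for comment_id, comment_text in comments:
            c_rel = comment_score(rel_question, comment_text)
            scored.append((comment_id, q_sim * c_rel))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:cutoff]
```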
Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, will be MAP computed over the first 10 ranked comments only.
Note: The datasets are already provided in a form appropriate for this subtask. For each original question there is a list of 10 related questions with 10 comments each (see the README file that comes with the data distribution). The test set will follow the same format. The format required for the output of your systems will be detailed in the scorer and in the format-checker README files.
Arabic Subtask
Subtask D: Rerank the correct answers for a new question.
Given the extra challenges that the Arabic language entails (e.g., it is not spoken by most NLP researchers, and fewer resources and toolkits are available), we target only one subtask, which is a simplified version of the English Subtask C.
Given
- a new question (aka the original question),
- the set of the first 30 related questions (retrieved by a search engine), each associated with one correct answer (typically one or two paragraphs long),
rerank the 30 question-answer pairs according to their relevance with respect to the original question. We want the "Direct" (D) and "Relevant" (R) answers to be ranked above the "Irrelevant" (I) answers; the former two will both be considered "Relevant" in terms of evaluation (the gold labels are contained in the QArel field of the XML file). We will evaluate the position of the "Relevant" answers in the ranking; therefore, this is again a ranking task.
Unlike the English subtasks, here we use 30 answers since the retrieval task is much more difficult, leading to low recall, and the frequency of correct answers is much lower.
Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, will be MAP computed over the top-10 ranked question-answer pairs.
Note: The datasets are already provided in a form appropriate for this subtask. For each original question there is a list of 30 related question-answer pairs (see the README file that comes with the data distribution). The test set will follow the same format. The format required for the output of your systems will be detailed in the scorer and in the format-checker README files. We use different terminology to better characterize the Arabic subtask, which is much more similar to a traditional QA task. That said, "Direct", "Relevant" and "Irrelevant" can be roughly mapped to "PerfectMatch", "Relevant" and "Bad", respectively, in the English Subtask B. Note, however, that for the Arabic Subtask D we only evaluate the "Direct" and "Relevant" answers positively, while the "Irrelevant" ones have to be ranked below them.
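For evaluation purposes, the label mapping described above amounts to a simple binary-relevance scheme; the snippet below is illustrative only, and the exact spelling of the labels stored in the QArel field should be checked against the data README.

```python
# Illustrative binary-relevance mapping for Subtask D; the label spellings
# stored in the QArel field are an assumption to be verified in the README.
BINARY_RELEVANCE = {
    "D": True,   # "Direct"     -> relevant
    "R": True,   # "Relevant"   -> relevant
    "I": False,  # "Irrelevant" -> non-relevant
}
```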
Multi-Domain Duplicate Detection Subtask (CQADupStack Task)
Subtask E: Identify duplicate questions in StackExchange.
Given:
- a new question (aka the original question),
- a set of 50 candidate questions,
rerank the 50 candidate questions according to their relevance with respect to the original question, and truncate the result list so that only "PerfectMatch" questions appear in it. "Related" and "Irrelevant" questions should not be returned in the truncated list. The gold labels are contained in the RELC_RELEVANCE2ORGQ field of the related XML file. We will evaluate both the position of the good questions in the ranking and the length of the returned result list, based on the number of good questions that exist for each original question; thus, this is again a ranking task, but at the same time a result-list truncation task.
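As an illustration (not the official baseline), ranking plus truncation could be implemented by cutting the ranked candidate list at a score threshold; the similarity function and the threshold are hypothetical placeholders that a participant would have to supply and tune.

```python
# Sketch of ranking plus result-list truncation for Subtask E: rank the 50
# candidates by a similarity score and cut the list at a threshold, so that
# ideally only "PerfectMatch" questions remain. "similarity" and "threshold"
# are hypothetical placeholders.
def rank_and_truncate(original_question, candidates, similarity, threshold=0.5):
    """candidates: list of (candidate_id, candidate_text)."""
    scored = sorted(((cid, similarity(original_question, text))
                     for cid, text in candidates),
                    key=lambda x: x[1], reverse=True)
    # Return only candidates predicted to be duplicates; the list may be empty.
    return [(cid, score) for cid, score in scored if score >= threshold]
```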
Evaluation: The official scorer provides a number of evaluation measures to assess the quality of a system's output (see the tools page), but the official evaluation measure, against which all systems will be evaluated and ranked, will be MAP' as adjusted for truncated lists (Liu et al. 2016).
Apart from providing test data in the same domains as the development and training data, we will also supply two test sets from other domains. The best system will therefore be the one that not only performs well in ranking and result-list truncation on test data from the same domains, but also performs well in a cross-domain setting.
Data: The datasets supplied for this subtask span four different domains. The format will be the same as for the other subtasks, but there will be two extra layers. Each original question has 50 candidate questions, and each of these related questions has a number of comments. On top of that, they have a number of answers, and each answer may have comments as well. The difference between answers and comments is that answers should contain a well-formed answer to the question, while comments contain things like requests for clarification, remarks, small additions to someone else's answer, etc. Since the content of StackExchange is provided by the community, this distinction is not always applied consistently.
The data has also been extended with some metadata fields, such as the tags associated with each question, the number of times a question has been viewed, and the score of each question, answer and comment (the number of upvotes it has received from the community, minus the number of downvotes). Separate files with user statistics are provided, containing information such as user reputation, user badges, etc. Participants are allowed to use any of the information provided when computing their rankings and truncations.
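For example, a participant might combine a crude text-overlap signal with some of this metadata as features for a learning-to-rank model; the dictionary keys below are assumed for illustration, and the authoritative field names are defined in the data README.

```python
# Illustrative feature extraction (hypothetical dictionary keys) combining a
# crude text-overlap signal with some of the metadata described above, e.g.
# as input to a learning-to-rank model.
def candidate_features(original_question, candidate):
    """original_question, candidate: dicts with (assumed) keys 'text', 'tags', 'views', 'score'."""
    orig_tokens = set(original_question["text"].lower().split())
    cand_tokens = set(candidate["text"].lower().split())
    return {
        "token_overlap": len(orig_tokens & cand_tokens) / max(len(orig_tokens), 1),
        "shared_tags": len(set(original_question["tags"]) & set(candidate["tags"])),
        "view_count": candidate["views"],
        "vote_score": candidate["score"],  # upvotes minus downvotes
    }
```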
The relevance labels in the development and training data come directly from the users of the StackExchange sites. Users of the forums can vote for questions to be closed as a duplicate of another one; these are the questions labeled as 'PerfectMatch'. The questions labeled as 'Related' are questions that are not duplicates, but that are somehow similar to the original question, again as judged by the StackExchange community. It is possible that some duplicate labels are missing, due to the voluntary nature of duplicate labeling on StackExchange. The development and training data should therefore be considered a silver standard. For the test data we will conduct exhaustive annotation to validate whether each candidate question is a duplicate of the new question.