Task 3: Community Question Answering


NOTE: The results and all scores have been released



Community Question Answering (CQA) forums are gaining popularity online. They are seldom moderated, rather open, and thus they have few restrictions, if any, on who can post and who can answer a question. On the positive side, this means that one can freely ask any question and expect some good, honest answers. On the negative side, it takes effort to go through all possible answers and to make sense of them. For example, it is not unusual for a question to have hundreds of answers, which makes it very time-consuming for the user to inspect and winnow them. The challenge we propose may help automate the process of finding good answers to new questions in a community-created discussion forum (e.g., by retrieving similar questions in the forum and identifying the posts in the answer threads of those questions that answer the question well).

We build on the success of the previous editions of our SemEval tasks on CQA, SemEval-2015 Task 3 and SemEval-2016 Task 3, and present an extended edition for SemEval-2017, which incorporates several novel facets.


Challenge Tasks

Our main CQA task, as in 2016, is:

“given (i) a new question and (ii) a large collection of question-answer threads created by a user community, rank the answer posts that are most useful for answering the new question.”

This is a ranking task. The given question is new with respect to the collection, but is expected to be related to one or several questions in the collection. The best answers can come from different question-answer threads. In the collection, the threads are independent of each other and the lists of answers are chronologically sorted containing some meta-information (e.g., date, user, topic, etc.). The answer posts in a particular thread are intended to answer the question initiating that thread, but since this is a resource created by a community of casual users, there is a lot of noise and irrelevant material, apart from informal language usage and lots of typos and grammatical mistakes. Moreover, the questions in the collection can be related to each other, although not explicitly.

Additionally, we propose two sub-tasks:

  1. Question Similarity (QS): given the new question and a set of related questions from the collection, rank the similar questions according to their similarity to the original question (with the idea that the answers to the similar questions should be answering the new question as well).
  2. Relevance Classification (RC): given a question from a question-answer thread, rank the answer posts according to their relevance with respect to the question.
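
As an illustration of what a ranking for the QS sub-task looks like, here is a minimal sketch of a token-overlap baseline. It is not the organizers' baseline or any participant's system: the Jaccard measure, the function names, and the candidate ids are all assumptions made for this example, and the candidate texts are shortened versions of the English example questions.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(new_question, candidates):
    """Return (candidate_id, score) pairs sorted by descending similarity.

    `candidates` maps a hypothetical candidate id to its question text.
    """
    query = tokenize(new_question)
    scored = [(cid, jaccard(query, tokenize(text)))
              for cid, text in candidates.items()]
    # Highest score first; ties are broken by id so the order is deterministic.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


candidates = {
    "Q1": "How long can I drive in Qatar with my Australian license?",
    "Q2": "How much would it cost to get a driving license?",
}
ranking = rank_by_similarity(
    "Can I drive with an Australian driver's license in Qatar?", candidates
)
for cid, score in ranking:
    print(cid, round(score, 3))
```

A competitive system would replace the overlap score with a learned similarity model; the point here is only that the output is an ordered list of candidates rather than a set of class labels.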

Subtasks 1 and 2 can be used to solve the main task. Nonetheless, one can approach the main task without necessarily going through a pipeline of 1+2. Participants will be free to use whatever approach they want. Note also that participation in the main task and/or the two subtasks will be optional, and participants can go for any combination. We will have separate evaluations and analyses for CQA, QS and RC.

Note that the tasks above are a good testbed for developing other semantic technologies, e.g., textual entailment, semantic similarity, paraphrasing, and natural language inference, inter alia.


Novelties of 2017

We incorporate the largest duplicate question data set made available to date, CQADupStack, released in late 2015. The resource consists of over 7 million threads from 12 StackExchange subforums (each focusing on a particular domain, such as statistics, gaming or TeX). Labels of duplicate questions will be based in part on forum user-provided annotations of duplicates and in part extended by our own annotations.

We propose to use this resource in a new duplicate question detection sub-task. This sub-task is similar to QS but differs in two important aspects. Firstly, it requires carrying out cross-domain question retrieval, by providing training data for only a fraction of the 12 forums (identifying the source forum for each training instance), while test data is derived from a broader set of forums (again, identifying the source forum for each test instance; however, we include “surprise” forums which will only be identified on release of the test data).

Secondly, we introduce an aspect that has been largely ignored in duplicate question detection tasks due to limitations of existing evaluation metrics: result list truncation. While we will supply a relatively large number of candidate questions per query question, in practice only some (or, in many instances, none!) of these will be duplicates. Participating systems should return only those questions they consider to be duplicates, in order of declining confidence. The resulting truncated lists, which may be empty for some queries, will be evaluated using an extension of Mean Average Precision (MAP) that supports the evaluation of truncated lists (Liu et al. 2016).
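
To make result list truncation concrete, the following is a minimal sketch, assuming each candidate comes with a system confidence score and that a fixed threshold decides what is returned. The threshold, the function names, and the candidate ids are hypothetical. The metric shown is plain average precision, not the extension from Liu et al. (2016); in particular, plain average precision is undefined for queries without any duplicates, which is one of the cases the extended metric is meant to cover, so the sketch returns None there.

```python
def truncate_ranking(scored_candidates, threshold):
    """Keep candidates with confidence >= threshold, in declining confidence."""
    kept = [(cid, score) for cid, score in scored_candidates if score >= threshold]
    kept.sort(key=lambda pair: -pair[1])
    return [cid for cid, _ in kept]


def average_precision(returned, gold_duplicates):
    """Plain average precision for one query over a (possibly truncated) list.

    Returns None when the query has no gold duplicates, because the standard
    definition divides by the number of gold duplicates.
    """
    if not gold_duplicates:
        return None
    hits = 0
    precision_sum = 0.0
    for rank, cid in enumerate(returned, start=1):
        if cid in gold_duplicates:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gold_duplicates)


# Hypothetical confidences for four candidate questions of one query.
scored = [("c1", 0.91), ("c2", 0.15), ("c3", 0.64), ("c4", 0.40)]
returned = truncate_ranking(scored, threshold=0.5)
ap = average_precision(returned, gold_duplicates={"c3", "c4"})
print(returned, ap)
```

Raising the threshold shortens the returned list and can leave it empty, which is a valid output for queries that have no duplicates among their candidates.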


Data and Languages

Keeping with the multilinguality feature from the past editions, for the main task we will provide the data for two languages: English and Arabic. We will reuse the same datasets from 2016 but annotate fresh test sets for all subtasks and languages. More importantly, we will incorporate the English CQADupStack datasets, made up of 7,214,697 threads across 12 subforums extracted from StackExchange. Users of StackExchange mark questions as duplicates when they are noticed to have been previously answered. This provides a substantial resource for exploring the semantic similarity or entailment among questions and question-answer threads.

For a precise definition of all subtasks and the evaluation see the Task Description page. The corpora and the tools can be downloaded from the Data and Tools page.


Examples for all Tasks and Languages


(Simplified) English Example

Let us consider the following question, which is NOT in Qatar Living (QL):


Q: Can I drive with an Australian driver’s license in Qatar?


Retrieved Questions:


Q1: How long can i drive in Qatar with my international driver's permit before I'm forced to change my Australian license to a Qatari one? When I do change over to a Qatar license do I actually lose my Australian license? I'd prefer to keep it if possible...

--->  question similar to Q (Task B)


Comment to Q1:

depends on the insurer, Qatar Insurance Company said this in email to me
“Thank you for your email! With regards to your query below, a foreigner is valid to drive in Doha with the following conditions: Foreign driver with his country valid driving license allowed driving only for one week from entry date Foreign driver with international valid driving license allowed driving for 6 months from entry date Foreign driver with GCC driving license allowed driving for 3 months from entry”
As an Aussie your driving licence should be transferable to a Qatar one with only the eyetest (temporary, then permanent once RP sorted).

---> good answer to Q1 (Task A)

---> good answer to Q (Task C, main task)


Q2: Hi there :D dose anyone knows how much would it cost to get a driving license !! although i have had it before in my country so practically i know how to drive. any HELP !?

---> not similar to Q, i.e., a negative example for Task B


Comment to Q2: Why no-more short course? let me know plz. if someone knows driving very well then he/she can get license after shore course ?

---> negative example for Q2 (Task A)

---> negative example for Q (Task C)


Note that instead of just classifying or finding a limited set of similar questions and correct answers, we will ask the participants to provide relevance rankings.



(Simplified) Arabic Example


New question:


Q: الى ماذا يمكن ان يدل الالم في الاسنان الذي يظهر فقط عند شرب او اكل شيئ ساخن او بارد وفي اوقات اخرى لا يوجد الم. ما هو العلاج 

(What does tooth pain mean when it appears only when drinking or eating something hot or cold, and is absent otherwise? What is the treatment?)


Retrieved questions:


Q1:عند شرب الماء البارد اشعر ببعض الالم في مجموعه من الاسنان فما هو السبب

(While drinking cold water I feel pain in some of my teeth, what is the reason?)

 A1: اسباب حساسية الاسنان يعود الى تاكل الطبقة الخارجية الحامية للاسنان و التي تسمى المينا .... و انكشف طبقة العاج الغنية بالمستقبلات الحسية .... و هناك عدة مسببات منها: التسوسات و النخور - انحسار اللثة بسبب الالتهابات اللثوية و اربطة السن- التفريش الخاطئ للاسنان - الاكثار من استخدام المنتجات المبيضة للاسنان - كثرة تناول الاطعمة و المشروبات الغازية الحمضية - انكسار الاسنان لا بد من الاسراع بمراجعة الطبيب حتى لا تتفاقم المشكلة

(The cause of tooth sensitivity is the erosion of the outer layer that protects the teeth ...)

Good Answer (label: Direct)


Q2: منذ شهر تقريبا خلعت طاحونة العقل واصبح هناك الم بالضرس المجاور لها واخبرتوني ان الجذر انكشف ولكني لا زلت اعاني عند شرب الماء البارد فالي متي سيبقي الجذر مكشوفا وهل هناك علاج مفيد لتلك الحالة
(1 month ago, I had my wisdom tooth removed, and since then I have had pain in the next tooth. You answered before that the nerve is exposed and it should take a while. I still have pain when I drink cold water. How long will this take? Is there a cure for this pain?)

A2:ان قلع ضرس العقل يترك خلفه حفره بجا نب الضرس المجاور وتظهر هذه الاعراض وخاصة عند شرب البارد والساخن واكل السكاكر والحامض لا يوجد وقت محدد ولكن عندما تلتئم الحفره تماما يزول الالم ولكن بامكانك استخدام معاحين اسنان للحساسيه والمضمضه بمضمضات تحتوى على الفلورايد مثل la calut sensetive mouth wash وفى اعتقادى ان المساله مسالة وقت

(Removing a tooth leaves the next one exposed, and these are the symptoms that you have. There is no exact time for it to heal, but you can use "la calut sensetive mouth wash", which can help)


Relevant Answer: (label: Relevant)

Q3: اعاني من شق في احد اسناني الامامية وكذلك لاحضت مؤخرا تكدس الطعام بين الاسنان عند الاكل بمعني يلزمني اني افرشي اسناني بعد اي اكل يدخل فمي فما السبب والعلاج 
(I have a crack in one of my teeth and food stuck there and I have to clean every time I eat. What can be the reason of this to happen?)

A3:ومن اسباب وجود الشقوق على طبقة المينا شرب السوائل البارده والساخنه في نفس الوقت.وكثرة تناول الحمضيات تضر في طبقة المينا.
(Usually drinking cold and hot water at the same time could be the reason. Also eating a lot of acidic food causes this)


Irrelevant Answer: (label: Irrelevant)


The above ranking would be optimal, although in the evaluation we will not take into account the position of Relevant with respect to Irrelevant questions. Thus the important thing is to put the Direct answer on top.


(Simplified) English Example from StackExchange (Task E)

New question:


Title: Why do bread companies add sugar to bread?

Body: I have a client who is on a sugar detox/diet. She can't eat any bread because all the bread companies added sugar. Why do bread companies add sugar to their breads?


Duplicate question:


Title: What is the purpose of sugar in baking plain bread?

Body: My recipe says 1 tablespoon of sugar per loaf. This seems like too small an amount for flavor.

The recipe is as follows:

    3 cups flour
    1 teaspoon salt
    2 teaspoons active-dry yeast
    1 tablespoon sugar
    2 tablespoons oil
    1 cup water
    knead, wait 1 hr, knead again, wait 1.25 hr, bake for 30min @ 350

Is this for flavor, or is there another purpose?


Non-duplicate question:


Title: Is it safe to eat potatoes that have sprouted?

Body: I'm talking about potatoes that have gone somewhat soft and put out shoots about 10cm long. Other online discussions suggest it's reasonably safe and the majority of us have been peeling and eating soft sprouty spuds for years. Is this correct?


For each question (both new questions and candidate questions) the following metadata fields are available:


- The time and date of posting (RELQ_DATE, e.g. "2015-09-28 06:00:29")

- The userid of the person who posted the question (RELQ_USERID, e.g. "39643")

- The score (the number of upvotes minus the number of downvotes the question has received) (RELQ_SCORE, e.g. "2")

- The tags that are associated with the question (RELQ_TAGS, e.g. "indian-cuisine, texture")

- The number of times the question has been viewed (RELQ_VIEWCOUNT, e.g. "82")

- The comments to the question (<RelComment>, e.g. "Are you trying to make it better next time, or salvage what you've already made?")

- The answers to the question (<RelAnswer>, e.g. "The main reason is yeast food. You may not actually need it if you're using instant yeast; either that, or you can bump it up a little for a slightly sweeter bread.")

- The name of the subforum the question came from (although this should be apparent from the file name too) (RELQ_CATEGORY, e.g. "cooking")


- RELQ_RANKING_ORDER is filled with a number indicating the order in which the candidate questions are presented. It has no meaning beyond this.

For candidate questions, the relevance information is stored in a field called RELQ_RELEVANCE2ORGQ. The possible values are 'PerfectMatch', 'Related', and 'Irrelevant'. Your system should output only the questions labelled as 'PerfectMatch'.
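
For instance, the gold duplicates of a query in the training data can be collected by filtering on this field. The sketch below assumes the candidate questions have already been loaded into dictionaries keyed by the field names above. The "id" key, the sample records, and the function name are hypothetical, and parsing the actual data files is left out.

```python
def gold_duplicates(candidates):
    """Return the ids of candidates whose gold label is 'PerfectMatch'.

    Each candidate is a dict; "id" is a hypothetical key standing in for
    whatever identifier the loaded data provides.
    """
    return [c["id"] for c in candidates
            if c["RELQ_RELEVANCE2ORGQ"] == "PerfectMatch"]


# Hypothetical candidate records for one query question.
candidates = [
    {"id": "cand_1", "RELQ_RANKING_ORDER": "1", "RELQ_RELEVANCE2ORGQ": "Irrelevant"},
    {"id": "cand_2", "RELQ_RANKING_ORDER": "2", "RELQ_RELEVANCE2ORGQ": "PerfectMatch"},
    {"id": "cand_3", "RELQ_RANKING_ORDER": "3", "RELQ_RELEVANCE2ORGQ": "Related"},
]
gold = gold_duplicates(candidates)
print(gold)
```

Candidates labelled 'Related' are excluded here on purpose, since only 'PerfectMatch' questions count as duplicates.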


For each answer similar metadata fields are available:


- The time and date of posting (RELA_DATE, e.g. "2015-09-29 10:44:51")

- The userid of the person who posted the answer (RELA_USERID, e.g. "37725")

- The score (the number of upvotes minus the number of downvotes the answer has received) (RELA_SCORE, e.g. "1")

- Whether or not the answer has been accepted as the best answer by the question asker (RELA_ACCEPTED, "1" for accepted, "0" for not accepted)

- The comments to the answer (<RelAComment>, e.g. "Probably not a preservative because molds adore sugar and will happily reproduce in a sugary environment, unless it is too sugary (in which case osmotic pressure will kill the mold). +1 for the rest.")


And for comments too there are some metadata fields available:


- The time and date of posting (RELC_DATE, e.g. "2015-09-29 11:57:05")

- The userid of the person who posted the comment (RELC_USERID, e.g. "17272")

- The score (the number of upvotes minus the number of downvotes the comment has received) (RELC_SCORE, e.g. "0")


A handful of metadata fields are there only to conform to the format of the data for subtasks A-D, but are never filled for subtask E. These are: RELQ_USERNAME, RELC_USERNAME, RELA_USERNAME, RELC_RELEVANCE2ORGQ, RELC_RELEVANCE2RELQ, RELA_RELEVANCE2ORGQ, RELA_RELEVANCE2RELQ, RELAC_RELEVANCE2ORGQ, and RELAC_RELEVANCE2RELQ.

More information on the metadata can be found in the README.txt file that comes with the data.

Contact Info


  • Preslav Nakov, Qatar Computing Research Institute, HBKU
  • Lluís Màrquez, Qatar Computing Research Institute, HBKU
  • Alessandro Moschitti, Qatar Computing Research Institute, HBKU
  • Hamdy Mubarak, Qatar Computing Research Institute, HBKU
  • Timothy Baldwin, The University of Melbourne
  • Doris Hoogeveen, The University of Melbourne
  • Karin Verspoor, The University of Melbourne

email : semeval-cqa@googlegroups.com

Other Info


  • 14 Feb. 2017: Submit your paper by February 27
  • 11 Feb. 2017: The results and all scores are released
  • 30 Jan. 2017: The closing date for test submissions is January 30th midnight UTC-12.
  • 24 Jan. 2017: Test set for subtask E is available now. (here)
  • 12 Jan. 2017: Test sets for subtasks A-D are available now. (data webpage)
  • 9 Jan. 2017: The release of test data for subtasks A-D is delayed by some days. Apologies for the inconvenience.
  • 5 Jan. 2017: Submission deadline is set to be January 30.
  • 5 Jan. 2017: New web page created with instructions on how to submit system results.
  • 8 Dec 2016: Separate competitions for the subtasks have been set up at CodaLab, where you can submit your results: Subtask A, Subtask B, Subtask C, Subtask D, and Subtask E. You can submit results both for the development set and the test set here, receive scores and choose what to publish on the leaderboard.
  • 8 Dec 2016: A new scorer is now available from the Data and Tools page, which can also be used for subtask E
  • Register to participate here