QCRI Arabic Dialects Identification (QADI) Corpus

Abstract

QCRI Arabic Dialects Identification (QADI) is a country-level Arabic dialect identification (DI) dataset. It provides a collection for benchmarking the DI task. The training split contains 540,590 tweets from 18 Arab countries, and the test split contains 3,303. The data is distributed as follows:

Country*	Train	Test
AE	27,819	192
BH	28,295	184
DZ	17,603	170
EG	67,783	200
IQ	18,367	178
JO	34,109	180
KW	49,963	190
LB	38,386	194
LY	40,883	169
MA	12,813	178
OM	24,786	169
PL	48,641	173
QA	36,675	198
SA	35,396	199
SD	16,251	188
SY	18,317	194
TN	12,940	154
YE	11,563	193

* Countries are identified by ISO 3166-1 alpha-2 codes, except that PL denotes Palestine here (its ISO code is PS).
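The table's totals and class balance can be checked with a few lines of Python. The sketch below copies the counts from the table; the names COUNTS, split_totals, and train_shares are illustrative and not part of the corpus release.

```python
# Train/test tweet counts per country code, copied from the table above.
COUNTS = {
    "AE": (27819, 192), "BH": (28295, 184), "DZ": (17603, 170),
    "EG": (67783, 200), "IQ": (18367, 178), "JO": (34109, 180),
    "KW": (49963, 190), "LB": (38386, 194), "LY": (40883, 169),
    "MA": (12813, 178), "OM": (24786, 169), "PL": (48641, 173),
    "QA": (36675, 198), "SA": (35396, 199), "SD": (16251, 188),
    "SY": (18317, 194), "TN": (12940, 154), "YE": (11563, 193),
}


def split_totals(counts):
    """Return (train_total, test_total) summed over all countries."""
    train = sum(tr for tr, _ in counts.values())
    test = sum(te for _, te in counts.values())
    return train, test


def train_shares(counts):
    """Return each country's share of the training split, largest first."""
    train_total, _ = split_totals(counts)
    shares = {code: tr / train_total for code, (tr, _) in counts.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    print(split_totals(COUNTS))
    for code, share in train_shares(COUNTS).items():
        print(f"{code}\t{share:.1%}")
```

The training split is noticeably imbalanced: EG accounts for roughly 12.5% of training tweets and YE for roughly 2.1%, while the test split is close to uniform across countries.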

Related publications

A. Abdelali, H. Mubarak, Y. Samih, S. Hassan, and K. Darwish, “Arabic dialect identification in the wild.” 2020.

@inproceedings{abdelali2020arabic,
title={Arabic Dialect Identification in the Wild},
author={Ahmed Abdelali and Hamdy Mubarak and Younes Samih and Sabit Hassan and Kareem Darwish},
year={2020},
eprint={2005.06557},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

Download

To obtain the data, clone the repository or download its contents. The dataset files contain the IDs of all tweets identified as coming from the designated country. You may use twarc or another Twitter hydration tool to retrieve the tweets from these IDs.

twarc hydrate dataset/tweetsCountryID.txt > tweets.jsonl

The tweets are organized by country: each file lists the IDs of the tweets from one country.
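Because the country label is carried by the file rather than by each ID, it can help to pair every ID with its label before hydration. The following is a minimal sketch under several assumptions that the release does not confirm: the ID files are plain-text .txt files in one directory, each holds one tweet ID per line, and the file name (without extension) is used as the label. The function names load_tweet_ids and labelled_pairs are hypothetical.

```python
from pathlib import Path


def load_tweet_ids(dataset_dir):
    """Map each ID file's stem to the list of tweet IDs it contains.

    Assumes one tweet ID per line; blank lines are skipped.
    """
    ids_by_file = {}
    for path in sorted(Path(dataset_dir).glob("*.txt")):
        with path.open(encoding="utf-8") as handle:
            ids_by_file[path.stem] = [
                line.strip() for line in handle if line.strip()
            ]
    return ids_by_file


def labelled_pairs(ids_by_file):
    """Yield (tweet_id, label) pairs, using the file stem as the label."""
    for label, ids in ids_by_file.items():
        for tweet_id in ids:
            yield tweet_id, label


if __name__ == "__main__":
    # "dataset" is an assumed directory name; adjust to the actual layout.
    pairs = list(labelled_pairs(load_tweet_ids("dataset")))
    print(f"{len(pairs)} labelled tweet IDs")
```

IDs are kept as strings so that large values are never altered by numeric conversion. After hydration, the same mapping can be used to attach a country label to each retrieved tweet.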

The QADI Corpus archive is password protected. Password: QADIQCRI

License

The QCRI Arabic Dialects Identification (QADI) Corpus (Copyright 2020, QCRI, a member of Qatar Foundation. All Rights Reserved) is licensed under the Apache License, Version 2.0 (the "License"); you may not use it except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.