HBKU - QCRI
QCRI Arabic Dialects Identification (QADI) Corpus

Abstract

 QCRI Arabic Dialects Identification (QADI) is a country-level Arabic dialect identification (DI) dataset. It provides a collection for benchmarking the DI task. The dataset contains 540,590 training tweets from 18 Arab countries, plus a 3,303-tweet test set. The data is distributed according to the following table:

Country* Train Test
AE 27,819 192
BH 28,295 184
DZ 17,603 170
EG 67,783 200
IQ 18,367 178
JO 34,109 180
KW 49,963 190
LB 38,386 194
LY 40,883 169
MA 12,813 178
OM 24,786 169
PL 48,641 173
QA 36,675 198
SA 35,396 199
SD 16,251 188
SY 18,317 194
TN 12,940 154
YE 11,563 193

* Countries are identified by two-letter ISO 3166-1 codes, except that PL denotes Palestine here (its ISO 3166-1 code is PS).
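
Because the per-country counts differ widely, it can help to check the class balance before benchmarking. The following is a minimal sketch that only re-uses the numbers from the table above; the names COUNTS and summarize are illustrative and not part of the corpus distribution.

```python
# (train, test) tweet counts per country code, copied from the table above.
COUNTS = {
    "AE": (27819, 192), "BH": (28295, 184), "DZ": (17603, 170),
    "EG": (67783, 200), "IQ": (18367, 178), "JO": (34109, 180),
    "KW": (49963, 190), "LB": (38386, 194), "LY": (40883, 169),
    "MA": (12813, 178), "OM": (24786, 169), "PL": (48641, 173),
    "QA": (36675, 198), "SA": (35396, 199), "SD": (16251, 188),
    "SY": (18317, 194), "TN": (12940, 154), "YE": (11563, 193),
}


def summarize(counts):
    """Return total train size, total test size, and each country's train share."""
    total_train = sum(train for train, _ in counts.values())
    total_test = sum(test for _, test in counts.values())
    shares = {code: train / total_train for code, (train, _) in counts.items()}
    return total_train, total_test, shares


total_train, total_test, shares = summarize(COUNTS)
largest = max(shares, key=shares.get)    # country with the most training tweets
smallest = min(shares, key=shares.get)   # country with the fewest training tweets

print(f"train={total_train} test={total_test}")
print(f"largest class: {largest} ({shares[largest]:.1%})")
print(f"smallest class: {smallest} ({shares[smallest]:.1%})")
```

The training split is noticeably imbalanced while the test split is close to uniform, so a per-class metric such as macro-averaged F1 may be more informative than plain accuracy.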

Related publications

  • A. Abdelali, H. Mubarak, Y. Samih, S. Hassan, and K. Darwish, “Arabic dialect identification in the wild.” 2020.
    [BibTeX]
    @inproceedings{abdelali2020arabic,
    title={Arabic Dialect Identification in the Wild},
    author={Ahmed Abdelali and Hamdy Mubarak and Younes Samih and Sabit Hassan and Kareem Darwish},
    year={2020},
    eprint={2005.06557},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
    }
Download
 
To obtain the data, first clone the repository or download its contents. The dataset files contain the IDs of all tweets identified as coming from the designated country. You may use twarc or another Twitter scraping tool to hydrate the tweets:
        twarc hydrate dataset/tweetsCountryID.txt > tweetsCountryID.jsonl
The tweets are arranged per country. Each file is a list of ids for the tweets from the designated country.
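
Since each file holds the tweet IDs for one country, the files can be turned into labeled (tweet ID, country) pairs before or after hydration. The sketch below is a hedged illustration, not part of the corpus tooling: it assumes, hypothetically, that every ID file sits in one directory, holds one ID per line, and that the file name without its extension is the country label. Adjust it to the actual file naming in the repository.

```python
from pathlib import Path


def load_labeled_ids(dataset_dir, suffix=".txt"):
    """Collect (tweet_id, label) pairs from per-country ID files.

    Assumption (hypothetical layout): one ID per line, and the file stem
    (file name without extension) is used as the country label.
    """
    pairs = []
    for path in sorted(Path(dataset_dir).glob(f"*{suffix}")):
        label = path.stem
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                tweet_id = line.strip()
                if tweet_id:  # skip blank lines
                    pairs.append((tweet_id, label))
    return pairs
```

Calling load_labeled_ids("dataset") would then return one pair per ID, which can be joined with the hydrated tweets on the ID field to attach the country label to each tweet text.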
The QADI Corpus archive is password protected. The password is: QADIQCRI
 
License
 
The QCRI Arabic Dialects Identification (QADI) Corpus (Copyright 2020, QCRI, a member of Qatar Foundation. All Rights Reserved) is licensed under the Apache License, Version 2.0 (the "License"); you may not use it except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
