QCRI Arabic Dialects Identification (QADI) Corpus
Abstract
QCRI Arabic Dialects Identification (QADI) is a country-level Arabic dialect identification (DI) dataset that provides a collection for benchmarking the DI task. The training set contains 540,590 tweets from 18 Arab countries, distributed as shown in the following table:
Country* | Train | Test |
---|---|---|
AE | 27,819 | 192 |
BH | 28,295 | 184 |
DZ | 17,603 | 170 |
EG | 67,783 | 200 |
IQ | 18,367 | 178 |
JO | 34,109 | 180 |
KW | 49,963 | 190 |
LB | 38,386 | 194 |
LY | 40,883 | 169 |
MA | 12,813 | 178 |
OM | 24,786 | 169 |
PL | 48,641 | 173 |
QA | 36,675 | 198 |
SA | 35,396 | 199 |
SD | 16,251 | 188 |
SY | 18,317 | 194 |
TN | 12,940 | 154 |
YE | 11,563 | 193 |
* Country names are provided using ISO-3166-1 codes.
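The table shows a noticeable class imbalance, which matters when choosing evaluation metrics. The following Python sketch copies the counts from the table and computes the totals and each country's share of the training data. The names `COUNTS` and `summarize` are illustrative and are not part of the QADI release.

```python
# (train, test) tweet counts per country label, copied from the table above.
COUNTS = {
    "AE": (27819, 192), "BH": (28295, 184), "DZ": (17603, 170),
    "EG": (67783, 200), "IQ": (18367, 178), "JO": (34109, 180),
    "KW": (49963, 190), "LB": (38386, 194), "LY": (40883, 169),
    "MA": (12813, 178), "OM": (24786, 169), "PL": (48641, 173),
    "QA": (36675, 198), "SA": (35396, 199), "SD": (16251, 188),
    "SY": (18317, 194), "TN": (12940, 154), "YE": (11563, 193),
}


def summarize(counts):
    """Return (train_total, test_total, train_share_per_country)."""
    train_total = sum(train for train, _ in counts.values())
    test_total = sum(test for _, test in counts.values())
    shares = {
        country: train / train_total
        for country, (train, _) in counts.items()
    }
    return train_total, test_total, shares


train_total, test_total, shares = summarize(COUNTS)
print(f"train={train_total:,} test={test_total:,}")

# List countries from the largest to the smallest share of training tweets.
for country, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{country}: {share:.1%}")
```

The training split is dominated by a few countries (EG is the largest, YE the smallest), while the test split is roughly balanced, so a macro-averaged metric may be more informative than plain accuracy on the training distribution.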
Related publications
- A. Abdelali, H. Mubarak, Y. Samih, S. Hassan, and K. Darwish, “Arabic dialect identification in the wild.” 2020.
BibTeX:
@inproceedings{abdelali2020arabic,
  title={Arabic Dialect Identification in the Wild},
  author={Ahmed Abdelali and Hamdy Mubarak and Younes Samih and Sabit Hassan and Kareem Darwish},
  year={2020},
  eprint={2005.06557},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Download
To obtain the data, clone the repository or download its contents. The dataset files contain the IDs of all tweets identified as originating from the designated country. You can use twarc or another Twitter scraping tool to hydrate the tweets:
twarc hydrate dataset/tweetsCountryID.txt
The tweets are arranged per country: each file lists the IDs of the tweets from one country.
The QADI Corpus download is password protected. Password: QADIQCRI
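If you hydrate the tweets with your own tooling instead of the twarc command line, the per-country ID files described above first need to be read and split into request-sized batches. The sketch below is one way to do that in Python. It assumes each file holds one numeric tweet ID per line, and the default batch size of 100 is an assumption about the lookup endpoint's limit that you should check against the API you use. The function names `read_tweet_ids` and `batched` are illustrative and not part of the QADI release.

```python
from itertools import islice


def read_tweet_ids(path):
    """Yield unique tweet IDs from a file, assuming one numeric ID per line."""
    seen = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tweet_id = line.strip()
            # Skip blank lines, non-numeric lines, and repeated IDs.
            if not tweet_id.isdigit() or tweet_id in seen:
                continue
            seen.add(tweet_id)
            yield tweet_id


def batched(ids, size=100):
    """Group an iterable of IDs into lists of at most `size` items."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each list produced by `batched(read_tweet_ids(path))` can then be passed to whichever hydration client you use. Tweets that have been deleted or made private since collection will not be returned, so the hydrated set may be smaller than the ID list.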
License
The QCRI Arabic Dialects Identification (QADI) Corpus (Copyright 2020, QCRI, a member of Qatar Foundation. All Rights Reserved) is licensed under the Apache License, Version 2.0 (the "License"); you may not use it except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.