Annotated Al Jazeera Dialectal Speech Corpus
Abstract
This speech corpus contains dialect-level labels for 57 hours of dialectal Arabic speech (Egyptian, Levantine, North African, and Gulf) broadcast on Al Jazeera between June 2014 and January 2015, along with the confidence levels associated with those labels. The corpus also contains 94 hours of dialectal Arabic speech that were labeled automatically by linking speaker information from the human-labeled set.
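The speaker-linking idea behind the automatically labeled portion can be illustrated with a minimal sketch. It is not the authors' documented pipeline. It assumes each segment carries a speaker identifier and that a speaker's dialect is taken as the majority of that speaker's human labels. The function name `propagate_labels`, the segment and speaker identifiers, and the dialect codes (`EGY`, `LEV`, `GLF`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def propagate_labels(labeled, unlabeled):
    """Propagate dialect labels to unlabeled segments via shared speakers.

    labeled:   iterable of (speaker_id, dialect) pairs from human annotation.
    unlabeled: iterable of (segment_id, speaker_id) pairs without labels.
    Returns a dict mapping segment_id to dialect for every unlabeled segment
    whose speaker also appears in the human-labeled set.
    """
    # Count how often each speaker received each dialect label.
    votes = defaultdict(Counter)
    for speaker, dialect in labeled:
        votes[speaker][dialect] += 1

    # Assumption: a speaker's dialect is the most frequent human label.
    speaker_dialect = {
        speaker: counts.most_common(1)[0][0]
        for speaker, counts in votes.items()
    }

    # Segments from speakers never seen in the labeled set stay unlabeled.
    return {
        segment: speaker_dialect[speaker]
        for segment, speaker in unlabeled
        if speaker in speaker_dialect
    }


# Hypothetical example data.
human_labels = [("spk1", "EGY"), ("spk1", "EGY"), ("spk1", "LEV"), ("spk2", "GLF")]
new_segments = [("seg_a", "spk1"), ("seg_b", "spk2"), ("seg_c", "spk3")]
auto_labels = propagate_labels(human_labels, new_segments)
print(auto_labels)
```

In this sketch, `seg_c` receives no label because its speaker never appears in the human-labeled data, so a small annotated set can only extend to segments from known speakers.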
Related publications
- S. Wray and A. Ali, “Crowdsource a little to label a lot: Labeling a Speech Corpus of Dialectal Arabic,” in Interspeech, 2015.
@InProceedings{wrayaliclassify,
  author    = {Samantha Wray and Ahmed Ali},
  title     = {{Crowdsource a little to label a lot: Labeling a Speech Corpus of Dialectal Arabic}},
  booktitle = {Interspeech},
  year      = {2015},
  note      = {(in press)}
}