Annotated Al Jazeera Dialectal Speech Corpus

Abstract

This speech corpus contains dialect-level labels for 57 hours of dialectal Arabic speech (Egyptian, Levantine, North African, and Gulf) from Al Jazeera between June 2014 and January 2015, together with the confidence levels on which those labels are based. The corpus also contains 94 hours of dialectal Arabic speech that was labeled automatically by linking speaker information from the human-labeled set.

Related publications

S. Wray and A. Ali, “Crowdsource a little to label a lot: Labeling a Speech Corpus of Dialectal Arabic,” in Interspeech, 2015.

@InProceedings{wrayaliclassify,
author = {Samantha Wray and Ahmed Ali},
title = {{Crowdsource a little to label a lot: Labeling a Speech Corpus of Dialectal Arabic}},
booktitle = {Interspeech},
year = {2015},
note = {(in press)}
}

Download

Annotated Al Jazeera Dialectal Speech Corpus (human-labeled subset). Released on 6 July 2015.
Annotated Al Jazeera Dialectal Speech Corpus (automatically-labeled expanded set). Released on 6 July 2015.