We present QMDIS, the QCRI-MIT Advanced Dialect Identification System, a live Arabic dialect identification system for speech. Our demo uses modern web technologies to capture live audio and broadcasts Arabic transcriptions together with the corresponding dialect simultaneously. The detected dialect is visualized on a light map, where the intensity of the color reflects the probability of the dialect. We also integrate meter bars that display the live probability of each dialect per sentence. Our demo is publicly available at dialectid.qcri.org.
The task of spoken dialect identification consists of classifying a given spoken utterance into one of the dialects of a particular language. Arabic Dialect Identification (ADI) is similar to the more general problem of Language Identification (LID), but it is more challenging because the differences between dialects of the same language are small and subtle. A good ADI system can be used to extract dialectal data from a speech database in order to train dialect-specific acoustic models for speech-to-text transcription. It can also be used for meta-data enrichment.
The QMDIS live demo continues our investigation of ADI. The Arabic language can be broadly divided into five major dialects: Egyptian (EGY), Gulf or Arabian Peninsula (GLF), Levantine (LAV), Modern Standard Arabic (MSA), and North African or Maghrebi (NOR). As argued in our publications, the differences between the Arabic dialects are sufficient for them to be treated as different languages, which makes the problem similar to LID. We make the same assumption in this demo.
Our best ADI system achieves an overall accuracy of 78% across the five dialects on the 2017 Multi-Genre Broadcast challenge (MGB-3) data. To achieve robust dialect identification, we explored Siamese neural network models to learn similarities and dissimilarities among Arabic dialects, as well as i-vector post-processing to adapt to domain mismatches. Both acoustic and linguistic features were used. For the live demo, however, we limit our system to the lexical features extracted from our Arabic speech recognition system. The system detects the spoken dialect using the most recent 10 words and aggregates the results with a sliding window every time the ASR system marks the text output as final, which typically happens after a silence of at least 500 ms. By using the most recent 10 words, our demo can handle code-switching between Dialectal Arabic (DA) and MSA, which happens often in Arabic speech, as argued in our QMDIS Interspeech 2017 paper.
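The sliding-window step can be illustrated with a minimal sketch. It is not the QMDIS implementation: the class `SlidingDialectTracker`, the method `on_final_segment`, the `toy_scorer` function, and the placeholder cue tokens are all hypothetical names introduced here, and the scorer merely stands in for the lexical classifier described above. The sketch assumes that each final ASR segment is appended to a buffer holding the most recent 10 words and that the whole buffer is rescored to produce one probability per dialect.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

DIALECTS = ("EGY", "GLF", "LAV", "MSA", "NOR")
WINDOW_SIZE = 10  # the demo uses the most recent 10 words

# Hypothetical cue tokens standing in for real lexical features.
TOY_CUES: Dict[str, set] = {
    "EGY": {"egy_word"},
    "GLF": {"glf_word"},
    "LAV": {"lav_word"},
    "MSA": {"msa_word"},
    "NOR": {"nor_word"},
}


def toy_scorer(words: List[str]) -> Dict[str, float]:
    """Placeholder classifier: add-one smoothed count of cue tokens per dialect."""
    return {d: 1.0 + sum(w in TOY_CUES[d] for w in words) for d in DIALECTS}


class SlidingDialectTracker:
    """Keeps the most recent words and rescores them on every final segment."""

    def __init__(self,
                 scorer: Callable[[List[str]], Dict[str, float]] = toy_scorer,
                 window_size: int = WINDOW_SIZE) -> None:
        self._scorer = scorer
        # A bounded deque drops the oldest words automatically.
        self._words: Deque[str] = deque(maxlen=window_size)

    @property
    def window(self) -> List[str]:
        return list(self._words)

    def on_final_segment(self, text: str) -> Dict[str, float]:
        """Call when the ASR marks a segment as final; returns dialect probabilities."""
        self._words.extend(text.split())
        scores = self._scorer(self.window)
        total = sum(scores.values())
        if total <= 0:
            # Fall back to a uniform distribution if the scorer gives no evidence.
            return {d: 1.0 / len(DIALECTS) for d in DIALECTS}
        return {d: scores.get(d, 0.0) / total for d in DIALECTS}


if __name__ == "__main__":
    tracker = SlidingDialectTracker()
    probs = tracker.on_final_segment("msa_word msa_word egy_word")
    print(max(probs, key=probs.get), probs)
```

Because the buffer is bounded, a speaker who switches from MSA to a dialect changes the prediction after at most ten new words, which is the behavior the short window is meant to provide for code-switched speech.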
- Eldesouki, Mohamed, Suwon Shon, and Ahmed Ali. “QCRI-MIT Live Arabic Dialect Identification System.” ICASSP, Calgary, Canada, 2018. [Demo paper]
- Shon, Suwon, Ahmed Ali, and James Glass. “Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition.” arXiv preprint arXiv:1803.04567 (2018).
- Shon, Suwon, Ahmed Ali, and James Glass. “MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge.” Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE. IEEE, 2017.