This is the README for building an Arabic ASR system using the GALE database from LDC and the Kaldi Speech Recognition Toolkit.
The QCRI scripts build and test an Arabic broadcast news ASR system.
The test set is a mix of conversational and report speech.

About the GALE Phase 2 Arabic Broadcast Conversation:

LDC2013S02: http://catalog.ldc.upenn.edu/LDC2013S02
LDC2013S07: http://catalog.ldc.upenn.edu/LDC2013S07
LDC2013T17: http://catalog.ldc.upenn.edu/LDC2013T17
LDC2013T04: http://catalog.ldc.upenn.edu/LDC2013T04


GALE Phase 2 Arabic Broadcast Conversation Speech was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 200 hours of Arabic broadcast conversation speech collected in 2006 and 2007 by LDC as part of the DARPA GALE (Global Autonomous Language Exploitation) Program.

The data contains two types of speech: conversational and report. The scripts train and test on both types, and results are reported for each.

The dictionary and scripts can be obtained from the QCRI portal: http://alt.qcri.org/


s5: The experiments here are based on the above corpus

### 
1- Install and compile Kaldi: http://kaldi.sourceforge.net/
2- Untar gale_recipe.tar into the egs folder:
     tar xvf gale_recipe.tar -C kaldi-trunk/egs
3- Modify run.sh:
	a- Adjust the number of jobs to suit your setup (default: nJobs=120).
	   The right value also depends on whether you use a queue or a local machine; see cmd.sh
	b- Change the data settings to point to your copy of the GALE database:
	    example: LDC2013S02_1=/alt/data/speech/LDC/LDC2013S02/gale_p2_arb_bc_speech_p1_d1
4- Start run.sh 
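
Before step 4, it can help to confirm that every corpus path set in step 3b exists, since a wrong path may otherwise only surface after run.sh has started. The following is a minimal sketch; check_corpus_dirs is a hypothetical helper and is not part of the recipe.

```shell
# Hypothetical pre-flight helper (an assumption, not shipped with the recipe).
# Reports every argument that is not an existing directory on stderr and
# returns non-zero if at least one is missing.
check_corpus_dirs() {
  missing=0
  for dir in "$@"; do
    if [ ! -d "$dir" ]; then
      echo "missing corpus directory: $dir" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use with the path from step 3b (adjust to your own layout):
# check_corpus_dirs /alt/data/speech/LDC/LDC2013S02/gale_p2_arb_bc_speech_p1_d1 && ./run.sh
```

Pass one argument per corpus directory you configured in run.sh; the helper only tests that the directories exist and does not validate their contents.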

The script builds GMM, GMM+MPE, GMM+bMMI, SGMM+fMLLR, and SGMM+bMMI systems.
The RESULTS file lists the WERs obtained when the script is used with the QCRI pronunciation dictionary.
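
To compare your own run against those numbers, a small helper can pick the lowest WER out of a results file. This is only a sketch: best_wer is a hypothetical name, and it assumes each result line starts with the "%WER <value>" prefix that Kaldi's scoring tools print. If your file uses a different layout, the pattern needs adjusting.

```shell
# Hypothetical helper (an assumption, not part of the recipe).
# Prints the line with the lowest WER from a file whose result lines
# start with "%WER <value>"; all other lines are ignored.
best_wer() {
  # LC_ALL=C keeps "." as the decimal separator for the numeric sort.
  grep '^%WER' "$1" | LC_ALL=C sort -n -k2,2 | head -n 1
}

# Example use:
# best_wer RESULTS
```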