Translating into Arabic is tricky because Arabic spelling is often inconsistent with respect to punctuation (both Arabic UTF-8 and English punctuation symbols are used), digits (written with both Arabic and Arabic-Indic/Hindi numerals), diacritics (which can be used, omitted, or simply wrong), and spelling proper (some Arabic characters, especially Alef and Ta Marbuta, are frequently misspelled, and the conjunction Waa sometimes appears detached from the following word).
These problems are especially frequent in informal texts such as TED talks. Worse, they also occur in the references for the tuning and testing sets (in addition to the training data).
Since these variations are fairly random and depend on the style of the author of each piece of text, it makes little sense for a translation system to try to model them; yet, they can affect the evaluation scores considerably.
Thus, below we provide a normalization script that addresses these issues in the Arabic references and in the system output. The script is intended for the final evaluation step, i.e., one can train a model in any way, but should then run this script on both the references and the system output before evaluation.
If one also wants to optimize towards properly normalized Arabic, the script can further be used to normalize the tuning and the training data.
This script was initially designed to be used for IWSLT'2013.
This script normalizes Arabic to make it consistent for the purpose of machine translation (MT) evaluation. It is to be run on both (i) the output of an MT system and (ii) the Arabic reference translation. It expects non-tokenized Arabic input: it first normalizes it and then tokenizes it using the Europarl tokenizer (included in the package). The result is to be scored with a tool that performs no further tokenization (e.g., MultEval); the NIST scoring tool is NOT appropriate.
The script first reattaches the conjunction Waa when it appears detached (an illustrative case: "و قال" becomes "وقال"). It then uses MADA to normalize the remaining issues described above: punctuation, digits, diacritics, and the spelling of characters such as Alef and Ta Marbuta.
NOTE 1: Any English text in the middle of the Arabic text is left intact.
NOTE 2: Like any tool, the script occasionally makes mistakes. However, these are (i) rare and (ii) fairly consistent, and thus do not pose a problem for evaluation.
NOTE 3: We use MADA 3.2 with the dictionary of either SAMA 3.1 or Aramorph 1.2.1.
In order to run the tool, you need to have installed MADA (we use version 3.2), the SRI Language Modeling Toolkit (SRILM), SVMTool, and either SAMA (we use v. 3.1) or Aramorph (we use v. 1.2.1).
You then need to set the paths to these installations in madaconfig.enr (for SAMA) or in madaconfig-aramorph1.2.1.enr (for Aramorph):
MADA_HOME = ??
SRI_NGRAM_TOOL = ??
SVM_TAGGER = ??
ALMOR_DATABASE = ??
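For illustration, a filled-in configuration might look like the following (the paths are hypothetical placeholders; point them to your local installations — SRI_NGRAM_TOOL typically points to SRILM's disambig tool and SVM_TAGGER to SVMTool's SVMTagger script, but check your MADA documentation):

MADA_HOME = /path/to/MADA-3.2
SRI_NGRAM_TOOL = /path/to/srilm/bin/disambig
SVM_TAGGER = /path/to/SVMTool/bin/SVMTagger
ALMOR_DATABASE = /path/to/almor-database.db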
You also need to edit the path to MADA+TOKAN in qcri_normalizer_mada3.2_sama3.1.sh and in qcri_normalizer_mada3.2_aramorph1.2.1.sh.
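For example, the relevant line in each script might look something like this (the variable name and path are illustrative only; set it to wherever the MADA+TOKAN driver script lives in your MADA installation):

MADA_TOKAN_PATH=/path/to/MADA-3.2/MADA+TOKAN.pl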
MADA+SAMA:
./qcri_normalizer_mada3.2_sama3.1.sh <DETOK_ARABIC>
The result will be generated in a file named <DETOK_ARABIC>.mada_norm.europarl_tok
MADA+Aramorph:
./qcri_normalizer_mada3.2_aramorph1.2.1.sh <DETOK_ARABIC>
The result will be generated in a file named <DETOK_ARABIC>.mada_norm-aramorph.europarl_tok
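For instance, here is a minimal sketch of the full evaluation workflow with the SAMA-based script, assuming hypothetical detokenized files tst2013.ar (system output) and ref2013.ar (reference):

./qcri_normalizer_mada3.2_sama3.1.sh tst2013.ar
./qcri_normalizer_mada3.2_sama3.1.sh ref2013.ar
./multi-bleu.perl ref2013.ar.mada_norm.europarl_tok < tst2013.ar.mada_norm.europarl_tok

(The file names are illustrative; see the worked example on the example/ files below.)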
The normalization script makes quite a difference in terms of BLEU score. For example, on the baseline output for IWSLT tst2010, it yields a difference of about 2.3 BLEU points (see also the example/ directory). However, the difference between MADA+SAMA and MADA+Aramorph is negligible.
Compare:
$ ./multi-bleu.perl example/ref.europarl_tok < example/hyp.europarl_tok
BLEU = 9.61, 37.7/13.7/6.0/2.8 (BP=1.000, ratio=1.013, hyp_len=25073, ref_len=24755)
TO MADA+SAMA:
$ ./qcri_normalizer_mada3.2_sama3.1.sh example/hyp
$ ./qcri_normalizer_mada3.2_sama3.1.sh example/ref
$ ./multi-bleu.perl example/ref.mada_norm.europarl_tok < example/hyp.mada_norm.europarl_tok
BLEU = 11.93, 41.8/16.4/7.7/3.9 (BP=0.996, ratio=0.996, hyp_len=24562, ref_len=24671)
TO MADA+Aramorph:
$ ./qcri_normalizer_mada3.2_aramorph1.2.1.sh example/hyp
$ ./qcri_normalizer_mada3.2_aramorph1.2.1.sh example/ref
$ ./multi-bleu.perl example/ref.mada_norm-aramorph.europarl_tok < example/hyp.mada_norm-aramorph.europarl_tok
BLEU = 11.89, 41.8/16.4/7.7/3.9 (BP=0.996, ratio=0.996, hyp_len=24562, ref_len=24671)
Copyright Qatar Computing Research Institute. All rights reserved.