Translating into Arabic is tricky because Arabic spelling is often inconsistent with respect to punctuation (both Arabic UTF-8 and English punctuation symbols are used), digits (written with both Arabic and Arabic-Indic/Hindi numerals), diacritics (which can be used, omitted, or simply wrong), and spelling proper (some Arabic characters, especially Alef and Ta Marbuta, are frequently misspelled, and the conjunction Waa sometimes appears detached from the following word).
These problems are especially frequent in informal texts such as TED talks. Worse, they also occur in the references for the tuning and testing sets (in addition to the training data).
Since these variations are fairly random and depend on the style of the author of each piece of text, it makes little sense for a translation system to try to model them; yet, they can affect the evaluation scores considerably.
Thus, below we provide a normalization script that addresses these issues in the Arabic references and in the system output. The script is intended for the final evaluation step, i.e., one can train a model in any way, but should then run this script on both the references and the system output before evaluation.
If one also wants to optimize towards properly normalized Arabic, the script can further be used to normalize the tuning and the training data.
This script was initially designed to be used for IWSLT'2013.
This script normalizes Arabic to make it consistent for the purpose of machine translation (MT) evaluation. It is to be run on both (i) the output of an MT system and (ii) the Arabic reference translation. It expects non-tokenized Arabic input: it first normalizes it and then tokenizes it using the Europarl tokenizer (included in the package). The result is to be scored with a tool that performs no further tokenization (e.g., MultEval); the NIST scoring tool is NOT appropriate.
The script first reattaches the conjunction Waa when it appears detached (an illustrative case: "و قال" becomes "وقال"). It then uses MADA to normalize the remaining issues described above: punctuation, digits, diacritics, and the spelling of characters such as Alef and Ta Marbuta.
NOTE 1: Any English text in the middle of the Arabic text is left intact.
NOTE 2: Like any tool, the script occasionally makes mistakes. However, these are (i) rare and (ii) fairly consistent, and thus do not pose a problem for evaluation.
NOTE 3: We use MADA 3.2 with the dictionary of either SAMA 3.1 or Aramorph 1.2.1.
In order to run the tool, you need to have installed MADA (we use version 3.2), the SRI Language Modeling Toolkit (SRILM), SVMTool, and either SAMA (we use v. 3.1) or Aramorph (we use v. 1.2.1).
You then need to set the paths to these installations in madaconfig.enr (for SAMA) or in madaconfig-aramorph1.2.1.enr (for Aramorph):
MADA_HOME = ??
SRI_NGRAM_TOOL = ??
SVM_TAGGER = ??
ALMOR_DATABASE = ??
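For illustration, a filled-in configuration might look like the following (the paths are hypothetical placeholders; point them to your local installations — SRI_NGRAM_TOOL typically points to SRILM's disambig tool and SVM_TAGGER to SVMTool's SVMTagger script, but check your MADA documentation):

MADA_HOME = /path/to/MADA-3.2
SRI_NGRAM_TOOL = /path/to/srilm/bin/disambig
SVM_TAGGER = /path/to/SVMTool/bin/SVMTagger
ALMOR_DATABASE = /path/to/almor-database.db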
You also need to edit the path to MADA+TOKAN in qcri_normalizer_mada3.2_sama3.1.sh and in qcri_normalizer_mada3.2_aramorph1.2.1.sh.
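For example, the relevant line in each script might look something like this (the variable name and path are illustrative only; set it to wherever the MADA+TOKAN driver script lives in your MADA installation):

MADA_TOKAN_PATH=/path/to/MADA-3.2/MADA+TOKAN.pl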
MADA+SAMA:
./qcri_normalizer_mada3.2_sama3.1.sh <DETOK_ARABIC>
The result will be generated in a file named <DETOK_ARABIC>.mada_norm.europarl_tok
MADA+Aramorph:
./qcri_normalizer_mada3.2_aramorph1.2.1.sh <DETOK_ARABIC>
The result will be generated in a file named <DETOK_ARABIC>.mada_norm-aramorph.europarl_tok
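For instance, here is a minimal sketch of the full evaluation workflow with the SAMA-based script, assuming hypothetical detokenized files tst2013.ar (system output) and ref2013.ar (reference):

./qcri_normalizer_mada3.2_sama3.1.sh tst2013.ar
./qcri_normalizer_mada3.2_sama3.1.sh ref2013.ar
./multi-bleu.perl ref2013.ar.mada_norm.europarl_tok < tst2013.ar.mada_norm.europarl_tok

(The file names are illustrative; see the worked example on the example/ files below.)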
The normalization script makes quite a difference in terms of BLEU score. For example, on the baseline output for IWSLT tst2010, it yields a difference of about 2.3 BLEU points (see also the example/ directory). However, the difference between MADA+SAMA and MADA+Aramorph is negligible.
Compare:
$ ./multi-bleu.perl example/ref.europarl_tok < example/hyp.europarl_tok
BLEU = 9.61, 37.7/13.7/6.0/2.8 (BP=1.000, ratio=1.013, hyp_len=25073, ref_len=24755)
TO MADA+SAMA:
$ ./qcri_normalizer_mada3.2_sama3.1.sh example/hyp
$ ./qcri_normalizer_mada3.2_sama3.1.sh example/ref
$ ./multi-bleu.perl example/ref.mada_norm.europarl_tok < example/hyp.mada_norm.europarl_tok
BLEU = 11.93, 41.8/16.4/7.7/3.9 (BP=0.996, ratio=0.996, hyp_len=24562, ref_len=24671)
TO MADA+Aramorph:
$ ./qcri_normalizer_mada3.2_aramorph1.2.1.sh example/hyp
$ ./qcri_normalizer_mada3.2_aramorph1.2.1.sh example/ref
$ ./multi-bleu.perl example/ref.mada_norm-aramorph.europarl_tok < example/hyp.mada_norm-aramorph.europarl_tok
BLEU = 11.89, 41.8/16.4/7.7/3.9 (BP=0.996, ratio=0.996, hyp_len=24562, ref_len=24671)
Copyright Qatar Computing Research Institute. All rights reserved.