HBKU - QCRI
The QCRI Educational Domain (QED) Corpus.

Abstract
 
The QCRI Educational Domain Corpus (formerly QCRI AMARA Corpus) is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated on the AMARA web-based platform. The current release, v1.4, covers 20 languages distributed over 44,620 files.
 
Related publications

  • A. Abdelali, F. Guzman, H. Sajjad, and S. Vogel, “The AMARA Corpus: Building Parallel Language Resources for the Educational Domain,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 26-31, 2014.
    [BibTeX]
    @InProceedings{ABDELALI14.877,
    author = {Ahmed Abdelali and Francisco Guzman and Hassan Sajjad and Stephan Vogel},
    title = {The AMARA Corpus: Building Parallel Language Resources for the Educational Domain},
    booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
    year = {2014},
    month = {may},
    date = {26-31},
    address = {Reykjavik, Iceland},
    editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
    publisher = {European Language Resources Association (ELRA)},
    isbn = {978-2-9517408-8-4},
    language = {english}
    }
  • F. Guzman, H. Sajjad, A. Abdelali, and S. Vogel, “The AMARA Corpus: Building Resources for Translating the Web’s Educational Content,” in Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2013.
    [BibTeX]
    @inproceedings{guzman2013amara,
    title={The AMARA corpus: Building resources for translating the web's educational content},
    author={Guzman, Francisco and Sajjad, Hassan and Abdelali, Ahmed and Vogel, Stephan},
    booktitle={Proceedings of the International Workshop on Spoken Language Translation, IWSLT},
    volume={13},
    year={2013}
    }
  • D. Jansen, A. Alcala, and F. Guzman, “Amara: A Sustainable, Global Solution for Accessibility, Powered by Communities of Volunteers,” in Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice, Springer, 2014, pp. 401–411.
    [BibTeX]
    @incollection{jansen2014amara,
    title={Amara: A sustainable, global solution for accessibility, powered by communities of volunteers},
    author={Jansen, Dean and Alcala, Aleli and Guzman, Francisco},
    booktitle={Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice},
    pages={401--411},
    year={2014},
    publisher={Springer}
    }
Download
 

The QED Corpus is available in two arrangements:

  • Machine translation dataset (IWSLT 2016 permissible data):
    • Data divided into training, development, and test (tst2014a and tst2014b) subsets; download (384 MB).
  • Raw corpus:
    • Original text files, organized by language and video id; download (105 MB).
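
As a minimal sketch of how the raw corpus might be traversed once extracted, the snippet below indexes files by language and video id. The directory layout (`<root>/<language>/<video_id>.<ext>`) and the helper names `index_raw_corpus` and `videos_with_pair` are assumptions for illustration only; check the extracted archive for the actual arrangement before relying on it.

```python
from collections import defaultdict
from pathlib import Path


def index_raw_corpus(root):
    """Map video id -> {language: path}.

    Assumes a hypothetical layout <root>/<language>/<video_id>.<ext>;
    this is an illustration, not the documented archive structure.
    """
    index = defaultdict(dict)
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue
        for path in sorted(lang_dir.iterdir()):
            if path.is_file():
                # The file name without its extension is taken as the video id.
                index[path.stem][lang_dir.name] = path
    return dict(index)


def videos_with_pair(index, src, tgt):
    """Return the sorted video ids that have files in both languages."""
    return sorted(
        vid for vid, langs in index.items() if src in langs and tgt in langs
    )
```

With such an index, `videos_with_pair(index, "en", "ar")` would list the videos for which both an English and an Arabic file exist, which is the starting point for extracting a parallel subset.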

 

Details

Table 1. Total number of parallel segments in the QED Corpus v1.4

       en   sp   pt  zhs  zht   tr   pl   ar   fr   cz   ru   it   ja   kr   bg   de   th   nl   da   hi
en   2.5M
sp   335K 479K
pt   231K 117K 291K
zhs  139K  52K  46K 281K
zht  117K  49K  48K 191K 231K
tr   169K  72K  67K  47K  51K 205K
pl   151K  88K  72K  47K  52K  65K 197K
ar   158K  83K  73K  58K  59K  90K  69K 185K
fr   125K  63K  58K  29K  29K  36K  59K  48K 161K
cz   132K  61K  65K  55K  56K  66K  69K  56K  40K 158K
ru    77K  39K  36K  22K  20K  18K  30K  29K  29K  25K 146K
it    97K  52K  49K  29K  29K  38K  43K  44K  39K  38K  23K 124K
ja    98K  44K  44K  46K  42K  47K  39K  48K  31K  43K  23K  29K 113K
kr    83K  36K  36K  30K  32K  28K  32K  37K  25K  30K  14K  28K  26K 109K
bg    79K  39K  44K  28K  33K  45K  44K  39K  33K  42K  15K  27K  21K  19K 100K
de    77K  49K  45K  22K  25K  30K  45K  36K  42K  35K  22K  30K  24K  23K  28K  99K
th    85K  50K  40K  31K  29K  56K  38K  45K  21K  29K  14K  22K  27K  15K  20K  20K  95K
nl    73K  43K  42K  25K  29K  33K  41K  40K  33K  41K  19K  31K  25K  22K  25K  30K  19K  85K
da    48K  27K  34K  18K  21K  29K  25K  30K  23K  32K  12K  21K  18K  16K  25K  16K  10K  21K  58K
hi    43K  26K  22K  14K  14K  17K  24K  25K  16K  17K   8K  26K  16K  14K  13K  13K  13K  17K  15K  48K
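
Because Table 1 is lower-triangular, looking up a language pair in either order takes a little bookkeeping. The following sketch, with hypothetical helper names `parse_count` and `build_pair_counts`, expands the rows into a symmetric lookup. Only the first four rows are transcribed here, and the counts are approximate because the table rounds to thousands or millions.

```python
def parse_count(cell):
    """Convert a cell such as '335K' or '2.5M' to an approximate integer."""
    scale = {"K": 1_000, "M": 1_000_000}[cell[-1]]
    return round(float(cell[:-1]) * scale)


def build_pair_counts(langs, rows):
    """Build a symmetric {(a, b): count} lookup from lower-triangular rows."""
    counts = {}
    for i, row in enumerate(rows):
        cells = row.split()
        if len(cells) != i + 1:
            raise ValueError(f"row {i} should have {i + 1} cells, got {len(cells)}")
        for j, cell in enumerate(cells):
            n = parse_count(cell)
            counts[(langs[i], langs[j])] = n
            counts[(langs[j], langs[i])] = n
    return counts


# First four rows of Table 1 (en, sp, pt, zhs); extend with the remaining rows as needed.
LANGS = ["en", "sp", "pt", "zhs"]
ROWS = ["2.5M", "335K 479K", "231K 117K 291K", "139K 52K 46K 281K"]

PAIR_COUNTS = build_pair_counts(LANGS, ROWS)
print(PAIR_COUNTS[("pt", "sp")])  # 117000
print(PAIR_COUNTS[("sp", "pt")])  # 117000
```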

Table 2. Results for systems trained, tuned, and tested on the QED Corpus v1.4, translating into English*

                        BLEU NISTv13                 OOV
Source Lang.      tst2014a  tst2014b  tst2014a  tst2014b
Spanish (sp)          48.2      41.4      0.6%      0.8%
Portuguese (pt)       52.1      46.6      0.7%      0.9%
Arabic (ar)           38.0      34.4      1.0%      1.2%
Polish (pl)           34.7      29.4      2.5%      2.4%
Czech (cz)            33.7      32.9      2.3%      2.6%
French (fr)           31.5      35.1      0.8%      1.1%
German (de)           35.2      34.0      2.3%      1.8%
Russian (ru)          34.3      38.6      1.8%      1.7%
Dutch (nl)            39.8      45.6      1.3%      1.4%
Danish (da)           40.5      35.3      2.5%      2.6%

*Details about the SMT system configurations and settings are given in the paper.
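
To make the OOV columns of Table 2 concrete, here is a minimal sketch of one common way to compute an out-of-vocabulary rate: the share of test tokens that never appear in the training data. This token-level definition, the whitespace tokenization, and the function name `oov_rate` are assumptions; the paper may define OOV differently (for example over word types or after other preprocessing).

```python
def oov_rate(train_sentences, test_sentences):
    """Return the fraction of test tokens not seen in training.

    Tokens are split on whitespace; this is an assumed definition,
    not necessarily the one used to produce Table 2.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data for illustration only (not taken from the corpus).
train = ["the lecture starts now", "the video ends"]
test = ["the lecture ends soon"]
print(f"{oov_rate(train, test):.1%}")  # 25.0%
```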

Contacts
 
If you have any questions about the corpus, please direct your inquiries to Ahmed Abdelali or Francisco Guzman.
 
License
 
Developed by:
Qatar Computing Research Institute
Arabic Language Technologies Group
The QED Corpus is made public for RESEARCH purposes only.
The corpus is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.