The QCRI Educational Domain (QED) Corpus

Abstract

The QCRI Educational Domain Corpus (formerly QCRI AMARA Corpus) is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated on the AMARA web-based platform. The current release, v1.4, covers 20 languages across 44,620 files.

Related publications

A. Abdelali, F. Guzman, H. Sajjad, and S. Vogel, “The AMARA corpus: building parallel language resources for the educational domain,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May 26-31, 2014.
[BibTeX]

@InProceedings{ABDELALI14.877,
author = {Ahmed Abdelali and Francisco Guzman and Hassan Sajjad and Stephan Vogel},
title = {The AMARA Corpus: Building Parallel Language Resources for the Educational Domain},
booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
year = {2014},
month = {may},
date = {26-31},
address = {Reykjavik, Iceland},
editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
publisher = {European Language Resources Association (ELRA)},
isbn = {978-2-9517408-8-4},
language = {english}
}

F. Guzman, H. Sajjad, A. Abdelali, and S. Vogel, “The AMARA corpus: building resources for translating the web’s educational content,” in Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2013.
[BibTeX]

@inproceedings{guzman2013amara,
title={The AMARA corpus: Building resources for translating the web's educational content},
author={Guzman, Francisco and Sajjad, Hassan and Abdelali, Ahmed and Vogel, Stephan},
booktitle={Proceedings of the International Workshop on Spoken Language Translation, IWSLT},
volume={13},
year={2013}
}

D. Jansen, A. Alcala, and F. Guzman, “Amara: a sustainable, global solution for accessibility, powered by communities of volunteers,” in Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice, Springer, 2014, pp. 401–411.
[BibTeX]

@incollection{jansen2014amara,
title={Amara: A sustainable, global solution for accessibility, powered by communities of volunteers},
author={Jansen, Dean and Alcala, Aleli and Guzman, Francisco},
booktitle={Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice},
pages={401--411},
year={2014},
publisher={Springer}
}

Download

The QED Corpus is available in two arrangements:

Machine Translation dataset (IWSLT 2016 permissible data):
- Data divided into training, development, and test (tst2014a and tst2014b) subsets; download (384 MB).

Raw corpus:
- Original text files, organized by language and video ID; download (105 MB).

Details

Table 1. Total number of parallel segments in the QED Corpus v1.4

	en	sp	pt	zhs	zht	tr	pl	ar	fr	cz	ru	it	ja	kr	bg	de	th	nl	da	hi
en	2.5M
sp	335K	479K
pt	231K	117K	291K
zhs	139K	52K	46K	281K
zht	117K	49K	48K	191K	231K
tr	169K	72K	67K	47K	51K	205K
pl	151K	88K	72K	47K	52K	65K	197K
ar	158K	83K	73K	58K	59K	90K	69K	185K
fr	125K	63K	58K	29K	29K	36K	59K	48K	161K
cz	132K	61K	65K	55K	56K	66K	69K	56K	40K	158K
ru	77K	39K	36K	22K	20K	18K	30K	29K	29K	25K	146K
it	97K	52K	49K	29K	29K	38K	43K	44K	39K	38K	23K	124K
ja	98K	44K	44K	46K	42K	47K	39K	48K	31K	43K	23K	29K	113K
kr	83K	36K	36K	30K	32K	28K	32K	37K	25K	30K	14K	28K	26K	109K
bg	79K	39K	44K	28K	33K	45K	44K	39K	33K	42K	15K	27K	21K	19K	100K
de	77K	49K	45K	22K	25K	30K	45K	36K	42K	35K	22K	30K	24K	23K	28K	99K
th	85K	50K	40K	31K	29K	56K	38K	45K	21K	29K	14K	22K	27K	15K	20K	20K	95K
nl	73K	43K	42K	25K	29K	33K	41K	40K	33K	41K	19K	31K	25K	22K	25K	30K	19K	85K
da	48K	27K	34K	18K	21K	29K	25K	30K	23K	32K	12K	21K	18K	16K	25K	16K	10K	21K	58K
hi	43K	26K	22K	14K	14K	17K	24K	25K	16K	17K	8K	26K	16K	14K	13K	13K	13K	17K	15K	48K
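
Table 1 is lower-triangular: each row lists counts only for the languages up to and including its own column, so a pair such as sp-en appears once, in the sp row. The following is a minimal sketch for looking up a pair in either order, assuming the table has been saved as tab-separated text in the layout shown above. The function names (`parse_count`, `load_pair_counts`, `pair_count`) are hypothetical helpers, not part of the corpus distribution, and the parsed values are only as precise as the rounded figures in the table.

```python
def parse_count(cell):
    """Convert a rounded cell such as '335K' or '2.5M' to an approximate integer."""
    scale = {"K": 1_000, "M": 1_000_000}
    if cell[-1] in scale:
        return round(float(cell[:-1]) * scale[cell[-1]])
    return int(cell)


def load_pair_counts(table_text):
    """Read a lower-triangular, tab-separated table into an order-free lookup."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    # The header starts with an empty cell above the row labels; drop it.
    codes = [c for c in lines[0].split("\t") if c]
    counts = {}
    for line in lines[1:]:
        cells = line.split("\t")
        row = cells[0]
        # zip stops at the shorter sequence, which matches the triangular shape.
        for col, cell in zip(codes, cells[1:]):
            if cell:
                counts[frozenset((row, col))] = parse_count(cell)
    return counts


def pair_count(counts, a, b):
    """Return the count for a language pair, regardless of argument order."""
    return counts[frozenset((a, b))]


# First three rows of Table 1, used here as a small sample.
SAMPLE = "\ten\tsp\tpt\nen\t2.5M\nsp\t335K\t479K\npt\t231K\t117K\t291K\n"
counts = load_pair_counts(SAMPLE)
print(pair_count(counts, "en", "sp"))
print(pair_count(counts, "pt", "sp"))
```

Keying the dictionary on a `frozenset` of the two codes means `("en", "sp")` and `("sp", "en")` resolve to the same entry; the diagonal cells are stored under a one-element set.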

Table 2. BLEU scores and out-of-vocabulary (OOV) rates for systems trained, tuned, and tested on the QED Corpus v1.4, translating into English^*

Source Lang.	BLEU (NISTv13) tst2014a	BLEU (NISTv13) tst2014b	OOV tst2014a	OOV tst2014b
Spanish (sp)	48.2	41.4	0.6%	0.8%
Portuguese (pt)	52.1	46.6	0.7%	0.9%
Arabic (ar)	38.0	34.4	1.0%	1.2%
Polish (pl)	34.7	29.4	2.5%	2.4%
Czech (cz)	33.7	32.9	2.3%	2.6%
French (fr)	31.5	35.1	0.8%	1.1%
German (de)	35.2	34.0	2.3%	1.8%
Russian (ru)	34.3	38.6	1.8%	1.7%
Dutch (nl)	39.8	45.6	1.3%	1.4%
Danish (da)	40.5	35.3	2.5%	2.6%

^*Details about the SMT system configurations and settings are given in the paper.
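
For readers unfamiliar with the OOV columns in Table 2, the sketch below shows one common way to compute an out-of-vocabulary rate: the share of test-set tokens that never occur in the training data. This is an assumption for illustration only. The exact definition, tokenization, and preprocessing behind the reported figures are described in the paper and may differ; the function name `oov_rate` and the toy sentences are hypothetical.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens absent from the training vocabulary.

    Assumes whitespace tokenization and token-level (not type-level) counting.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data: "membrane" is the only test token missing from the training side.
train = ["the cell divides", "the lecture begins"]
test = ["the cell membrane divides"]
print(f"{oov_rate(train, test):.1%}")  # prints 25.0%
```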

Contacts

If you have any questions about the corpus, please direct your inquiries to Ahmed Abdelali or Francisco Guzman.

License

Developed by:

Qatar Computing Research Institute

Arabic Language Technologies Group

The QED Corpus is made public for RESEARCH purposes only.

The corpus is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.