The QCRI Educational Domain (QED) Corpus.
Abstract
The QCRI Educational Domain Corpus (formerly QCRI AMARA Corpus) is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated on the AMARA web-based platform. The current release, v1.4, covers 20 languages distributed over 44,620 files.
Related publications
- A. Abdelali, F. Guzman, H. Sajjad, and S. Vogel, “The AMARA Corpus: Building Parallel Language Resources for the Educational Domain,” in Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26-31 May 2014.
[BibTeX]@InProceedings{ABDELALI14.877, author = {Ahmed Abdelali and Francisco Guzman and Hassan Sajjad and Stephan Vogel}, title = {The AMARA Corpus: Building Parallel Language Resources for the Educational Domain}, booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)}, year = {2014}, month = {may}, date = {26-31}, address = {Reykjavik, Iceland}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, isbn = {978-2-9517408-8-4}, language = {english} }
- F. Guzman, H. Sajjad, A. Abdelali, and S. Vogel, “The AMARA Corpus: Building Resources for Translating the Web’s Educational Content,” in Proceedings of the International Workshop on Spoken Language Translation (IWSLT), 2013.
[BibTeX]@inproceedings{guzman2013amara, title={The AMARA corpus: Building resources for translating the web's educational content}, author={Guzman, Francisco and Sajjad, Hassan and Abdelali, A and Vogel, S}, booktitle={Proceedings of the International Workshop on Spoken Language Translation, IWSLT}, volume={13}, year={2013} }
- D. Jansen, A. Alcala, and F. Guzman, “Amara: A Sustainable, Global Solution for Accessibility, Powered by Communities of Volunteers,” in Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice, Springer, 2014, pp. 401–411.
[BibTeX]@incollection{jansen2014amara, title={Amara: A sustainable, global solution for accessibility, powered by communities of volunteers}, author={Jansen, Dean and Alcala, Aleli and Guzman, Francisco}, booktitle={Universal Access in Human-Computer Interaction. Design for All and Accessibility Practice}, pages={401--411}, year={2014}, publisher={Springer} }
Download
The QED Corpus is available in two arrangements:
- Machine Translation dataset (IWSLT 2016 Permissible Data): data divided into training, development, and testing (tst2014a and tst2014b) subsets; download (384Mb).
- Raw corpus: original text files, organized by language and video id; download (105Mb).
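As a rough illustration of working with the raw corpus, the sketch below indexes files by video id so that videos available in a given language pair can be listed. The directory layout `root/<lang>/<video_id>.<ext>` is an assumption made for this example only; the page says the files are organized by language and video id but does not specify the exact paths, so adjust the glob pattern to the actual archive. The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def index_by_video(root):
    """Map video id -> {language: path}.

    ASSUMPTION: files live at root/<lang>/<video_id>.<ext>.
    This layout is illustrative and not confirmed by the corpus page.
    """
    index = defaultdict(dict)
    for path in sorted(Path(root).glob("*/*")):
        if path.is_file():
            # Directory name is taken as the language, file stem as the video id.
            index[path.stem][path.parent.name] = path
    return dict(index)


def videos_with_pair(index, src, tgt):
    """Return the sorted video ids that have files in both languages."""
    return sorted(v for v, langs in index.items() if src in langs and tgt in langs)
```

Calling `videos_with_pair(index_by_video("qed_raw"), "en", "fr")` would then list the candidate videos for an English-French subset, where `qed_raw` stands in for wherever the archive was unpacked.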
Details
Table 1. Total number of parallel segments in the QED Corpus v1.4
| | en | sp | pt | zhs | zht | tr | pl | ar | fr | cz | ru | it | ja | kr | bg | de | th | nl | da | hi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| en | 2.5M | | | | | | | | | | | | | | | | | | | |
| sp | 335K | 479K | | | | | | | | | | | | | | | | | | |
| pt | 231K | 117K | 291K | | | | | | | | | | | | | | | | | |
| zhs | 139K | 52K | 46K | 281K | | | | | | | | | | | | | | | | |
| zht | 117K | 49K | 48K | 191K | 231K | | | | | | | | | | | | | | | |
| tr | 169K | 72K | 67K | 47K | 51K | 205K | | | | | | | | | | | | | | |
| pl | 151K | 88K | 72K | 47K | 52K | 65K | 197K | | | | | | | | | | | | | |
| ar | 158K | 83K | 73K | 58K | 59K | 90K | 69K | 185K | | | | | | | | | | | | |
| fr | 125K | 63K | 58K | 29K | 29K | 36K | 59K | 48K | 161K | | | | | | | | | | | |
| cz | 132K | 61K | 65K | 55K | 56K | 66K | 69K | 56K | 40K | 158K | | | | | | | | | | |
| ru | 77K | 39K | 36K | 22K | 20K | 18K | 30K | 29K | 29K | 25K | 146K | | | | | | | | | |
| it | 97K | 52K | 49K | 29K | 29K | 38K | 43K | 44K | 39K | 38K | 23K | 124K | | | | | | | | |
| ja | 98K | 44K | 44K | 46K | 42K | 47K | 39K | 48K | 31K | 43K | 23K | 29K | 113K | | | | | | | |
| kr | 83K | 36K | 36K | 30K | 32K | 28K | 32K | 37K | 25K | 30K | 14K | 28K | 26K | 109K | | | | | | |
| bg | 79K | 39K | 44K | 28K | 33K | 45K | 44K | 39K | 33K | 42K | 15K | 27K | 21K | 19K | 100K | | | | | |
| de | 77K | 49K | 45K | 22K | 25K | 30K | 45K | 36K | 42K | 35K | 22K | 30K | 24K | 23K | 28K | 99K | | | | |
| th | 85K | 50K | 40K | 31K | 29K | 56K | 38K | 45K | 21K | 29K | 14K | 22K | 27K | 15K | 20K | 20K | 95K | | | |
| nl | 73K | 43K | 42K | 25K | 29K | 33K | 41K | 40K | 33K | 41K | 19K | 31K | 25K | 22K | 25K | 30K | 19K | 85K | | |
| da | 48K | 27K | 34K | 18K | 21K | 29K | 25K | 30K | 23K | 32K | 12K | 21K | 18K | 16K | 25K | 16K | 10K | 21K | 58K | |
| hi | 43K | 26K | 22K | 14K | 14K | 17K | 24K | 25K | 16K | 17K | 8K | 26K | 16K | 14K | 13K | 13K | 13K | 17K | 15K | 48K |
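Table 1 lists each language pair only once, in the lower triangle, so a lookup has to try both orderings of a pair. The following is a minimal sketch of that lookup over a small excerpt of the table. The dictionary `SEGMENTS` and the helpers `parse_count` and `segments` are hypothetical names introduced for illustration, and the sketch assumes that the count for a pair does not depend on direction.

```python
# Excerpt of Table 1 (lower triangle only), keyed as (row, column),
# with counts copied exactly as printed.
SEGMENTS = {
    ("en", "en"): "2.5M",
    ("sp", "en"): "335K",
    ("sp", "sp"): "479K",
    ("pt", "en"): "231K",
    ("pt", "sp"): "117K",
    ("pt", "pt"): "291K",
}


def parse_count(text):
    """Convert a rounded table entry such as '335K' or '2.5M' to an integer."""
    units = {"K": 1_000, "M": 1_000_000}
    if text[-1] in units:
        return round(float(text[:-1]) * units[text[-1]])
    return int(text)


def segments(a, b):
    """Look up the segment count for a language pair in either order.

    ASSUMPTION: the table is symmetric, so (a, b) and (b, a) share one entry.
    """
    value = SEGMENTS.get((a, b)) or SEGMENTS.get((b, a))
    if value is None:
        raise KeyError(f"no entry for {a}-{b} in this excerpt")
    return parse_count(value)
```

Because the published figures are rounded to the nearest thousand (or hundred thousand for English), the parsed values are approximations of the true counts.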
Table 2. Results (BLEU scores and out-of-vocabulary rates) for systems trained, tuned, and tested on the QED Corpus v1.4 for translating into English*:
Source Lang. | BLEU NISTv13 (tst2014a) | BLEU NISTv13 (tst2014b) | OOV (tst2014a) | OOV (tst2014b) |
---|---|---|---|---|
Spanish (sp) | 48.2 | 41.4 | 0.6% | 0.8% |
Portuguese (pt) | 52.1 | 46.6 | 0.7% | 0.9% |
Arabic (ar) | 38.0 | 34.4 | 1.0% | 1.2% |
Polish (pl) | 34.7 | 29.4 | 2.5% | 2.4% |
Czech (cz) | 33.7 | 32.9 | 2.3% | 2.6% |
French (fr) | 31.5 | 35.1 | 0.8% | 1.1% |
German (de) | 35.2 | 34.0 | 2.3% | 1.8% |
Russian (ru) | 34.3 | 38.6 | 1.8% | 1.7% |
Dutch (nl) | 39.8 | 45.6 | 1.3% | 1.4% |
Danish (da) | 40.5 | 35.3 | 2.5% | 2.6% |
*Details about the SMT system configurations and settings are given in the paper.
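For readers unfamiliar with the OOV columns in Table 2, an out-of-vocabulary rate is commonly computed as the share of test-set tokens that never occur in the training data. The sketch below shows that computation under stated assumptions: whitespace tokenization and token-level (not type-level) counting. The paper may use a different tokenizer or definition, so this should not be expected to reproduce the reported figures. The function name `oov_rate` is hypothetical.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that do not appear in the training vocabulary.

    ASSUMPTIONS: whitespace tokenization, token-level counting.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


if __name__ == "__main__":
    train = ["the cat sat", "a dog ran"]
    test = ["the dog sat down"]
    # "down" is the only unseen token out of four.
    print(f"{oov_rate(train, test):.1%}")
```

A type-level variant would count distinct unseen words instead of every occurrence, which generally yields a higher percentage on the same data.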
Contacts
If you have any questions about the corpus, please direct your inquiries to Ahmed Abdelali or Francisco Guzman.
License
Developed by:
Qatar Computing Research Institute
Arabic Language Technologies Group
The QED Corpus is made public for RESEARCH purposes only.
The corpus is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.