HBKU - QCRI
Resources created by our team

We have created multiple packages that have been used in various projects.

 

Modern Standard Arabic Pronunciation Lexicon

Kaldi Gale Recipe

This package includes files for building Arabic ASR using the GALE database from LDC and the Kaldi Speech Recognition Toolkit. The test set is a mix of conversational and report speech.

QCRI Educational Domain (QED) Corpus

The QED Corpus is an open multilingual collection of subtitles for educational videos and lectures, collaboratively transcribed and translated over the AMARA web-based platform. The current release, QED Corpus v1.4, contains 20 languages distributed over 44,620 files.

Annotated Al Jazeera Dialectal Speech Corpus

This corpus contains speech from Al Jazeera with both human-annotated and automatically-assigned labels for MSA and four major dialect groups (Egyptian, Levantine, North African, Gulf).

Bilingual Corpus of Parallel Tweets

A collection of parallel Arabic-English tweets and an additional list of Twitter accounts that post parallel tweets.

Arabic Fact-Checking and Stance Detection Corpus

A corpus covering rationale, relevant document retrieval, and fact checking. It contains 422 claims about the war in Syria and related Middle East political issues, each labeled for factuality (True or False).

WAW Corpus

The WAW Corpus is a bilingual translation and interpretation corpus in Arabic and English. It comprises recordings from three international conferences, namely WISE 2013, ARC’14, and WISH. These recordings contain both the original speaker and the interpreter, along with their transcripts and translations.

QCRI Arabic Dialects Identification (QADI) Corpus

QCRI Arabic Dialects Identification (QADI) is a country-level Arabic dialect identification (DI) dataset. It provides a collection for benchmarking the DI task.

Tanbih

The Tanbih mega-project aims to limit the effect of 'fake news', propaganda and media bias by making users aware of what they are reading. The team believes that promoting media literacy and critical thinking is the best way to address disinformation and 'fake news'.

QCRI Dialectal Arabic Resources

A list of resources for dialectal Arabic, open to researchers. These resources have been compiled at QCRI for research purposes and pilot experiments on various Arabic dialects.

QATIP

QATIP performs continuous text recognition and works best on entire pages of historical documents with a challenging script.

AraBench

AraBench offers 4 coarse, 15 fine-grained, and 25 city-level dialect categories, drawn from diverse genres such as media, chat, religion, and travel, with varying levels of dialectness.

Conference contributions

We have contributed to many conferences held worldwide.

International Workshop on Semantic Evaluation

Barcelona, Spain

Collocated with the 28th International Conference on Computational Linguistics (COLING-2020).

International Workshop on Semantic Evaluation

Minneapolis, USA

Collocated with the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019).

International Workshop on Semantic Evaluation

New Orleans, LA, USA

Collocated with the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018).

International Workshop on Semantic Evaluation (SemEval-2017)

Vancouver, Canada

Collocated with the 55th Annual Meeting of the Association for Computational Linguistics (ACL).

International Workshop on Semantic Evaluation

Denver, Colorado

Collocated with NAACL-2015

International Workshop on Semantic Evaluation (SemEval-2014)

Dublin, Ireland

Collocated with COLING and *Sem