Bilingual Corpus of Arabic-English Parallel Tweets

Twitter users often post parallel tweets: tweets that carry the same content but are written in different languages. Parallel tweets can be an important resource for developing machine translation (MT) systems, among other natural language processing (NLP) tasks. This resource is the result of a generic method for collecting parallel tweets. Using this method, we compiled a bilingual corpus of English-Arabic parallel tweets and a list of Twitter accounts that regularly post English-Arabic tweets. Additionally, we annotated a subset of these accounts with their countries of origin and topics of interest, which provides insight into the population that posts parallel tweets.

Related publications

  • H. Mubarak, S. Hassan, and A. Abdelali, “Constructing a bilingual corpus of parallel tweets,” in Proceedings of the 13th Workshop on Building and Using Comparable Corpora (BUCC), Marseille, France, 2020.
The Bilingual Corpus of Arabic-English Parallel Tweets (Copyright 2020, QCRI, a member of Qatar Foundation. All Rights Reserved) is licensed under the Apache License, Version 2.0 (the "License"); you may not use it except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.