
Farasa: Fast and Accurate Arabic Word Segmenter

Kareem Darwish and Hamdy Mubarak
{kdarwish,hmubarak}@qf.org.qa

Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar



Goal
Build a state-of-the-art Arabic segmenter that is furiously FAST
Background

Arabic is a Semitic language with rich derivational morphology

  • Roots are fit into templates to generate stems
  • Prefixes and suffixes are attached to stems to generate words
  • Stems are the units of meaning
  • Not all combinations of root-patterns, stem-prefixes, stem-suffixes, and prefixes-suffixes are valid.

The segmenter takes a word and splits off its prefixes and suffixes (examples in Buckwalter transliteration):
Noun: wbktAbnA → w+b+ktAb+nA (and in our book)
Verb: fsynfqwnhA → f+s+ynfq+wn+hA (so they will spend it)

Proper segmentation is critical for machine translation (MT) and Arabic Information Retrieval (IR).

Sample Output
Word Segmentation
تبين تبين
للعلماء ل+ال+علماء
أن أن
عشرين عشر+ين
دقيقة دقيق+ة
من من
الرياضة ال+رياض+ة
في في
اليوم ال+يوم
تساعد تساعد
على على
أبعاد أبعاد
الإنفلونزا ال+إنفلونزا
Procedure
Segmentation is posed as a ranking problem (SVMRank)
  • Given a word (without context), find all possible segmentations:
    • Valid prefixes: f, w, l, b, k, Al, s
    • Valid suffixes: A, p, t, k, n, w, y, At, An, wn, wA, yn, kmA, km, kn, h, hA, hmA, hm, hn, nA, tmA, tm, tn
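The candidate-generation step above can be sketched in Python as follows. This is a minimal illustration, not Farasa's actual Java code: the function names `split_affixes` and `candidate_segmentations` are hypothetical, and the sketch assumes that any listed affix may stack with any other in any order. That assumption over-generates, which is acceptable here because the ranking step is what separates plausible candidates from implausible ones. Words are in Buckwalter transliteration.

```python
# Affix inventories copied from the lists above.
PREFIXES = ["f", "w", "l", "b", "k", "Al", "s"]
SUFFIXES = ["A", "p", "t", "k", "n", "w", "y", "At", "An", "wn", "wA", "yn",
            "kmA", "km", "kn", "h", "hA", "hmA", "hm", "hn", "nA",
            "tmA", "tm", "tn"]


def split_affixes(text, affixes, from_start):
    """Return every way to peel zero or more affixes off one end of text.

    Each result is (affixes_in_word_order, remainder); the remainder is
    never empty, so a stem always survives.
    """
    results = [([], text)]
    for affix in affixes:
        if len(text) <= len(affix):
            continue  # peeling this affix would leave nothing behind
        if from_start and text.startswith(affix):
            for rest, remainder in split_affixes(text[len(affix):], affixes, True):
                results.append(([affix] + rest, remainder))
        elif not from_start and text.endswith(affix):
            for rest, remainder in split_affixes(text[:-len(affix)], affixes, False):
                results.append((rest + [affix], remainder))
    return results


def candidate_segmentations(word):
    """Enumerate prefix*+stem+suffix* splits of word as '+'-joined strings."""
    candidates = set()
    for prefixes, rest in split_affixes(word, PREFIXES, True):
        for suffixes, stem in split_affixes(rest, SUFFIXES, False):
            candidates.add("+".join(prefixes + [stem] + suffixes))
    return sorted(candidates)


if __name__ == "__main__":
    for candidate in candidate_segmentations("wbktAbnA"):
        print(candidate)
```

The unsegmented word is always one of the candidates, so the ranker can also decide that no split is needed.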
Rank each possible segmentation based on the following features:
  • P(prefix | leading character sequence)
  • P(suffix | trailing character sequence)
  • Language model (LM) P(stem): LM trained on 12 years of Aljazeera newswire articles (94 million words)
  • LM P(stem + first suffix)
  • P(prefix | suffix) and P(suffix | prefix)
  • P(stem template): templates acquired from QATARA
  • Is stem in Lexicon:
    • Aljazeera corpus
    • Wikipedia gazetteer
    • AraComLex
    • Buckwalter stems
  • |stem length - average(stem length)|
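To make the ranking step concrete, here is a small Python sketch of how a learned ranker could be applied at run time. It assumes a linear model, so each candidate's score is a weighted sum of its feature values. Everything numeric below is invented for illustration: the weights, the probabilities, and the average stem length are toy values, not numbers learned by SVMRank or taken from Farasa, and the feature names are hypothetical labels for the features listed above.

```python
# Toy weights: NOT learned values, chosen only so the example runs.
WEIGHTS = {
    "p_prefix": 1.0,      # P(prefix | leading character sequence)
    "p_suffix": 1.0,      # P(suffix | trailing character sequence)
    "lm_stem": 2.0,       # language-model probability of the stem
    "in_lexicon": 1.5,    # 1.0 if the stem is in a lexicon, else 0.0
    "stem_len_dev": -0.5, # |stem length - average stem length|
}

AVERAGE_STEM_LENGTH = 4.0  # assumed value, for illustration only


def stem_length_deviation(stem, average=AVERAGE_STEM_LENGTH):
    """The last feature in the list above: distance from the average stem length."""
    return abs(len(stem) - average)


def score(features, weights=WEIGHTS):
    """Linear score: sum of weight * feature value."""
    return sum(weights[name] * value for name, value in features.items())


def rank(candidates):
    """Sort (segmentation, features) pairs from best to worst score."""
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)


# Two hand-made candidates for the word wbktAbnA (feature values are made up).
TOY_CANDIDATES = [
    ("wbktAbnA", {"p_prefix": 0.1, "p_suffix": 0.1, "lm_stem": 0.0,
                  "in_lexicon": 0.0,
                  "stem_len_dev": stem_length_deviation("wbktAbnA")}),
    ("w+b+ktAb+nA", {"p_prefix": 0.6, "p_suffix": 0.7, "lm_stem": 0.8,
                     "in_lexicon": 1.0,
                     "stem_len_dev": stem_length_deviation("ktAb")}),
]

if __name__ == "__main__":
    for segmentation, features in rank(TOY_CANDIDATES):
        print(segmentation, round(score(features), 2))
```

With these toy numbers the segmented candidate outranks the unsegmented one, mainly because its stem is in the lexicon and close to the assumed average length.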
Training and Testing
Training:
  • Arabic Penn Treebank (ATB) - parts 1, 2, and 3
  • 629k tokens (66k unique)
Testing:
  • 70 WikiNews articles (from 2013 and 2014)
    • Cover politics, economics, health, science and technology, sports, art, and culture (10 per theme)
    • Contain 18,300 words
Experimental setups:
  • Farasabase: scores every word
  • Farasalookup: uses segmentations from training and scores unseen words
  • Compared to: MADAMIRA and QATARA
System Error Rate
MADAMIRA 1.24%
QATARA 1.77%
Farasabase 1.24%
Farasalookup 1.06%
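The difference between the two Farasa setups can be sketched as a lookup table with a fallback. This Python snippet is only an illustration under stated assumptions: the names `build_lookup` and `make_lookup_segmenter` are hypothetical, and picking the most frequent training segmentation for a word that was seen with several is an assumption, since the description above does not say how such cases are resolved.

```python
from collections import Counter, defaultdict


def build_lookup(training_pairs):
    """Map each training word to its most frequent segmentation.

    training_pairs is an iterable of (word, segmentation) tuples.
    """
    counts = defaultdict(Counter)
    for word, segmentation in training_pairs:
        counts[word][segmentation] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def make_lookup_segmenter(training_pairs, fallback):
    """Return a segmenter that reuses seen segmentations and scores the rest.

    fallback is any function word -> segmentation; in the base setup it
    would be the ranker that scores every word.
    """
    table = build_lookup(training_pairs)

    def segment(word):
        if word in table:
            return table[word]   # seen in training: reuse its segmentation
        return fallback(word)    # unseen: fall back to scoring

    return segment


if __name__ == "__main__":
    pairs = [("wbktAbnA", "w+b+ktAb+nA"), ("ktAb", "ktAb")]
    # Placeholder fallback that leaves the word unsegmented.
    segment = make_lookup_segmenter(pairs, fallback=lambda w: w)
    print(segment("wbktAbnA"))
    print(segment("fsynfqwnhA"))
```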
Why Use Farasa?
Speed
    Speed test on 7.4 million words on an i5 laptop:
    Farasabase: 129 sec
    QATARA: 18 min
    MADAMIRA: 2.5 hours
    MADAMIRA: 2.5 hours
    Farasalookup can process one billion words in less than 5 hours
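As a quick sanity check on these figures, the short Python calculation below converts the timings to throughput. It assumes throughput scales linearly with corpus size, which the speed test does not state. Under that assumption the 129-second run corresponds to roughly 57,000 words per second, and one billion words would take about 4.8 hours at that rate, which is consistent with the "less than 5 hours" figure.

```python
def words_per_second(words, seconds):
    """Average throughput for a run."""
    return words / seconds


TEST_WORDS = 7.4e6  # size of the speed-test corpus

# Reported wall-clock times, converted to seconds.
SECONDS = {
    "Farasabase": 129,
    "QATARA": 18 * 60,
    "MADAMIRA": 2.5 * 3600,
}

rates = {name: words_per_second(TEST_WORDS, s) for name, s in SECONDS.items()}

# Linear extrapolation (an assumption) to one billion words.
hours_for_billion = 1e9 / rates["Farasabase"] / 3600

# How many times longer each system took than Farasabase.
slowdown = {name: s / SECONDS["Farasabase"] for name, s in SECONDS.items()}

if __name__ == "__main__":
    for name in SECONDS:
        print(f"{name}: {rates[name]:,.0f} words/sec, {slowdown[name]:.1f}x Farasabase time")
    print(f"One billion words at the Farasabase rate: {hours_for_billion:.2f} hours")
```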
Integration
    100% Java
    Packaged into a single jar file with no external dependencies
    Lucene and Moses integration available
Publications
  • Ahmed Abdelali, Kareem Darwish, Nadir Durrani, Hamdy Mubarak. 2016. Farasa: A Fast and Furious Segmenter for Arabic. NAACL-2016.
  • Kareem Darwish, Hamdy Mubarak. 2016. Farasa: A New Fast and Accurate Arabic Word Segmenter. LREC-2016.
  • Yuan Zhang, Chengtao Li, Regina Barzilay, Kareem Darwish. 2015. Randomized Greedy Inference for Joint Segmentation, POS Tagging and Dependency Parsing. NAACL-HLT 2015, pp. 42-52.
  • Kareem Darwish. 2013. Named Entity Recognition using Cross-lingual Resources: Arabic as an Example. ACL-2013.
  • Kareem Darwish, Wei Gao. 2014. Simple Effective Microblog Named Entity Recognition: Arabic as an Example. LREC-2014.