Farasa: Fast and Accurate Arabic Word Segmenter

Kareem Darwish and Hamdy Mubarak

Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar

Goal: Build a state-of-the-art Arabic segmenter that is furiously FAST

Arabic is a Semitic language with rich derivational morphology

  • Roots are fit into templates to generate stems
  • Prefixes and suffixes are attached to stems to generate words
  • Stems are the units of meaning
  • Not all combinations of root-patterns, stem-prefixes, stem-suffixes, and prefixes-suffixes are valid.

The segmenter takes a word and splits off its prefixes and suffixes:
Noun: wbktAbnA → w+b+ktAb+nA (and in our book)
Verb: fsynfqwnhA → f+s+ynfq+wn+hA (so they will spend it)

Proper segmentation is critical for machine translation (MT) and Arabic Information Retrieval (IR).

Sample Output
Word Segmentation
تبين تبين
للعلماء ل+ال+علماء
أن أن
عشرين عشر+ين
دقيقة دقيق+ة
من من
الرياضة ال+رياض+ة
في في
اليوم ال+يوم
تساعد تساعد
على على
أبعاد أبعاد
الإنفلونزا ال+إنفلونزا
Segmentation is posed as a ranking problem (SVMRank)
  • Given a word (without context), find all possible segmentations:
    • Valid prefixes: f, w, l, b, k, Al, s
    • Valid suffixes: A, p, t, k, n, w, y, At, An, wn, wA, yn, kmA, km, kn, h, hA, hmA, hm, hn, nA, tmA, tm, tn
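The candidate-generation step can be sketched as follows. This is a minimal illustration, not Farasa's actual code: it assumes candidates are built by peeling any sequence of the listed prefixes off the front of the word and any sequence of the listed suffixes off the end, always leaving a non-empty stem. The function names `peel` and `candidate_segmentations` are hypothetical. Words are written in the same Latin transliteration used in the examples above.

```python
PREFIXES = ["f", "w", "l", "b", "k", "Al", "s"]
SUFFIXES = ["A", "p", "t", "k", "n", "w", "y", "At", "An", "wn", "wA", "yn",
            "kmA", "km", "kn", "h", "hA", "hmA", "hm", "hn", "nA", "tmA", "tm", "tn"]


def peel(text, affixes, from_start):
    """Return every way to strip zero or more affixes from one end of text.

    Each result is (affixes_in_surface_order, remainder).
    """
    results = [([], text)]
    for affix in affixes:
        if len(text) <= len(affix):
            continue  # never consume the whole string: the stem must be non-empty
        if from_start and text.startswith(affix):
            for more, rest in peel(text[len(affix):], affixes, True):
                results.append(([affix] + more, rest))
        if not from_start and text.endswith(affix):
            for more, rest in peel(text[:-len(affix)], affixes, False):
                results.append((more + [affix], rest))
    return results


def candidate_segmentations(word):
    """Enumerate prefix* + stem + suffix* splits, including the unsegmented word."""
    candidates = set()
    for prefixes, rest in peel(word, PREFIXES, True):
        for suffixes, stem in peel(rest, SUFFIXES, False):
            candidates.add("+".join(prefixes + [stem] + suffixes))
    return sorted(candidates)


for candidate in candidate_segmentations("wbktAbnA"):
    print(candidate)
```

The sketch deliberately over-generates: it does not filter out invalid prefix-suffix combinations, on the assumption that the ranker is what separates plausible candidates from implausible ones.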
Rank each possible segmentation based on the following features:
  • P(prefix | leading character sequence)
  • P(suffix | trailing character sequence)
  • Language Model P(stem): LM trained on 12 years of Aljazeera newswire articles (94 million words).
  • LM P(stem+1st suffix)
  • P(prefix| suffix) & P(suffix| prefix)
  • P(stem template) - template acquired from QATARA
  • Is stem in Lexicon:
    • Aljazeera corpus
    • Wikipedia gazetteer
    • AraComLex
    • Buckwalter stems
  • |stemlength - average(stemlength)|
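To make the ranking step concrete, the sketch below scores each candidate as a weighted sum of its feature values and keeps the highest-scoring one. It assumes a linear ranking model; the three features shown, their values, and the weights are all invented for illustration and are not Farasa's learned parameters. The names `Candidate`, `score`, and `best_candidate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    segmentation: str
    features: list  # one value per feature, always in the same order


def score(weights, features):
    """Linear ranking score: dot product of weights and feature values."""
    return sum(w * x for w, x in zip(weights, features))


def best_candidate(weights, candidates):
    """Return the candidate with the highest score."""
    return max(candidates, key=lambda c: score(weights, c.features))


# Illustrative feature order (assumed): [log P(stem), stem in lexicon, |stem length - average|]
WEIGHTS = [1.0, 2.0, -0.5]  # made-up weights

CANDIDATES = [
    Candidate("wbktAbnA", [-12.0, 0.0, 4.0]),
    Candidate("w+b+ktAb+nA", [-5.0, 1.0, 0.0]),
    Candidate("w+b+k+tAb+nA", [-9.0, 0.0, 1.0]),
]

print(best_candidate(WEIGHTS, CANDIDATES).segmentation)
```

With these made-up numbers the candidate whose stem is in the lexicon and has a typical length wins, which is the behavior the features above are meant to encourage.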
Training and Testing
  • Arabic Penn Treebank (ATB) - parts 1, 2, and 3
  • 629k tokens (66k unique)
  • 70 WikiNews articles (from 2013 and 2014)
    • Cover politics, economics, health, science and technology, sports, art, and culture (10 per theme)
    • Contain 18,300 words
Experimental setups:
  • Farasabase: scores every word
  • Farasalookup: uses segmentations from training and scores unseen words
  • Compared to: MADAMIRA and QATARA
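The difference between the two Farasa setups can be sketched as a lookup table with a fallback. This is an assumption-laden illustration: the poster does not say how words with more than one segmentation in the training data are handled, so the sketch simply keeps the most frequent one. `build_lookup`, `segment_with_lookup`, and the stand-in scorer are hypothetical names.

```python
from collections import Counter, defaultdict


def build_lookup(training_pairs):
    """Map each training word to its most frequent segmentation (assumed policy)."""
    counts = defaultdict(Counter)
    for word, segmentation in training_pairs:
        counts[word][segmentation] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def segment_with_lookup(word, lookup, score_word):
    """Reuse the training segmentation if the word was seen; otherwise score it."""
    if word in lookup:
        return lookup[word]
    return score_word(word)


training_pairs = [
    ("wbktAbnA", "w+b+ktAb+nA"),
    ("mn", "mn"),
    ("wbktAbnA", "w+b+ktAb+nA"),
    ("wbktAbnA", "wbktAbnA"),
]
lookup = build_lookup(training_pairs)


def stand_in_scorer(word):
    # Placeholder for the full ranker: returns the word unsegmented.
    return word


print(segment_with_lookup("wbktAbnA", lookup, stand_in_scorer))
print(segment_with_lookup("fsynfqwnhA", lookup, stand_in_scorer))
```

A dictionary hit skips candidate generation and scoring entirely, which is one plausible reason the lookup setup is both faster and more accurate on words that appear in the training data.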
System          Error Rate
QATARA          1.77%
Farasabase      1.24%
Farasalookup    1.06%
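The reported error rates can be compared as relative error reductions. The short calculation below uses only the three figures reported above; the helper name `relative_reduction` is ours.

```python
# Error rates in percent, as reported above.
error_rate = {"QATARA": 1.77, "Farasabase": 1.24, "Farasalookup": 1.06}


def relative_reduction(baseline, system):
    """Fraction of the baseline's errors that the system eliminates."""
    return (error_rate[baseline] - error_rate[system]) / error_rate[baseline]


for system in ("Farasabase", "Farasalookup"):
    print(f"{system} vs QATARA: {relative_reduction('QATARA', system):.1%} fewer errors")
```

By this measure Farasabase removes roughly 30% of QATARA's errors and Farasalookup roughly 40%.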
Why Use Farasa?
    Speed test on 7.4 million words on an i5 laptop:
    Farasabase: 129 sec
    QATARA: 18 min
    MADAMIRA: 2.5 hours
    MADAMIRA: 2.5 hours
    Farasalookup can process one billion words in less than 5 hours
    100% Java
    Packaged into a single jar file with no external dependencies
    Lucene and Moses integration available
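The speed-test timings above translate into throughput and time ratios as follows. The calculation uses only the reported figures. The one-billion-word estimate is extrapolated from the Farasabase timing, since no separate timing is listed for Farasalookup, and it assumes throughput stays constant at scale.

```python
WORDS = 7_400_000  # size of the speed-test corpus

# Wall-clock times from the speed test, converted to seconds.
seconds = {
    "Farasabase": 129,
    "QATARA": 18 * 60,
    "MADAMIRA": 2.5 * 3600,
}


def words_per_second(system):
    return WORDS / seconds[system]


def time_ratio(system, baseline="Farasabase"):
    """How many times longer `system` took than the baseline."""
    return seconds[system] / seconds[baseline]


def hours_for(n_words, system="Farasabase"):
    """Estimated hours to process n_words at the system's measured rate."""
    return n_words / words_per_second(system) / 3600


for name in seconds:
    print(f"{name}: {words_per_second(name):,.0f} words/sec")
print(f"QATARA took {time_ratio('QATARA'):.1f}x as long as Farasabase")
print(f"MADAMIRA took {time_ratio('MADAMIRA'):.1f}x as long as Farasabase")
print(f"One billion words at the Farasabase rate: {hours_for(1_000_000_000):.2f} hours")
```

At the measured rate of about 57,000 words per second, one billion words take a little under 5 hours, consistent with the claim above.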
