Arabic Language Technologies Team - Tools

Tools

KeLP - a Kernel-Based Learning Platform

KeLP lets you build complex kernel-machine-based systems in Java, and uses a JSON interface to store and load classifier configurations and to save trained models for reuse. It includes a range of online and batch learning algorithms as well as several kernel functions, from vector-based to structural kernels.
» Go to page

Farasa Arabic Text Processing Library

Farasa (which means “insight” in Arabic) is a fast and accurate text processing toolkit for Arabic. It consists of a segmentation/tokenization module, a POS tagger, an Arabic text diacritizer, and a dependency parser. The core component of Farasa is the segmentation/tokenization module, which is based on SVM-rank. The SVM uses linear kernels with a variety of features and lexicons to rank the possible segmentations of a word. The features include: likelihoods of stems, prefixes, suffixes, and their combinations; presence in lexicons containing valid stems or named entities; and underlying stem templates.
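The ranking idea described above can be illustrated with a minimal Python sketch. This is not Farasa's code or API: the prefix, suffix, and stem lexicons, the feature names, and the hand-set weights are all hypothetical (in SVM-rank the weights would be learned from data), and words are written in a Latin transliteration for readability. Compound prefixes are treated as a single unit to keep the example short.

```python
# Hypothetical toy resources (transliterated); a real system would use
# large lexicons and likelihoods estimated from a corpus.
PREFIXES = {"", "w", "Al", "wAl"}
SUFFIXES = {"", "h", "hm"}
STEM_LEXICON = {"ktAb", "qlm"}

# Hand-set weights standing in for a learned linear ranking model.
WEIGHTS = {"stem_in_lexicon": 2.0, "has_prefix": 0.5, "has_suffix": 0.5}


def candidates(word):
    """Yield every (prefix, stem, suffix) split allowed by the toy lexicons."""
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            prefix, stem, suffix = word[:i], word[i:j], word[j:]
            if prefix in PREFIXES and suffix in SUFFIXES:
                yield prefix, stem, suffix


def features(prefix, stem, suffix):
    """Map one candidate segmentation to a small feature vector."""
    return {
        "stem_in_lexicon": 1.0 if stem in STEM_LEXICON else 0.0,
        "has_prefix": 1.0 if prefix else 0.0,
        "has_suffix": 1.0 if suffix else 0.0,
    }


def score(feats):
    """Linear score: dot product of the weights and the feature values."""
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def segment(word):
    """Return the highest-scoring candidate segmentation of the word."""
    return max(candidates(word), key=lambda c: score(features(*c)))


print(segment("wAlktAbhm"))
```

With these toy resources, the candidate whose stem appears in the lexicon outranks the alternatives because the lexicon feature carries the largest weight.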

» Go to page

QCRI Arabic Normalizer

Translating into Arabic is tricky because Arabic spelling is often inconsistent in several respects: punctuation (both Arabic UTF-8 and English punctuation symbols are used), digits (which appear in both Arabic and Indian characters), diacritics (which can be used or omitted, and are often wrong), and spelling (some Arabic characters are frequently misspelled, especially Alef and Ta Marbuta; also, Waa sometimes appears separated). This script normalizes Arabic text to make it consistent for the purpose of machine translation (MT) evaluation.
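As a rough illustration of this kind of normalization, the following Python sketch handles three of the categories listed above. It is not the QCRI script: the function name, the choice of target forms (ASCII digits and punctuation, bare Alef, diacritics removed), and the character tables are assumptions made for the example. Ta Marbuta and separated Waa are deliberately left out because the correct handling depends on context.

```python
import re

# Arabic diacritics: U+064B..U+0652, plus the superscript Alef U+0670.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Arabic-Indic digits U+0660..U+0669 mapped to ASCII digits (assumed direction).
DIGIT_MAP = {0x0660 + i: str(i) for i in range(10)}

# Arabic comma, semicolon, and question mark mapped to ASCII (assumed direction).
PUNCT_MAP = {0x060C: ",", 0x061B: ";", 0x061F: "?"}

# Alef with Madda or Hamza (above or below) mapped to bare Alef U+0627.
ALEF_MAP = {0x0622: "\u0627", 0x0623: "\u0627", 0x0625: "\u0627"}


def normalize(text):
    """Apply each normalization step in turn and return the result."""
    text = DIACRITICS.sub("", text)   # drop optional diacritics
    text = text.translate(DIGIT_MAP)  # unify digit forms
    text = text.translate(PUNCT_MAP)  # unify punctuation
    text = text.translate(ALEF_MAP)   # unify Alef variants
    return text
```

For MT evaluation, the same function would be applied to both the system output and the reference translation so that spelling variants are not counted as errors.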
» Go to page

A Document-level Discourse Parser

This package includes:

A discourse segmenter
A discourse parser

» Go to page

Evaluation Metrics for Discourse Parsing

Implementation of the standard evaluation metrics as described in Dan Marcu's book.
» Go to page

Speech act recognizer for synchronous and asynchronous conversations

This package includes:

A bi-directional LSTM for speech act recognition (Theano, Keras)
A global CRF model for thread-level inference (MATLAB)

» Go to page

PrepOCRessor - Preprocessing for Arabic OCR

PrepOCRessor is a tool for preprocessing document images for optical character recognition. The tool follows the pipeline paradigm of Unix-like operating systems: a set of image processing operations is chained so that the output of each operation serves as the input to the next one. The tool supports batch processing for high parallelism and scalability. Even though we focus on Arabic script, the tool has been used successfully for other writing systems, e.g. Latin in the ICDAR2015 Competition HTRtS on historic documents.
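The pipeline paradigm can be sketched in a few lines of Python. This is only an illustration of chaining, not PrepOCRessor's interface: the operations `binarize` and `invert`, the helper names, and the representation of an image as a list of rows of grayscale values are all assumptions chosen to keep the example self-contained.

```python
from functools import reduce


def binarize(image, threshold=128):
    """Map each grayscale pixel to 0 or 255 (hypothetical operation)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def invert(image):
    """Swap dark and light pixels (hypothetical operation)."""
    return [[255 - px for px in row] for row in image]


def run_pipeline(image, operations):
    """Feed the output of each operation into the next, like a Unix pipe."""
    return reduce(lambda img, op: op(img), operations, image)


def run_batch(images, operations):
    """Apply the same pipeline to every image in a batch."""
    return [run_pipeline(img, operations) for img in images]


page = [[10, 200], [130, 90]]
print(run_pipeline(page, [binarize, invert]))
```

Because each image passes through the chain independently of the others, a batch like the one in `run_batch` could be split across worker processes without changing the operations themselves.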
» Go to page