KeLP supports building complex kernel-machine-based systems, leveraging the Java language and a JSON interface to store and load classifier configurations and to save models for reuse. It includes several online and batch learning algorithms as well as a range of kernel functions, from vector-based to structural kernels.
Farasa (which means “insight” in Arabic) is a fast and accurate text processing toolkit for Arabic. Farasa consists of a segmentation/tokenization module, a POS tagger, an Arabic text diacritizer, and a dependency parser. The core component of Farasa is the segmentation/tokenization module, which is based on SVM-rank. The SVM uses a linear kernel with a variety of features and lexicons to rank the possible segmentations of a word. The features include likelihoods of stems, prefixes, suffixes, and their combinations; presence in lexicons containing valid stems or named entities; and underlying stem templates.
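The ranking step can be illustrated with a minimal sketch. With a linear kernel, a ranking model reduces to a weight vector, and each candidate segmentation is scored by the dot product of its features with those weights. The feature names, weights, and candidate values below are hypothetical and chosen only for illustration; they are not Farasa's actual features, model, or API.

```python
def score(features, weights):
    """Linear score: dot product of a sparse feature dict with the weights."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def best_segmentation(candidates, weights):
    """Return the candidate whose features receive the highest linear score."""
    return max(candidates, key=lambda c: score(c["features"], weights))


# Hypothetical weights, standing in for what a ranking SVM would learn.
WEIGHTS = {"stem_logprob": 1.0, "prefix_logprob": 0.5, "stem_in_lexicon": 2.0}

# Two hypothetical candidate segmentations of one transliterated word.
CANDIDATES = [
    {"segmentation": "w+ktAb",
     "features": {"stem_logprob": -2.0, "prefix_logprob": -1.0, "stem_in_lexicon": 1.0}},
    {"segmentation": "wktAb",
     "features": {"stem_logprob": -9.0, "prefix_logprob": 0.0, "stem_in_lexicon": 0.0}},
]

best = best_segmentation(CANDIDATES, WEIGHTS)
```

Here the first candidate wins because its stem is likely and appears in the (hypothetical) lexicon, which outweighs the small penalty for the prefix.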
Translating into Arabic is tricky because Arabic spelling is often inconsistent in terms of:
punctuation (using both Arabic UTF-8 and English punctuation symbols),
digits (appearing as both Arabic and Indian characters),
diacritics (which can be used or omitted, and are often wrong),
spelling (some Arabic characters are frequently misspelled, esp. Alef and Ta Marbuta; also, Waa sometimes appears separated).
This script normalizes Arabic to make it consistent for the purpose of machine translation (MT) evaluation.
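A minimal sketch of this kind of normalization is shown below. It is not the script itself: the specific mappings (which diacritics to strip, which punctuation and Alef variants to unify) are assumptions made for illustration, and the Ta Marbuta and separated-Waa cases are left out because the intended rules are not stated here.

```python
import re

# Arabic-Indic digits U+0660..U+0669 mapped to ASCII digits 0..9.
DIGITS = {0x0660 + i: str(i) for i in range(10)}

# Arabic comma, semicolon, and question mark mapped to ASCII equivalents
# (assumed target convention).
PUNCTUATION = {0x060C: ",", 0x061B: ";", 0x061F: "?"}

# Alef with madda, hamza above, and hamza below mapped to bare Alef (U+0627).
ALEF_VARIANTS = {0x0622: "\u0627", 0x0623: "\u0627", 0x0625: "\u0627"}

# Tanween, short vowels, shadda, and sukun: U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize_arabic(text):
    """Apply the assumed normalization rules in a fixed order."""
    text = DIACRITICS.sub("", text)       # drop diacritics
    text = text.translate(DIGITS)         # unify digits
    text = text.translate(PUNCTUATION)    # unify punctuation
    text = text.translate(ALEF_VARIANTS)  # unify Alef spelling
    return text
```

Applying the same function to both the system output and the reference before scoring keeps purely orthographic differences from being counted as translation errors.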
This package includes:
Implementation of the standard evaluation metrics as described in Dan Marcu's book.
PrepOCRessor is a tool for preprocessing document images for optical character recognition. The tool follows the pipeline paradigm in Unix-like operating systems: A set of image processing operations is chained such that the output of each operation serves as input to the next one. The tool supports batch processing for high parallelism and scalability. Even though we focus on Arabic script, the tool has been successfully used for other writing systems, e.g. Latin in the ICDAR2015 Competition HTRtS on historic documents.
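The pipeline idea can be sketched in a few lines. This is an illustration only, not PrepOCRessor's interface: the operation names, the use of nested lists as grayscale images, and the threshold value are all assumptions.

```python
from functools import reduce


def binarize(image, threshold=128):
    """Map each grayscale pixel to 0 or 255 using a fixed threshold."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]


def invert(image):
    """Invert pixel intensities in the 0..255 range."""
    return [[255 - px for px in row] for row in image]


def run_pipeline(image, operations):
    """Feed the output of each operation into the next one, in order."""
    return reduce(lambda img, op: op(img), operations, image)


def run_batch(images, operations):
    """Apply the same pipeline to every image in a batch."""
    return [run_pipeline(img, operations) for img in images]


# A tiny hypothetical 1x2 grayscale image.
result = run_pipeline([[10, 200]], [binarize, invert])
```

Because each image passes through the pipeline independently, the loop in `run_batch` could be handed to a process pool without changing the operations themselves.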
Copyright Qatar Computing Research Institute. All rights reserved.