QCRI Advanced Tools for ARAbic (QATARA) is a library of statistical tools for Arabic: a tokenizer, a part-of-speech tagger, a named-entity tagger, a gender and number tagger, and a diacritizer. The core engine of the library was trained with Conditional Random Fields (CRFs), using CRF++.
CRFs are widely used for segmenting and labeling sequential data, among other NLP tasks.
You can also check out the latest version from the GitHub repository: QATARA Repository
The data directory is not included. Download the data from Data.tgz
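Because the data directory ships separately, it can help to confirm it is in place before running the tools. The following is a minimal sketch, not part of the QATARA distribution: the function name qatara_check_data is hypothetical, and the directory to check is whichever one you pass to -Djava.library.path.

```shell
# qatara_check_data DIR: succeed only if DIR exists and is not empty.
# Hypothetical helper; it does not inspect which model files are present.
qatara_check_data() {
    dir="${1:-}"
    if [ ! -d "$dir" ]; then
        echo "data directory not found: $dir" >&2
        return 1
    fi
    if [ -z "$(ls -A "$dir")" ]; then
        echo "data directory is empty: $dir" >&2
        return 1
    fi
    return 0
}
```

For example, after extracting the archive with tar -xzf Data.tgz, you could run qatara_check_data ../data (assuming the archive unpacks to a directory at that path, which this README does not state).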
Check the QATARA online demo.
Package content: QATARA library.
Command line:
java -Xmx2048m -Djava.library.path=../data/ -cp ./lib/ArabicPOSTaggerLib.jar:./lib/CRFPP.jar:./lib/trove-3.0.3.jar:./lib/weka.jar:./dist/ProcessDiacritizedLexicon.jar:. QataraLib <options> < [file to parse]
Parameters:
QataraLib <--help|-h> [--task|-t pos|tok|ner|diac] [--klm|-k kenlmDir] < filename
Options:
* --help, -h: display help information
* --task, -t: select the models used to parse the file
  * tok: tokenization model only
  * pos: tokenization and POS models
  * ner: tokenization, POS and named-entity models
  * diac: diacritize the text
* --klm, -k kenlmDir: directory containing the KenLM binary
Example:
java -Xmx2048m -Djava.library.path=../data/ -cp ./lib/ArabicPOSTaggerLib.jar:./lib/CRFPP.jar:./lib/trove-3.0.3.jar:./lib/weka.jar:./dist/ProcessDiacritizedLexicon.jar:. QataraLib -t diac < testfile.txt
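Since the full command is long, a small shell wrapper can make it easier to switch between tasks. This is a sketch under assumptions: the names QATARA_CP, QATARA_DATA, qatara_valid_task and qatara are hypothetical, while the jar paths, JVM flags and task names are copied from the command line and usage text above.

```shell
# Classpath and data directory taken from the example command above.
QATARA_CP="./lib/ArabicPOSTaggerLib.jar:./lib/CRFPP.jar:./lib/trove-3.0.3.jar:./lib/weka.jar:./dist/ProcessDiacritizedLexicon.jar:."
QATARA_DATA="${QATARA_DATA:-../data/}"

# Succeeds only for the task names listed in the usage text.
qatara_valid_task() {
    case "${1:-}" in
        tok|pos|ner|diac) return 0 ;;
        *) return 1 ;;
    esac
}

# qatara TASK FILE: run QataraLib with the given task, reading FILE on stdin.
# Returns 2 without starting Java if TASK is not a known task name.
qatara() {
    if ! qatara_valid_task "${1:-}"; then
        echo "unknown task: ${1:-} (expected tok, pos, ner or diac)" >&2
        return 2
    fi
    java -Xmx2048m -Djava.library.path="$QATARA_DATA" -cp "$QATARA_CP" \
        QataraLib -t "$1" < "$2"
}
```

With these definitions loaded, qatara diac testfile.txt would run the same command as the example, and qatara tok testfile.txt would switch to tokenization only.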
For Windows environments: you may need to specify the library path explicitly. Java on Windows also separates classpath entries with ';' rather than ':', so the classpath is quoted and uses ';' here:
java -Xmx1024m -Djava.library.path=./data/ -cp "./lib/ArabicPOSTaggerLib.jar;./lib/CRFPP.jar;./lib/trove-3.0.3.jar;./lib/weka.jar;./dist/ProcessDiacritizedLexicon.jar;." QataraLib -t diac < testfile.txt
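Java separates classpath entries with ':' on Unix-like systems and ';' on Windows, so the jar list may need a different separator depending on where the command runs. The sketch below builds the classpath from the same jar list for either case; the function names qatara_classpath and qatara_path_sep are hypothetical, and the uname patterns are an assumption about Windows-hosted POSIX shells such as Cygwin or MSYS.

```shell
# qatara_classpath SEP: print the jar list from this README joined with SEP.
qatara_classpath() {
    sep="${1:-:}"
    cp=""
    for entry in ./lib/ArabicPOSTaggerLib.jar ./lib/CRFPP.jar \
                 ./lib/trove-3.0.3.jar ./lib/weka.jar \
                 ./dist/ProcessDiacritizedLexicon.jar .; do
        if [ -z "$cp" ]; then
            cp="$entry"
        else
            cp="$cp$sep$entry"
        fi
    done
    printf '%s\n' "$cp"
}

# Print ';' when the shell reports a Windows-hosted environment, ':' otherwise.
qatara_path_sep() {
    case "$(uname -s)" in
        CYGWIN*|MINGW*|MSYS*) printf ';\n' ;;
        *) printf ':\n' ;;
    esac
}
```

The result can be passed to Java as -cp "$(qatara_classpath "$(qatara_path_sep)")"; quoting matters because an unquoted ';' ends the command in most shells.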
Build the jar:
ant jar
Deploy the package to another directory:
ant deploy -Do=<Dest Dir>
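The two Ant steps above can be chained so that deployment only happens after a successful build. This is a minimal sketch; the function name qatara_build_and_deploy is hypothetical, and the only Ant invocations used are the two shown above.

```shell
# qatara_build_and_deploy DEST: build the jar, then deploy it to DEST.
# Returns 2 without calling Ant if no destination directory is given.
qatara_build_and_deploy() {
    dest="${1:-}"
    if [ -z "$dest" ]; then
        echo "usage: qatara_build_and_deploy <Dest Dir>" >&2
        return 2
    fi
    # '&&' skips the deploy step when 'ant jar' fails.
    ant jar && ant deploy -Do="$dest"
}
```

Quoting "$dest" keeps destination paths that contain spaces in a single argument.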
The QATARA Arabic Text Analyzer uses a Java Native Interface (JNI) wrapper to access CRF++ functionality. Two files from CRF++ are needed.
You can download the source code for CRF++ from http://code.google.com/p/crfpp/ (see the ArabicPOSTaggerLib documentation for how to build CRFPP and KenLM).
If you have any problems or questions, please contact kdarwish@qf.org.qa or aabdelali@qf.org.qa.
This code is made public for RESEARCH purposes only, except for the binaries and libraries in the dependencies, which have their own licenses, listed below. See the references in these files for more details.
- CRF++
- KenLM
For the rest:
The QATARA Library is being made public for research purposes only.
For non-research use, please contact:
Kareem M. Darwish <kdarwish@qf.org.qa>
Ahmed Abdelali <aabdelali@qf.org.qa>
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Other licenses can be requested.
Copyright Qatar Computing Research Institute. All rights reserved.