Arabic POS Tagger Library

About

Arabic POS Tagger is a library comprising a statistical tokenizer, a part-of-speech tagger, a named-entity tagger, a gender and number tagger, and a diacritizer. The core engine of this library was trained using Conditional Random Fields (CRF++).
CRFs are used for segmenting and labeling sequential data, among other NLP tasks.

Related publications

  • K. Darwish, A. Abdelali and H. Mubarak. Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland, 2014. Pp. 2926-2931.
    @InProceedings{DARWISH14.335,
      author = {Kareem Darwish and Ahmed Abdelali and Hamdy Mubarak},
      title = {Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging},
      booktitle = {Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)},
      year = {2014},
      month = {may},
      date = {26-31},
      address = {Reykjavik, Iceland},
      editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Hrafn Loftsson and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
      publisher = {European Language Resources Association (ELRA)},
      isbn = {978-2-9517408-8-4},
      language = {english}
    }

Download

You can also check out the latest version from the GitHub repository: https://github.com/Qatar-Computing-Research-Institute/ArabicProcessingTools/tree/master/ArabicTextAnalyzer

Demonstration

Check the Arabic POS Tagger online demo.

Contents

Package content: Arabic POS Tagger library.

How to run the software

1- Command line:

java -Xmx800m -Djava.library.path=. -jar ArabicPOSTaggerLib.jar <options> < [file to parse]

Parameters:

ArabicPOSTaggerLib.jar <--help|-h> [--task|-t pos|tok|ner] [--klm|-k kenlmDir] < filename
options:
  --help    display help information
  --task    tok: Parse file using tokenization model
            pos: Parse file using both tokenization and pos models
            ner: Parse file using tokenization, pos and named entities models
  --klm kenlmdir : Directory with kenlm binary

Example:

java -Xmx2800m -Djava.library.path=./data/ -jar ArabicPOSTaggerLib.jar -t pos < Tweets.txt

For Windows environments, you may need to specify the library path explicitly:

java -Xmx1024m -Djava.library.path=./data/ -jar ArabicPOSTaggerLib.jar -t pos -k models < Tweets1.txt
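The command lines above can also be assembled and launched from another Java program. The following is a minimal sketch under stated assumptions: the class TaggerCommand and its methods build and run are hypothetical helpers, not part of the library, and only the flags shown in the examples above are used. It assumes ArabicPOSTaggerLib.jar is in the working directory.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the library): assembles the documented command line.
final class TaggerCommand {

    /** Builds the argument list; kenlmDir may be null to omit the -k option. */
    static List<String> build(int maxHeapMb, String libraryPath, String task, String kenlmDir) {
        // The documented tasks are tok, pos and ner.
        if (!task.equals("tok") && !task.equals("pos") && !task.equals("ner")) {
            throw new IllegalArgumentException("task must be tok, pos or ner: " + task);
        }
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx" + maxHeapMb + "m");
        cmd.add("-Djava.library.path=" + libraryPath);
        cmd.add("-jar");
        cmd.add("ArabicPOSTaggerLib.jar");
        cmd.add("-t");
        cmd.add(task);
        if (kenlmDir != null) {
            cmd.add("-k");
            cmd.add(kenlmDir);
        }
        return cmd;
    }

    /** Runs the command, feeding the input file on standard input (the "< file" part). */
    static int run(List<String> cmd, File input) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.inheritIO();           // tagger output and errors go to this process's console
        pb.redirectInput(input);  // standard input is read from the given file instead
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = build(2800, "./data/", "pos", null);
        System.out.println(String.join(" ", cmd));
        // Only launch the tagger when an input file is given as the first argument.
        if (args.length > 0) {
            System.exit(run(cmd, new File(args[0])));
        }
    }
}
```

Run without arguments, the sketch only prints the assembled command. Run with a file name, it launches the tagger with that file as input.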

2- Java API: See the Java documentation for the class ArabicPOSTagger.testCase, which provides the following methods:

public static void processFile(java.lang.String dataDirectory, java.lang.String kenlmDirectory, java.lang.String inputFile, java.lang.String outputFile, boolean bDenormalizeText, int mode)
public static void processSTDIN(java.lang.String dataDirectory, java.lang.String kenlmDirectory, int mode, boolean bDenormalizeText)

How to compile the software

Build the jar:

ant jar

Deploy the package to another directory:

ant deploy -Do=<Dest Dir>

Dependencies

ArabicPOSTaggerLib (Arabic Text Analyzer) uses a Java Native Interface (JNI) wrapper to access CRF++ functionality. Two files from CRF++ are needed:

  • CRFPP.jar
  • a platform-dependent library:
    • libCRFPP.jnilib (Mac OS)
    • libCRFPP.so (Linux x86_64)
    • CRFPP.dll (Windows)
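To check which of the native library files listed above applies to a given machine, one can map the JVM's os.name system property to the file name. This is a minimal sketch: the class CrfppNativeLibrary and its method fileNameFor are hypothetical and not part of the library, and the substring matching on os.name is an assumption about typical values such as "Mac OS X", "Linux" and "Windows 10".

```java
import java.util.Locale;

// Hypothetical helper (not part of the library): picks the CRF++ JNI library file name.
final class CrfppNativeLibrary {

    /** Maps an os.name value to the platform-dependent file name listed above. */
    static String fileNameFor(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        // Check "mac"/"darwin" before "win", because "darwin" contains "win".
        if (os.contains("mac") || os.contains("darwin")) {
            return "libCRFPP.jnilib";
        }
        if (os.contains("win")) {
            return "CRFPP.dll";
        }
        if (os.contains("linux")) {
            return "libCRFPP.so";
        }
        throw new IllegalArgumentException("No CRF++ native library listed for: " + osName);
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        try {
            System.out.println("Expected native library: " + fileNameFor(osName));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
        // The JVM searches these directories for native libraries (-Djava.library.path).
        System.out.println("java.library.path = " + System.getProperty("java.library.path"));
    }
}
```

The printed file should be present in one of the directories named by -Djava.library.path when the tagger is started.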

You can download the source code for CRF++ from http://code.google.com/p/crfpp/. More details about compiling and building CRF++ are provided below.

  • KenLM language model: a KenLM language model binary for querying. It is used for denormalization: the input text is used as a query, and denormalized text is generated. The source code can be downloaded from http://kheafield.com/code/kenlm/

To build CRFPP.jar:

cd <CRF++ MainDir>/java
make

To build libCRFPP.jnilib:

cd <CRF++ MainDir>/java
make
g++ -dynamiclib -Wl,-undefined -Wl,dynamic_lookup -o libCRFPP.jnilib .libs/libcrfpp.o .libs/lbfgs.o .libs/param.o .libs/encoder.o .libs/feature.o .libs/feature_cache.o .libs/feature_index.o .libs/node.o .libs/path.o .libs/tagger.o java/CRFPP_wrap.o -lpthread -lm

To build libCRFPP.so (Linux):

cd <CRF++ MainDir>/java
make
g++ -fPIC -DPIC -shared -nostdlib /usr/lib/crti.o /usr/lib/gcc/i486-linux-gnu/4.4.3/crtbeginS.o .libs/libcrfpp.o .libs/lbfgs.o .libs/param.o .libs/encoder.o .libs/feature.o .libs/feature_cache.o .libs/feature_index.o .libs/node.o .libs/path.o .libs/tagger.o ./java/CRFPP_wrap.o -lpthread -L/usr/lib/gcc/i486-linux-gnu/4.4.3 -L/usr/lib -L/usr/lib/i486-linux-gnu -lstdc++ -lm -lc -lgcc_s /usr/lib/gcc/i486-linux-gnu/4.4.3/crtendS.o /usr/lib/crtn.o -O3 -mieee-fp -Wl,-soname -Wl,libcrfpp.so.0 -o libCRFPP.so

For Windows environment:

- Copy the Makefile.lib.msvc to <CRF++ MainDir>
- access <CRF++ MainDir>
- run "nmake -f Makefile.lib.msvc" to build CRFPP.dll
- access <CRF++ MainDir>/java
- run "nmake" or "javac org/chasen/crfpp/*.java"
- jar cfv CRFPP.jar org/chasen/crfpp/*.class

Copy both files to the root directory of the Arabic POS Tagger Library, or to another destination alongside ArabicAnalyzer.jar.

Contact

If you have any problems or questions, please contact kdarwish@qf.org.qa or aabdelali@qf.org.qa.

License

This code is made public for RESEARCH purposes only, except for the binaries and libraries in the dependencies listed below, which have their own licenses. See the references in these files for more details.
- CRF++
- KenLM

For the rest:

QATARA Library is being made public for research purposes only. For non-research use, please contact:
Kareem M. Darwish <kdarwish@qf.org.qa>
Ahmed Abdelali <aabdelali@qf.org.qa>

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Other licenses can be requested.