About

PrepOCRessor is developed at the Qatar Computing Research Institute for preprocessing document images for optical character recognition. The tool follows the pipeline paradigm of Unix-like operating systems: a set of image processing operations is chained such that the output of each operation serves as input to the next one. The tool supports batch processing for high parallelism and scalability. PrepOCRessor is intended to be used in combination with the recognition toolkit Kaldi and supports file formats for feature sets (.ark,t) and forced alignments (.al) for seamless integration. Even though we focus on Arabic script, the tool has been used successfully for other writing systems, e.g. Latin script in the ICDAR2015 Competition HTRtS on historic documents.
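For readers unfamiliar with the pipeline paradigm mentioned above, the following shell sketch shows the general Unix idea using only standard tools (tr, sort, uniq). It is an analogy, not PrepOCRessor syntax: the function name count_words and the sample input are made up for illustration, and the actual operation syntax is described in the Manual.

```shell
#!/bin/sh
# Illustrative Unix pipeline (not PrepOCRessor syntax).
# Each stage reads the output of the previous stage:
#   1. tr     puts every word on its own line
#   2. tr     lowercases the words
#   3. sort   groups identical words together
#   4. uniq   counts each group
#   5. sort   orders by count (descending), then by word
count_words() {
  tr -s '[:space:]' '\n' \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -k1,1nr -k2,2
}

# Hypothetical sample input, chosen only for demonstration.
printf 'Scan scan deskew\nbinarize scan\n' | count_words
```

Because every stage only consumes and produces a stream, stages can be reordered, removed, or replaced without touching the others, which is the property the operation chain in PrepOCRessor builds on.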

Related publications

  • F. Stahlberg and S. Vogel. QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries. In DAS, 2016. [BibTeX]
    @inproceedings{qatip, title={{QATIP - An Optical Character Recognition System for Arabic Heritage Collections in Libraries}}, author={Stahlberg, F. and Vogel, S.}, booktitle={DAS}, year={2016}, organization={IAPR} }
  • F. Stahlberg and S. Vogel. The QCRI recognition system for handwritten Arabic. In ICIAP, 2015. [BibTeX]
    @inproceedings{qcriHWR15, title={{The QCRI Recognition System for Handwritten Arabic}}, author={Stahlberg, F. and Vogel, S.}, booktitle={ICIAP}, year={2015} }
  • F. Stahlberg and S. Vogel. Detecting dense foreground stripes in Arabic handwriting for accurate baseline positioning. In ICDAR, 2015. [BibTeX]
    @inproceedings{stripeBaselineDetection, title={{Detecting dense foreground stripes in Arabic handwriting for accurate baseline positioning}}, author={Stahlberg, F. and Vogel, S.}, booktitle={ICDAR}, year={2015}, organization={IEEE} }
  • F. Stahlberg and S. Vogel. Document skew detection based on Hough space derivatives. In ICDAR, 2015. [BibTeX]
    @inproceedings{houghDerivativeSkewDetection, title={{Document Skew Detection Based on Hough Space Derivatives}}, author={Stahlberg, F. and Vogel, S.}, booktitle={ICDAR}, year={2015}, organization={IEEE} }

Download

The following is a list of PrepOCRessor releases, with the newest release at the top.

  • PrepOCRessor 0.2.1. Released on 1 September 2015. [Manual] [JavaDoc] [Changelog]
    • New option for vertTextSegmentation: -symmetric.
    • Bugfix: concat crashes on empty populations.
    • New option for featExtract: -centerFrames for vertically repositioned windows (Doetsch et al., 2012).
    • New operation: ascendersTextLine for ascender-based baseline estimation for Arabic.
    • New operation: visualizePageXml for visualizing line segmentations and confidences in PAGE XML files.
  • PrepOCRessor 0.2. Released on 24 June 2015. [Manual] [JavaDoc] [Changelog]
    • Added the cutWithAltecXml operation.
    • Added the extractConstantRegions operation, which helps to use PrepOCRessor for OCR in videos.
    • New option: -singlePopulation, useful in combination with extractConstantRegions.
    • New placeholder for path specifications: %unqatip for better integration with QATIP.
    • New option for houghTextLine operation: -useTruIfAvailable to use reference baseline in training.
    • Fix memory leaks in the vertTextSegmentation, textSkewCorrection, and houghTextLine operations.
    • Reduce memory demand within operations by releasing matrices as soon as possible.
    • Fix Java exception in vertTextSegmentation on nearly completely black images (a warning is output instead).
    • Don't use morphology operators in vertTextSegmentation for enforcing -minMargin and -minWidth as they cause problems at image borders.
    • New option -transpose for the vertTextSegmentation operation, which helps to keep track of rectangles when using vertTextSegmentation for horizontal segmentation.
    • renderPageXmlTranscriptions now searches for plain text entries in text regions if no low level text objects are available.
    • renderPageXmlTranscriptions now supports UTF-8 and has a new option -align for text alignment.
    • New option -border in the cutWithPageXml operation, for results that are intended for human inspection.
    • New operation normalizePenSize for line thinning for OCR.
    • New operation removeSmallComponents which can be used in combination with invert to remove holes.
    • Fix bug in line thinning with white areas at the image border.
  • PrepOCRessor 0.1. Released on 7 May 2015. [Manual] [JavaDoc]

The latest source code can be accessed via the PrepOCRessor Bitbucket repository.

Installation

The following instructions explain the installation on Debian-based systems like Ubuntu but can easily be adapted to other platforms. The commands in this guide should work in standard Unix shells like zsh and bash. They were tested on Ubuntu 15.04. For more information, see the Manual.

  1. Install the Java runtime environment. PrepOCRessor was tested with Java 1.7 but should run with other versions as well. On Ubuntu, Java is installed by default. You can check the version number by typing java -version into your shell.
  2. Install the OpenCV library. PrepOCRessor was tested with OpenCV 2.4.10 but other 2.4.x versions are likely to work. Ubuntu provides out-of-the-box packages which can be installed with the following command:
    sudo apt-get install libopencv2.4-java
    If you are not using Ubuntu, you can download the latest OpenCV 2.4.x version and follow the installation instructions. For more information about the Java support of OpenCV, check the OpenCV Documentation.
  3. Download PrepOCRessor. The easiest way to get started with PrepOCRessor is to download the latest release and unzip it wherever you would like to install PrepOCRessor. Alternatively, you can compile PrepOCRessor yourself. The PrepOCRessor repository contains an Eclipse project with the source code.
  4. Configure PrepOCRessor. If you don't use Ubuntu Linux, or you compiled OpenCV yourself instead of using the Ubuntu packages, you need to tell PrepOCRessor where to find the OpenCV library. Open the prepocressor file in the installation root directory in your favourite text editor and set the variables OPENCV_JAR_PATH and OPENCV_NATIVE_LIB. The variable OPENCV_JAR_PATH should point to the OpenCV .jar file. For example, in OpenCV 2.4.10 this file can be found within the OpenCV installation in <opencv-install-dir>/bin/opencv-2410.jar. If you can't find it, you may have compiled OpenCV without Java support. The OPENCV_NATIVE_LIB variable needs to contain the native library directory path (usually <opencv-install-dir>/lib). This directory should contain a file called libopencv_java2410.so or similar.
  5. Test the PrepOCRessor installation. You can start PrepOCRessor by changing into the installation directory and typing the following command into the shell:
    ./prepocressor -help
    This should output a list of global parameters together with a description of each of them. To test whether the OpenCV library is installed and configured correctly, type ./prepocressor (i.e. without arguments). The output should be similar to this:
    13:33:14 INFO: Configuration loaded...
    13:33:14 FATAL: Input file 'imageList.txt' reading
    error: imageList.txt (No such file or directory)
    If you get a significantly different output, consult the manual for troubleshooting.
  6. Make your shell aware of PrepOCRessor. This manual assumes that you have included PrepOCRessor in your $PATH environment variable so that you can start it by typing prepocressor into your shell. You can do this by adding the following line at the end of your ~/.bashrc:
    export PATH=$PATH:<prepocressor-install-dir>
    Alternatively, you can create a symlink to PrepOCRessor in a directory which is already in your $PATH variable. Note that ln -s expects the existing file first and the name of the new link second:
    sudo ln -s <prepocressor-install-dir>/prepocressor /usr/local/bin/prepocressor
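The configuration and $PATH steps above can be sanity-checked with a small sh/bash sketch like the one below. It is not part of PrepOCRessor: the function names check_opencv_paths and add_to_path, the variables OPENCV_DIR and PREPOCRESSOR_DIR, and their default locations are assumptions made for illustration. Only the file names opencv-2410.jar and libopencv_java*.so are taken from the instructions above.

```shell
#!/bin/sh
# Sketch only: helper names and default directories are hypothetical.

# Succeeds if the OpenCV jar exists and the native library directory
# contains a file matching libopencv_java*.so.
check_opencv_paths() {
  jar_path=$1
  native_lib_dir=$2
  if [ ! -f "$jar_path" ]; then
    echo "OpenCV jar not found: $jar_path" >&2
    return 1
  fi
  # ls exits with an error if the pattern matches no file
  if ! ls "$native_lib_dir"/libopencv_java*.so >/dev/null 2>&1; then
    echo "No libopencv_java*.so in: $native_lib_dir" >&2
    return 1
  fi
  echo "OpenCV paths look plausible"
}

# Appends a directory to PATH unless it is already listed.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, nothing to do
    *) PATH="$PATH:$1" ;;
  esac
  export PATH
}

# Placeholder locations (assumptions); adjust them to your system.
OPENCV_DIR="${OPENCV_DIR:-/usr/local}"
PREPOCRESSOR_DIR="${PREPOCRESSOR_DIR:-$HOME/prepocressor}"

check_opencv_paths "$OPENCV_DIR/bin/opencv-2410.jar" "$OPENCV_DIR/lib" || true
add_to_path "$PREPOCRESSOR_DIR"
```

If check_opencv_paths reports a missing file, the values it was given are the ones to correct in OPENCV_JAR_PATH and OPENCV_NATIVE_LIB. The add_to_path helper avoids duplicate $PATH entries when a startup file is sourced more than once.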

License

PrepOCRessor (Copyright 2015, QCRI a member of Qatar Foundation. All Rights Reserved) is licensed under the Apache License, Version 2.0 (the "License"); you may not use it except in compliance with the License. You may obtain a copy of the License here.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Support for the PAGE format is provided by the PRImA PAGE library, which is Apache 2.0 licensed software with copyright by the University of Salford, Manchester. PrepOCRessor releases contain copies of this library.

Mathematical optimization is provided by the Math component of the Apache Commons project developed by the Apache Software Foundation under the Apache 2.0 license. PrepOCRessor releases contain copies of this library.

The PrepOCRessor logo contains parts of the Kaldi logo. Kaldi uses the Apache 2.0 license.

The computer vision backend is based on OpenCV under the BSD license. The OpenCV library is not distributed with this software package and needs to be installed separately.