Open Source OCR Engine

An OCR engine with Unicode (UTF-8) support that can recognize over 100 languages out of the box.

Simple example

text <- ocr("")

Roundtrip test: render PDF to image and OCR it back to text

library(pdftools)
library(tesseract)

# A PDF file with some text
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)[1]

# Render the first page to a bitmap and save it as a TIFF image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

# Extract text from the image and print it for comparison with orig
out <- ocr("page.tiff")
cat(out)
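
One rough way to judge the roundtrip is to measure how many of the original words survive OCR. The sketch below uses base R only. The helper name word_recall is hypothetical and not part of any package, and the two strings are toy stand-ins for orig and out.

```r
# Share of distinct words in `reference` that also occur in `candidate`.
# Illustrative helper only; the name is an assumption, not a package function.
word_recall <- function(reference, candidate) {
  tokenize <- function(x) {
    # Lower-case, split on runs of non-alphanumeric characters, drop empties
    words <- unlist(strsplit(tolower(x), "[^[:alnum:]]+"))
    unique(words[nzchar(words)])
  }
  ref <- tokenize(reference)
  if (length(ref) == 0) return(NA_real_)
  mean(ref %in% tokenize(candidate))
}

# Toy strings standing in for `orig` and `out`; "verslon" mimics an OCR error
score <- word_recall("R News: changes in version 3", "R News changes in verslon 3")
print(score)
```

A value near 1 suggests that most words were recognized. The measure ignores word order and repeated words, so treat it as a quick sanity check and not as a proper OCR accuracy metric.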

On Windows and macOS the binary package can be installed from CRAN.
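
A minimal sketch of the CRAN installation, assuming the package is published on CRAN under the name tesseract:

```r
# Assumption: the CRAN package name is "tesseract"
install.packages("tesseract")
```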


Installation from source on Linux or macOS requires the Tesseract library (see below).

On Debian or Ubuntu, install libtesseract-dev and libleptonica-dev. Also install tesseract-ocr-eng to run the English examples.

sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng

On Fedora and CentOS, install tesseract-devel and leptonica-devel:

sudo yum install tesseract-devel leptonica-devel

On macOS, install Tesseract from Homebrew:

brew install tesseract

Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages, you can install the training data from your distribution. For example, the Spanish training data is packaged as tesseract-ocr-spa on Debian and Ubuntu and as tesseract-langpack-spa on Fedora.

On other platforms you can manually download training data from GitHub and store it in a directory on disk, which you then pass in the datapath parameter. Alternatively, you can set a default path via the TESSDATA_PREFIX environment variable.
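
The two ways of pointing Tesseract at manually downloaded training data might look like the sketch below. The directory ~/tessdata and the file spa.traineddata are assumptions made for illustration. The code also assumes that the engine constructor is tesseract(language, datapath) and that ocr() accepts it through an engine argument. Whether TESSDATA_PREFIX should name the training data directory itself or its parent depends on the Tesseract version, so check this against your installation.

```r
# Hypothetical directory holding downloaded *.traineddata files
tessdata_dir <- file.path(path.expand("~"), "tessdata")

# Option 1: set a default for the current session via the environment variable
Sys.setenv(TESSDATA_PREFIX = tessdata_dir)

# Option 2: pass the directory explicitly when creating an engine.
# Guarded so the sketch does nothing if the package or the data is missing.
if (requireNamespace("tesseract", quietly = TRUE) &&
    file.exists(file.path(tessdata_dir, "spa.traineddata"))) {
  spanish <- tesseract::tesseract(language = "spa", datapath = tessdata_dir)
  text <- tesseract::ocr("page.tiff", engine = spanish)
  cat(text)
}
```

Passing datapath explicitly keeps the choice local to one engine, while the environment variable affects every engine created afterwards in the same session.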



  • Try to fix build for CRAN OS-X, again.


  • Try to fix build for CRAN OS-X build server
  • Show 'loaded' and 'available' languages in print.tesseract()


  • Initial CRAN release



1.4 by Jeroen Ooms, 9 days ago


Authors: Jeroen Ooms

Documentation:   PDF Manual  

Task views: Natural Language Processing

License: MIT + file LICENSE

Imports: Rcpp, curl, digest

Suggests: magick, pdftools, tiff

Linking to: Rcpp

System requirements: Tesseract >= 3.03 (libtesseract-dev / tesseract-devel) and Leptonica (libleptonica-dev / leptonica-devel). On Debian you need to install the English training data separately (tesseract-ocr-eng).
