An OCR engine with Unicode (UTF-8) support that can recognize over 100 languages out of the box.
library(tesseract)
text <- ocr("")
cat(text)
Roundtrip test: render PDF to image and OCR it back to text
library(tesseract)
library(pdftools)
library(tiff)

# A PDF file with some text
setwd(tempdir())
news <- file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf")
orig <- pdf_text(news)

# Render pdf to jpeg/tiff image
bitmap <- pdf_render_page(news, dpi = 300)
tiff::writeTIFF(bitmap, "page.tiff")

# Extract text from images
out <- ocr("page.tiff")
cat(out)
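To put a number on how well the roundtrip worked, the OCR output can be compared with the text extracted directly from the PDF. The following is only a rough sketch: `word_overlap` is a hypothetical helper written for this illustration (it is not part of tesseract or pdftools), and the share of recovered words is a crude proxy for OCR quality, not a standard metric.

```r
# Hypothetical helper (not part of any package): fraction of the distinct
# words in `reference` that also appear in `candidate`, ignoring case and
# punctuation. Returns NA if `reference` contains no words.
word_overlap <- function(reference, candidate) {
  tokenize <- function(x) {
    # Collapse to one string, lower-case it, split on non-alphanumeric runs
    words <- unlist(strsplit(tolower(paste(x, collapse = " ")), "[^[:alnum:]]+"))
    unique(words[nzchar(words)])
  }
  ref <- tokenize(reference)
  if (length(ref) == 0) return(NA_real_)
  mean(ref %in% tokenize(candidate))
}

# Toy example: "OCR" was misread as "0CR", so 2 of 3 words are recovered
word_overlap("Hello OCR world", "hello 0CR world")
```

With the variables from the roundtrip example, `word_overlap(orig[1], out)` would compare the first page of the PDF text with the OCR result, since `pdf_render_page` renders the first page by default.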
On Windows and macOS the binary package can be installed from CRAN:
install.packages("tesseract")
Installation from source on Linux or macOS requires the Tesseract library (see below).
On Debian or Ubuntu install the development headers and the English training data:
sudo apt-get install -y libtesseract-dev libleptonica-dev tesseract-ocr-eng
On Fedora:
sudo yum install tesseract-devel leptonica-devel
On macOS use tesseract from Homebrew:
brew install tesseract
Tesseract uses training data to perform OCR. Most systems default to English training data. To improve OCR performance for other languages you can install the training data from your distribution. For example, to install the Spanish training data:
tesseract-ocr-spa (Debian, Ubuntu)
tesseract-langpack-spa (Fedora, EPEL)
On other platforms you can manually download training data from GitHub and store it in a path on disk that you pass in the datapath parameter. Alternatively, you can set a default path via the TESSDATA_PREFIX environment variable.
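As a minimal sketch of both options, assuming the package's `tesseract()` engine constructor accepts `language` and `datapath` arguments and that `ocr()` takes the resulting engine: the directory name and the `spa.traineddata` file below are hypothetical placeholders for data you have downloaded yourself, and the OCR step is skipped if they are missing.

```r
# Hypothetical directory holding manually downloaded *.traineddata files
tessdata_dir <- file.path(tempdir(), "tessdata")
dir.create(tessdata_dir, showWarnings = FALSE)

# Run OCR only if the package, the (hypothetical) Spanish training data,
# and the example image are actually available
have_data <- file.exists(file.path(tessdata_dir, "spa.traineddata"))
if (requireNamespace("tesseract", quietly = TRUE) && have_data &&
    file.exists("page.tiff")) {
  # Option 1: pass the directory explicitly when creating the engine
  spa <- tesseract::tesseract(language = "spa", datapath = tessdata_dir)
  cat(tesseract::ocr("page.tiff", engine = spa))
}

# Option 2: set a default location for the rest of the R session.
# Which directory level Tesseract expects here can depend on the
# Tesseract version, so treat this value as an assumption to verify.
Sys.setenv(TESSDATA_PREFIX = tessdata_dir)
```

Passing `datapath` explicitly keeps the choice local to one engine, whereas the environment variable affects every engine created afterwards in the session.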