Text Extraction, Rendering and Converting of PDF Documents

Utilities based on 'libpoppler' for extracting text, fonts, attachments and metadata from a PDF file. Also supports high-quality rendering of PDF documents into PNG, JPEG, or TIFF format, or into raw bitmap vectors for further processing in R.
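
As a rough illustration of the functions named in this description and in the news below, the following sketch extracts text and metadata from a PDF and renders one page. To stay self-contained it first writes a throwaway one-page PDF with base R's pdf() device; that step, the temporary file names, and the dpi value are assumptions made for the example, not part of the package.

```r
library(pdftools)

# Assumption: build a small one-page PDF with base R so no external file is needed.
pdf_file <- tempfile(fileext = ".pdf")
grDevices::pdf(pdf_file)
plot.new()
text(0.5, 0.5, "Hello pdftools")
invisible(dev.off())

# Extract text: the result holds one character string per page.
txt <- pdf_text(pdf_file)

# Read document metadata and the list of fonts used.
info <- pdf_info(pdf_file)
fonts <- pdf_fonts(pdf_file)

# Render page 1 (page numbers are 1-based) into a raw bitmap array.
bitmap <- pdf_render_page(pdf_file, page = 1, dpi = 72, antialias = TRUE)

# Convert the same page directly to a PNG file on disk.
png_file <- tempfile(fileext = ".png")
pdf_convert(pdf_file, format = "png", pages = 1, filenames = png_file, dpi = 72)
```

Here txt[1] holds the text of the first page and info$pages gives the page count. The bitmap can be passed to the suggested 'png', 'jpeg' or 'webp' packages if further processing in R is needed.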


News

1.8

  • Run configure script in bash

1.7

  • Change autobrew script to avoid dependency on xQuartz

1.6

  • pdf_render_page() and pdf_convert() gain argument to control 'antialias'
  • Small tweaks in pdf_text() for dealing with malformed PDF files

1.5

  • On Windows and macOS we now bundle poppler-data to support non-Latin text
  • Windows: Upgrade libpoppler to 0.61.0 from rwinlib
  • Windows: patch libpoppler bug that would cause pdf_convert() to generate corrupt files
  • PDF rendering errors are relayed via message() instead of warning()

1.4

  • Hide symbols in supported platforms
  • Upgrade libpoppler on Windows

1.3

  • Improve support for reading password-protected and encrypted PDF files (+ unit tests)
  • Support direct conversion from PDF to PNG, JPEG, TIFF (+ unit tests)
  • Switch to Rcpp automatic symbol registration
  • Tweak autobrew script for legacy Mavericks builds

1.2

  • Fix autobrew for OS X Mavericks

1.1

  • Extract autobrew script to separate repo

1.0

  • Add workaround for poppler landscape truncation bug (fixes #7)

0.5

  • Rebuild poppler on Windows to support PDF rendering

0.4

  • Update Homebrew URL in configure script
  • Fix autobrew (rename libopenjpeg -> libopenjp2)
  • Update libpoppler to 0.46 on Windows

0.3

  • Update libpoppler to 0.42 on Windows
  • Use the COMPILED_BY variable on Windows to support R 3.3

0.2

  • Switch pdf_render_page() to 1-based indexing
  • Fix red/blue channel mix-up in pdf_render_page()
  • Update example to use local PDF file

0.1

  • Initial release

install.packages("pdftools")

1.8 by Jeroen Ooms, 6 months ago


https://ropensci.org/blog/2016/03/01/pdftools-and-jeroen (blog)
https://github.com/ropensci/pdftools#readme (devel)
https://poppler.freedesktop.org (upstream)


Report a bug at https://github.com/ropensci/pdftools/issues


Browse source code at https://github.com/cran/pdftools


Authors: Jeroen Ooms [aut, cre]


Documentation: PDF Manual


License: MIT + file LICENSE


Imports: Rcpp

Suggests: jpeg, png, webp, testthat

Linking to: Rcpp

System requirements: Poppler C++ API: libpoppler-cpp-dev (deb) or poppler-cpp-devel (rpm). The unit tests also require the 'poppler-data' package (rpm/deb)


Imported by crminer, findR, fulltext, pdfsearch, rcoreoa, readtext, tesseract, textreadr.

Suggested by goldi, gridGraphics, hunspell, magick, spelling, staplr, tm.

