Read Excel Files

Import excel files into R. Supports '.xls' via the embedded 'libxls' C library < https://github.com/evanmiller/libxls> and '.xlsx' via the embedded 'RapidXML' C++ library < https://rapidxml.sourceforge.net>. Works on Windows, Mac and Linux without external dependencies.


Travis-CI Build Status AppVeyor Build Status Coverage Status CRAN_Status_Badge

Overview

The readxl package makes it easy to get data out of Excel and into R. Compared to many of the existing packages (e.g. gdata, xlsx, xlsReadWrite) readxl has no external dependencies, so it's easy to install and use on all operating systems. It is designed to work with tabular data.

readxl supports both the legacy .xls format and the modern xml-based .xlsx format. The libxls C library is used to support .xls, which abstracts away many of the complexities of the underlying binary format. To parse .xlsx, we use the RapidXML C++ library.

Installation

The easiest way to install the latest released version from CRAN is to install the whole tidyverse.

install.packages("tidyverse")

NOTE: you will still need to load readxl explicitly, because it is not a core tidyverse package loaded via library(tidyverse).

Alternatively, install just readxl from CRAN:

install.packages("readxl")

Or install the development version from GitHub:

devtools::install_github("tidyverse/readxl")

Usage

library(readxl)

readxl includes several example files, which we use throughout the documentation. Use the helper readxl_example() with no arguments to list them or call it with an example filename to get the path.

readxl_example()
#>  [1] "clippy.xls"    "clippy.xlsx"   "datasets.xls"  "datasets.xlsx"
#>  [5] "deaths.xls"    "deaths.xlsx"   "geometry.xls"  "geometry.xlsx"
#>  [9] "type-me.xls"   "type-me.xlsx"
readxl_example("clippy.xls")
#> [1] "/Users/jenny/resources/R/library/readxl/extdata/clippy.xls"

read_excel() reads both xls and xlsx files and detects the format from the extension.

xlsx_example <- readxl_example("datasets.xlsx")
read_excel(xlsx_example)
#> # A tibble: 150 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> # ... with 147 more rows
 
xls_example <- readxl_example("datasets.xls")
read_excel(xls_example)
#> # A tibble: 150 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> # ... with 147 more rows

List the sheet names with excel_sheets().

excel_sheets(xlsx_example)
#> [1] "iris"     "mtcars"   "chickwts" "quakes"

Specify a worksheet by name or number.

read_excel(xlsx_example, sheet = "chickwts")
#> # A tibble: 71 × 2
#>   weight      feed
#>    <dbl>     <chr>
#> 1    179 horsebean
#> 2    160 horsebean
#> 3    136 horsebean
#> # ... with 68 more rows
read_excel(xls_example, sheet = 4)
#> # A tibble: 1,000 × 5
#>      lat   long depth   mag stations
#>    <dbl>  <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.42 181.62   562   4.8       41
#> 2 -20.62 181.03   650   4.2       15
#> 3 -26.00 184.10    42   5.4       43
#> # ... with 997 more rows

There are various ways to control which cells are read. You can even specify the sheet here, if providing an Excel-style cell range.

read_excel(xlsx_example, n_max = 3)
#> # A tibble: 3 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
read_excel(xlsx_example, range = "C1:E4")
#> # A tibble: 3 × 3
#>   Petal.Length Petal.Width Species
#>          <dbl>       <dbl>   <chr>
#> 1          1.4         0.2  setosa
#> 2          1.4         0.2  setosa
#> 3          1.3         0.2  setosa
read_excel(xlsx_example, range = cell_rows(1:4))
#> # A tibble: 3 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
read_excel(xlsx_example, range = cell_cols("B:D"))
#> # A tibble: 150 × 3
#>   Sepal.Width Petal.Length Petal.Width
#>         <dbl>        <dbl>       <dbl>
#> 1         3.5          1.4         0.2
#> 2         3.0          1.4         0.2
#> 3         3.2          1.3         0.2
#> # ... with 147 more rows
read_excel(xlsx_example, range = "mtcars!B1:D5")
#> # A tibble: 4 × 3
#>     cyl  disp    hp
#>   <dbl> <dbl> <dbl>
#> 1     6   160   110
#> 2     6   160   110
#> 3     4   108    93
#> # ... with 1 more rows

If NAs are represented by something other than blank cells, set the na argument.

read_excel(xlsx_example, na = "setosa")
#> # A tibble: 150 × 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl>   <chr>
#> 1          5.1         3.5          1.4         0.2    <NA>
#> 2          4.9         3.0          1.4         0.2    <NA>
#> 3          4.7         3.2          1.3         0.2    <NA>
#> # ... with 147 more rows

If you are new to the tidyverse conventions for data import, you may want to consult the data import chapter in R for Data Science. readxl will become increasingly consistent with other packages, such as readr.

Articles

You can find more detail in these articles:

  • Cell and Column Types
  • Sheet Geometry
  • readxl Workflows

Features

  • No external dependency on, e.g., Java or Perl.

  • Re-encodes non-ASCII characters to UTF-8.

  • Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.

  • Discovers the minimal data rectangle and returns that, by default. User can exert more control with range, skip, and n_max.

  • Column names and types are determined from the data in the sheet, by default. User can also supply via col_names and col_types.

  • Returns a tibble, i.e. a data frame with an additional tbl_df class. Among other things, this provide nicer printing.

Other relevant packages

Here are some other packages with functionality that is complementary to readxl and that also avoid a Java dependency.

Writing Excel files: The example files datasets.xlsx and datasets.xls were created with the help of openxlsx (and Excel). openxlsx provides "a high level interface to writing, styling and editing worksheets".

l <- list(iris = iris, mtcars = mtcars, chickwts = chickwts, quakes = quakes)
openxlsx::write.xlsx(l, file = "inst/extdata/datasets.xlsx")

Non-tabular data and formatting: tidyxl is focused on importing awkward and non-tabular data from Excel. It also "exposes cell content, position and formatting in a tidy structure for further manipulation".

News

readxl 1.0.0

Sheet geometry

  • range is a new argument for reading a rectangular range, possibly open. (#314, #8)

  • n_max is a new argument that limits the number of data rows read. (#306, #281)

  • Empty cells, rows, columns (xlsx #248 and #240, xls #271): Cells with no content are no longer loaded, even if they appear in the file. Affects cells that have no data but that carry explicit formatting, detectable in Excel as seemingly empty cells with a format other than "General". Such cells may still exist in the returned tibble, with value NA, depending on the sheet geometry.

    • Eliminates a source of trailing rows (#203) and columns (#236, #162, #146) consisting entirely of NA.
    • Eliminates a subtle source of disagreement between user-provided column names and guessed column types (#169, #81).
    • Embedded empty columns are no longer automatically dropped, regardless of whether there is a column name. (#157, #261)
    • Worksheets that are completely empty or that contain only column names no longer error, but return a tibble with zero rows. (#222, #144, #65)
    • Improved handling of leading and embedded blank rows and explicit row skipping. (#224, #194, #178, #156, #101)
  • User-supplied col_names are processed relative to user-supplied col_types, if given. Specifically, col_names is considered valid if it has the same length as col_types, before or after removing skipped columns. (#81, #261)

Column types and coercion

  • "list" is a new accepted value for col_types. Loads data as a list of length-1 vectors, that are typed using the logic from col_types = NULL, but on a cell-by-cell basis (#262 @gergness).

  • "logical" is a new accepted value for col_types. When col_types = NULL, it is the guessed type for cells Excel advertises as Boolean. When a column has no data, it is now filled with logical NA. (#277, #270)

  • "guess" is a new accepted value for col_types. Allows the user to specify some column types, while allowing others to be guessed (#286)

  • A user-specified col_types of length one will be replicated to have length equal to the number of columns. (#127, #114, #261)

  • "blank" has been deprecated in favor of the more descriptive and readr-compatible "skip", which is now the preferred way to request that a column be skipped. (#260, #193, #261)

  • guess_max is a new argument that lets user adjust the number of rows used to guess column types. (#223, #257 @tklebel and @jennybc)

  • trim_ws is a new argument to remove leading and trailing whitespace. It defaults to TRUE. (#326, #211)

  • na can now hold multiple NA values, e.g., read_excel("missing-values.xls", na = c("NA", "1")). (#13, #56, @jmarshallnz)

  • Coercions and cell data:

    • Numeric data that appears in a date column is coerced to a date. Throws a warning. (#277, #266)
    • Dates that appear in a numeric column are converted to NA instead of their integer representation. Throws warning. (#277, #263)
    • "Number stored as text": when a text cell is found in a numeric column, read_excel() attempts to coerce the string to numeric and falls back to NA if unsuccessful. Throws warning. (#277, #217, #106)
    • Cells in error are treated as blank and are imported as NA (instead of the string "error"). (#277, #62)
    • BoolErr cells are now handled in xls. Suppresses message "Unknown type: 517". (#274, #259)
    • Dates that arise from a formula are treated as dates (vs. numeric) in xls. (#277)
    • Dates in .xlsx files saved with LibreOffice are now recognized as such. (#134, @zeehio)

Compatibility

Many 3rd party tools write xls and xlsx that comply with the spec, but that are quite different from files produced by Excel.

  • Namespace prefixes are now stripped from element names and attributes when parsing XML from xlsx. Workaround for the creative approach taken in some other s/w, coupled with rapidxml's lack of namespace support. (#295, #268, #202, #80)

  • Excel mixes 0- and 1-indexing in reported row and column dimensions for xls and libxls expects that. Other s/w may index from 0 for both, preventing libxls from reading the last column. Patched to restore access to those cells. (#273, #180, #152, #99)

  • More robust logic for sheet lookup in xlsx. Improves compatibility with xlsx written by a variety of tools and/or xlsx containing chartsheets. (#233, #104, #200, #168, #116, @jimhester and @jennybc)

  • The numFmtId attribute is no longer accessed when it does not exist (xlsx written by http://epplus.codeplex.com). (#191, #229)

  • Location is inferred for cells that do not declare their location (xlsx written by JMP). (#240, #163, #102)

Other

  • read_xls() and read_xlsx() are now exposed, such that files without an .xls or .xlsx extension can be read. (#85, @jirkalewandowski)

  • The Lotus 1-2-3 leap year bug is now accounted for, i.e. date-times prior to March 1, 1900 import correctly. Date-times on the non-existent leap day February 29, 1900 import as NA and throw a warning. (#264, #148, #292)

  • The tibble package is now imported (#175, @krlmlr) and tibble::repair_names() is used to prevent empty, NA, or duplicated column names. (#216, #208, #199 #182, #53, #247).

  • Default column names for xlsx now start with X__1 instead of X__0. (#98, @zeehio, @krlmlr)

  • Fix compilation warnings/failures (FreeBSD 10.3 #221, gcc 4.9.3 #124) and/or problems reading xls (CentOS 6.6 #189). (#244, #245, #246 @jeroen)

  • Unwanted printed output (e.g., DEFINEDNAME: 21 00 00 ...) is suppressed when reading xls that contains a defined range. (#82, #188, @PedramNavid)

readxl 0.1.1

  • Add support for correctly reading strings in .xlsx files containing escaped unicode characters (e.g. _x005F_). (#51, @jmarshallnz)

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("readxl")

1.1.0 by Jennifer Bryan, 3 months ago


http://readxl.tidyverse.org, https://github.com/tidyverse/readxl


Report a bug at https://github.com/tidyverse/readxl/issues


Browse source code at https://github.com/cran/readxl


Authors: Hadley Wickham [aut] (<https://orcid.org/0000-0003-4757-117X>), Jennifer Bryan [aut, cre] (<https://orcid.org/0000-0002-6983-2759>), RStudio [cph, fnd] (Copyright holder of all R code and all C/C++ code without explicit copyright attribution), Marcin Kalicinski [ctb, cph] (Author of included RapidXML code), Komarov Valery [ctb, cph] (Author of included libxls code), Christophe Leitienne [ctb, cph] (Author of included libxls code), Bob Colbert [ctb, cph] (Author of included libxls code), David Hoerl [ctb, cph] (Author of included libxls code), Evan Miller [ctb, cph] (Author of included libxls code)


Documentation:   PDF Manual  


GPL-3 license


Imports cellranger, Rcpp, tibble

Suggests covr, knitr, rmarkdown, rprojroot, testthat

Linking to Rcpp


Imported by BAYESDEF, BEACH, CONS, DCM, DLMtool, EdSurvey, GWSDAT, GerminaR, GetTDData, Luminescence, NPMOD, PNADcIBGE, QuantTools, RDML, RLumShiny, RcmdrMisc, Ricetl, SchemaOnRead, TR8, aire.zmvm, bikedata, breathtestcore, chillR, crosswalkr, dextergui, dpcR, driftR, elementR, esaps, evaluator, ggplotgui, googlePublicData, ie2misc, joinXL, jstor, memapp, photobiologyInOut, popprxl, radiant.data, readabs, readit, readtext, rio, shinyHeatmaply, sstModel, textreadr, tidyverse, visvow, xlutils3, zooaRchGUI.

Depended on by DataLoader.

Suggested by PCRedux, ProjectTemplate, RCzechia, Rcmdr, RcmdrPlugin.EZR, baytrends, blorr, cRegulome, ckanr, descriptr, dynBiplotGUI, ezec, heemod, inferr, manifestoR, olsrr, rattle, rfm, rotl, secr, syuzhet, tidyxl, unpivotr, unvotes, writexl.


See at CRAN