Import excel files into R. Supports '.xls' via the embedded 'libxls' C library < https://github.com/libxls/libxls> and '.xlsx' via the embedded 'RapidXML' C++ library < http://rapidxml.sourceforge.net>. Works on Windows, Mac and Linux without external dependencies.
The readxl package makes it easy to get data out of Excel and into R. Compared to many of the existing packages (e.g. gdata, xlsx, xlsReadWrite) readxl has no external dependencies, so it’s easy to install and use on all operating systems. It is designed to work with tabular data.
readxl supports both the legacy .xls
format and the modern xml-based
.xlsx
format. The libxls C
library is used to support .xls
, which abstracts away many of the
complexities of the underlying binary format. To parse .xlsx
, we use
the RapidXML C++ library.
The easiest way to install the latest released version from CRAN is to install the whole tidyverse.
install.packages("tidyverse")
NOTE: you will still need to load readxl explicitly, because it is not a
core tidyverse package loaded via library(tidyverse)
.
Alternatively, install just readxl from CRAN:
install.packages("readxl")
Or install the development version from GitHub:
# install.packages("devtools")devtools::install_github("tidyverse/readxl")
library(readxl)
readxl includes several example files, which we use throughout the
documentation. Use the helper readxl_example()
with no arguments to
list them or call it with an example filename to get the path.
readxl_example()#> [1] "clippy.xls" "clippy.xlsx" "datasets.xls" "datasets.xlsx"#> [5] "deaths.xls" "deaths.xlsx" "geometry.xls" "geometry.xlsx"#> [9] "type-me.xls" "type-me.xlsx"readxl_example("clippy.xls")#> [1] "/Users/jenny/resources/R/library/readxl/extdata/clippy.xls"
read_excel()
reads both xls and xlsx files and detects the format from
the extension.
xlsx_example <- readxl_example("datasets.xlsx")read_excel(xlsx_example)#> # A tibble: 150 x 5#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species#> <dbl> <dbl> <dbl> <dbl> <chr>#> 1 5.1 3.5 1.4 0.2 setosa#> 2 4.9 3 1.4 0.2 setosa#> 3 4.7 3.2 1.3 0.2 setosa#> # … with 147 more rowsxls_example <- readxl_example("datasets.xls")read_excel(xls_example)#> # A tibble: 150 x 5#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species#> <dbl> <dbl> <dbl> <dbl> <chr>#> 1 5.1 3.5 1.4 0.2 setosa#> 2 4.9 3 1.4 0.2 setosa#> 3 4.7 3.2 1.3 0.2 setosa#> # … with 147 more rows
List the sheet names with excel_sheets()
.
excel_sheets(xlsx_example)#> [1] "iris" "mtcars" "chickwts" "quakes"
Specify a worksheet by name or number.
read_excel(xlsx_example, sheet = "chickwts")#> # A tibble: 71 x 2#> weight feed#> <dbl> <chr>#> 1 179 horsebean#> 2 160 horsebean#> 3 136 horsebean#> # … with 68 more rowsread_excel(xls_example, sheet = 4)#> # A tibble: 1,000 x 5#> lat long depth mag stations#> <dbl> <dbl> <dbl> <dbl> <dbl>#> 1 -20.4 182. 562 4.8 41#> 2 -20.6 181. 650 4.2 15#> 3 -26 184. 42 5.4 43#> # … with 997 more rows
There are various ways to control which cells are read. You can even specify the sheet here, if providing an Excel-style cell range.
read_excel(xlsx_example, n_max = 3)#> # A tibble: 3 x 5#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species#> <dbl> <dbl> <dbl> <dbl> <chr>#> 1 5.1 3.5 1.4 0.2 setosa#> 2 4.9 3 1.4 0.2 setosa#> 3 4.7 3.2 1.3 0.2 setosaread_excel(xlsx_example, range = "C1:E4")#> # A tibble: 3 x 3#> Petal.Length Petal.Width Species#> <dbl> <dbl> <chr>#> 1 1.4 0.2 setosa#> 2 1.4 0.2 setosa#> 3 1.3 0.2 setosaread_excel(xlsx_example, range = cell_rows(1:4))#> # A tibble: 3 x 5#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species#> <dbl> <dbl> <dbl> <dbl> <chr>#> 1 5.1 3.5 1.4 0.2 setosa#> 2 4.9 3 1.4 0.2 setosa#> 3 4.7 3.2 1.3 0.2 setosaread_excel(xlsx_example, range = cell_cols("B:D"))#> # A tibble: 150 x 3#> Sepal.Width Petal.Length Petal.Width#> <dbl> <dbl> <dbl>#> 1 3.5 1.4 0.2#> 2 3 1.4 0.2#> 3 3.2 1.3 0.2#> # … with 147 more rowsread_excel(xlsx_example, range = "mtcars!B1:D5")#> # A tibble: 4 x 3#> cyl disp hp#> <dbl> <dbl> <dbl>#> 1 6 160 110#> 2 6 160 110#> 3 4 108 93#> # … with 1 more row
If NA
s are represented by something other than blank cells, set the
na
argument.
read_excel(xlsx_example, na = "setosa")#> # A tibble: 150 x 5#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species#> <dbl> <dbl> <dbl> <dbl> <chr>#> 1 5.1 3.5 1.4 0.2 <NA>#> 2 4.9 3 1.4 0.2 <NA>#> 3 4.7 3.2 1.3 0.2 <NA>#> # … with 147 more rows
If you are new to the tidyverse conventions for data import, you may want to consult the data import chapter in R for Data Science. readxl will become increasingly consistent with other packages, such as readr.
Broad topics are explained in these articles:
We also have some focused articles that address specific aggravations presented by the world’s spreadsheets:
No external dependency on, e.g., Java or Perl.
Re-encodes non-ASCII characters to UTF-8.
Loads datetimes into POSIXct columns. Both Windows (1900) and Mac (1904) date specifications are processed correctly.
Discovers the minimal data rectangle and returns that, by default.
User can exert more control with range
, skip
, and n_max
.
Column names and types are determined from the data in the sheet, by
default. User can also supply via col_names
and col_types
and
control name repair via .name_repair
.
Returns a
tibble, i.e. a
data frame with an additional tbl_df
class. Among other things,
this provide nicer printing.
Here are some other packages with functionality that is complementary to readxl and that also avoid a Java dependency.
Writing Excel files: The example files datasets.xlsx
and
datasets.xls
were created with the help of
openxlsx (and Excel).
openxlsx provides “a high level interface to writing, styling and
editing
worksheets”.
l <- list(iris = iris, mtcars = mtcars, chickwts = chickwts, quakes = quakes)openxlsx::write.xlsx(l, file = "inst/extdata/datasets.xlsx")
writexl is a new option in this space, first released on CRAN in August 2017. It’s a portable and lightweight way to export a data frame to xlsx, based on libxlsxwriter. It is much more minimalistic than openxlsx, but on simple examples, appears to be about twice as fast and to write smaller files.
Non-tabular data and formatting: tidyxl is focused on importing awkward and non-tabular data from Excel. It also “exposes cell content, position and formatting in a tidy structure for further manipulation”.
Please note that the readxl project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Pragmatic patch release to update some tests in advance of v2.1.0 of the tibble package. That release updates name repair: standard suffix becomes ...j
, instead of ..j
, partially motivated by user experience in readxl.
readxl embeds libxls v1.5.0. This is the first official release of libxls in several years, although readxl has been tracking the development version in the interim. The libxls project is now officially hosted at https://github.com/libxls/libxls. In particular, libxls v1.5.0 addresses these two CVEs:
readxl exposes the .name_repair
argument that is coming to version 2.0.0 of the tibble package. The readxl default is .name_repair = "unique"
, keeping with the readxl convention to ensure column names are neither missing nor duplicated.
.name_repair
is available, defaulting to .name_repair = "unique"
. Otherwise, the legacy function tibble::repair_names(prefix = "X", sep = "__")
is used, replicating the behaviour of readxl v1.1.0.
x
.""
, x
, x
..1
, x..2
, x..3
X__1
, x
, x__1
read_excel()
and friends gain a progress
argument that controls a progress spinner (#243, #538).
read_xls()
and read_xlsx()
pass the trim_ws
argument along (#514).
readxl has a new article on reading Excel files with multiple header rows (#486, #492 @apreshill).
xlsx files that do not have a "styles" part can now be read (#505, #506 @jt6)
All paths are passed through normalizePath()
(#498, #499, new behaviour for xlsx but not xls) and enc2native()
(#370).
readxl is now tested back to R >= 3.1.
Embedded libxls has been updated, using the source in https://github.com/evanmiller/libxls. readxl's DESCRIPTION now records the SHA associated to the embedded libxls in a Note
.
read_excel()
and excel_sheets()
associate a larger set of file extensions with xlsx and are better able to guess the format of a file with a nonstandard or missing extension. This is about deciding whether to treat a file as xls or xlsx. (#342, #411, #457)
excel_format()
is the newly-exported format-guessing function.format_from_ext()
is a low-level helper, also exported, that only consults file extension. In addition to the obvious interpretation of .xls
and .xlsx
, the extensions .xlsm
, .xltx
, and .xltm
are now associated with xlsx.format_from_signature()
is a low-level helper, also exported, that consults the file's signature (a.k.a. magic number). It's handy for files that lack an extension.Embedded libxls has been updated to address security vulnerabilitities identified in late 2017 (#441, #442).
xlsx structured as a "minimal conformant SpreadsheetML package" can be read. Most obvious feature of such sheets is the lack of an xl/
directory in the unzipped form. (xlsx, #435, #437)
Reading xls sheet with exactly 65,536 rows no longer enters an infinite loop. (xls, #373, #416, #432 @vkapartzianis)
Doubles, including datetimes, coerced to character from xls now have much higher precision, comparable to the xlsx behaviour. (xls, #430, #431)
Integer-y numbers larger than 2^31 are coerced properly to string (xls, #346)
Shared strings are only compared to NA strings after lookup, never on the basis of their index. (xlsx, #401)
Better checks and messaging around nonexistent files. (#392)
Add $(C_VISIBILITY)
to compiler flags to hide internal symbols from the dll. (#385 @jeroen)
Numeric data in a logical column now coerces properly to logical. (xlsx, #385 @nacnudus)
range
is a new argument for reading a rectangular range, possibly open. (#314, #8)
n_max
is a new argument that limits the number of data rows read. (#306, #281)
Empty cells, rows, columns (xlsx #248 and #240, xls #271): Cells with no content are no longer loaded, even if they appear in the file. Affects cells that have no data but that carry explicit formatting, detectable in Excel as seemingly empty cells with a format other than "General". Such cells may still exist in the returned tibble, with value NA
, depending on the sheet geometry.
NA
.User-supplied col_names
are processed relative to user-supplied col_types
, if given. Specifically, col_names
is considered valid if it has the same length as col_types
, before or after removing skipped columns. (#81, #261)
"list"
is a new accepted value for col_types
. Loads data as a list of length-1 vectors, that are typed using the logic from col_types = NULL
, but on a cell-by-cell basis (#262 @gergness).
"logical"
is a new accepted value for col_types
. When col_types = NULL
, it is the guessed type for cells Excel advertises as Boolean. When a column has no data, it is now filled with logical NA
. (#277, #270)
"guess"
is a new accepted value for col_types
. Allows the user to specify some column types, while allowing others to be guessed (#286)
A user-specified col_types
of length one will be replicated to have length equal to the number of columns. (#127, #114, #261)
"blank"
has been deprecated in favor of the more descriptive and readr-compatible "skip"
, which is now the preferred way to request that a column be skipped. (#260, #193, #261)
guess_max
is a new argument that lets user adjust the number of rows used to guess column types. (#223, #257 @tklebel and @jennybc)
trim_ws
is a new argument to remove leading and trailing whitespace. It defaults to TRUE
. (#326, #211)
na
can now hold multiple NA values, e.g., read_excel("missing-values.xls", na = c("NA", "1"))
. (#13, #56, @jmarshallnz)
Coercions and cell data:
NA
instead of their integer representation. Throws warning. (#277, #263)read_excel()
attempts to coerce the string to numeric and falls back to NA
if unsuccessful. Throws warning. (#277, #217, #106)NA
(instead of the string "error"
). (#277, #62)"Unknown type: 517"
. (#274, #259)Many 3rd party tools write xls and xlsx that comply with the spec, but that are quite different from files produced by Excel.
Namespace prefixes are now stripped from element names and attributes when parsing XML from xlsx. Workaround for the creative approach taken in some other s/w, coupled with rapidxml's lack of namespace support. (#295, #268, #202, #80)
Excel mixes 0- and 1-indexing in reported row and column dimensions for xls and libxls expects that. Other s/w may index from 0 for both, preventing libxls from reading the last column. Patched to restore access to those cells. (#273, #180, #152, #99)
More robust logic for sheet lookup in xlsx. Improves compatibility with xlsx written by a variety of tools and/or xlsx containing chartsheets. (#233, #104, #200, #168, #116, @jimhester and @jennybc)
The numFmtId
attribute is no longer accessed when it does not exist (xlsx written by http://epplus.codeplex.com). (#191, #229)
Location is inferred for cells that do not declare their location (xlsx written by JMP). (#240, #163, #102)
read_xls()
and read_xlsx()
are now exposed, such that files without an .xls
or .xlsx
extension can be read. (#85, @jirkalewandowski)
The Lotus 1-2-3 leap year bug is now accounted for, i.e. date-times prior to March 1, 1900 import correctly. Date-times on the non-existent leap day February 29, 1900 import as NA and throw a warning. (#264, #148, #292)
The tibble package is now imported (#175, @krlmlr) and tibble::repair_names()
is used to prevent empty, NA
, or duplicated column names. (#216, #208, #199 #182, #53, #247).
Default column names for xlsx now start with X__1 instead of X__0. (#98, @zeehio, @krlmlr)
Fix compilation warnings/failures (FreeBSD 10.3 #221, gcc 4.9.3 #124) and/or problems reading xls (CentOS 6.6 #189). (#244, #245, #246 @jeroen)
Unwanted printed output (e.g., DEFINEDNAME: 21 00 00 ...
) is suppressed when reading xls that contains a defined range. (#82, #188, @PedramNavid)
_x005F_
). (#51, @jmarshallnz)