Imports non-tabular from Excel files into R. Exposes cell content, position and formatting in a tidy structure for further manipulation. Tokenizes Excel formulas. Supports '.xlsx' and '.xlsm' via the embedded 'RapidXML' C++ library < http://rapidxml.sourceforge.net>. Does not support '.xlsb' or '.xls'.
tidyxl imports non-tabular data from Excel files into R. It exposes cell content, position, formatting and comments in a tidy structure for further manipulation, especially by the unpivotr package. It supports the xml-based file formats ‘.xlsx’ and ‘.xlsm’ via the embedded RapidXML C++ library. It does not support the binary file formats ‘.xlsb’ or ‘.xls’.
It also provides a function
xlex() for tokenizing formulas. See the
vignette for details. It is useful for detecting
‘spreadsheet smells’ (poor practice such as embedding constants in
formulas, or using deep levels of nesting), and for understanding the
dependency structures within spreadsheets.
Tidyxl represents each cell as one row in a dataframe. Like this.
What can you do with tidy cells? The best places to start are:
Otherwise see examples below.
The version 1.0.0 release has some breaking changes. See
details. The previous version can be installed as
devtools::install_version("tidyxl", version = "0.2.3", repos = "")
The package includes a spreadsheet, ‘titanic.xlsx’, which contains the following pivot table:
ftable(Titanic, row.vars = 1:2)#> Age Child Adult#> Survived No Yes No Yes#> Class Sex#> 1st Male 0 5 118 57#> Female 0 1 4 140#> 2nd Male 0 11 154 14#> Female 0 13 13 80#> 3rd Male 35 13 387 75#> Female 17 14 89 76#> Crew Male 0 0 670 192#> Female 0 0 3 20
The multi-row column headers make this difficult to import. A popular package for importing spreadsheets coerces the pivot table into a dataframe. It treats the second header row as though it were observations.
titanic <- system.file("extdata/titanic.xlsx", package = "tidyxl")readxl::read_excel(titanic)#> New names:#> * `` -> `..1`#> * `` -> `..2`#> * `` -> `..5`#> * `` -> `..7`#> # A tibble: 10 x 7#> ..1 ..2 Age Child ..5 Adult ..7#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>#> 1 <NA> <NA> Survived No Yes No Yes#> 2 Class Sex <NA> <NA> <NA> <NA> <NA>#> 3 1st Male <NA> 0 5 118 57#> 4 <NA> Female <NA> 0 1 4 140#> 5 2nd Male <NA> 0 11 154 14#> 6 <NA> Female <NA> 0 13 13 80#> 7 3rd Male <NA> 35 13 387 75#> 8 <NA> Female <NA> 17 14 89 76#> 9 Crew Male <NA> 0 0 670 192#> 10 <NA> Female <NA> 0 0 3 20
tidyxl doesn’t coerce the pivot table into a data frame. Instead, it represents each cell in its own row, where it describes the cell’s address, value and other properties.
library(tidyxl)x <- xlsx_cells(titanic)dplyr::glimpse(x)#> Observations: 60#> Variables: 21#> $ sheet <chr> "Sheet1", "Sheet1", "Sheet1", "Sheet1", "She…#> $ address <chr> "C1", "D1", "E1", "F1", "G1", "C2", "D2", "E…#> $ row <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4,…#> $ col <int> 3, 4, 5, 6, 7, 3, 4, 5, 6, 7, 1, 2, 1, 2, 4,…#> $ is_blank <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALS…#> $ data_type <chr> "character", "character", "blank", "characte…#> $ error <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ logical <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ numeric <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ date <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…#> $ character <chr> "Age", "Child", NA, "Adult", NA, "Survived",…#> $ character_formatted <list> [<# A tibble: 1 x 14, character bold ita…#> $ formula <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ is_array <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…#> $ formula_ref <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ formula_group <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ comment <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …#> $ height <dbl> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, …#> $ width <dbl> 8.38, 8.38, 8.38, 8.38, 8.38, 8.38, 8.38, 8.…#> $ style_format <chr> "Normal", "Normal", "Normal", "Normal", "Nor…#> $ local_format_id <int> 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 1,…
In this structure, the cells can be found by filtering.
x[x$data_type == "character", c("address", "character")]#> # A tibble: 22 x 2#> address character#> <chr> <chr>#> 1 C1 Age#> 2 D1 Child#> 3 F1 Adult#> 4 C2 Survived#> 5 D2 No#> 6 E2 Yes#> 7 F2 No#> 8 G2 Yes#> 9 A3 Class#> 10 B3 Sex#> # … with 12 more rowsx[x$row == 4, c("address", "character", "numeric")]#> # A tibble: 6 x 3#> address character numeric#> <chr> <chr> <dbl>#> 1 A4 1st NA#> 2 B4 Male NA#> 3 D4 <NA> 0#> 4 E4 <NA> 5#> 5 F4 <NA> 118#> 6 G4 <NA> 57
Specific sheets can be requested using
xlsx_cells(file, sheet), and
the names of all sheets in a file are given by
The original spreadsheet has formatting applied to the cells. This can
also be retrieved using tidyxl,
Formatting is available by using the columns
style_format as indexes into a separate list-of-lists structure.
‘Local’ formatting is the most common kind, applied to individual
cells. ‘Style’ formatting is usually applied to blocks of cells, and
defines several formats at once. Here is a screenshot of the styles
buttons in Excel.
Formatting can be looked up as follows.
# Boldformats <- xlsx_formats(titanic)formats$local$font$bold#>  FALSE TRUE FALSE FALSEx[x$local_format_id %in% which(formats$local$font$bold),c("address", "character")]#> # A tibble: 4 x 2#> address character#> <chr> <chr>#> 1 C1 Age#> 2 C2 Survived#> 3 A3 Class#> 4 B3 Sex# Yellow fillformats$local$fill$patternFill$fgColor$rgb#>  NA NA NA "FFFFFF00"x[x$local_format_id %in%which(formats$local$fill$patternFill$fgColor$rgb == "FFFFFF00"),c("address", "numeric")]#> # A tibble: 2 x 2#> address numeric#> <chr> <dbl>#> 1 F11 3#> 2 G11 20# Styles by nameformats$style$font$name["Normal"]#> Normal#> "Calibri"head(x[x$style_format == "Normal", c("address", "character")])#> # A tibble: 6 x 2#> address character#> <chr> <chr>#> 1 C1 Age#> 2 D1 Child#> 3 E1 <NA>#> 4 F1 Adult#> 5 G1 <NA>#> 6 C2 Survived# In-cell formatting is available in the `character_formatted` column as a data# frame, one row per substring.examples <- system.file("/extdata/examples.xlsx", package = "tidyxl")xlsx_cells(examples)$character_formatted#> []#> # A tibble: 16 x 14#> character bold italic underline strike vertAlign size color_rgb#> <chr> <lgl> <lgl> <chr> <lgl> <chr> <dbl> <chr>#> 1 in-cell FALSE FALSE <NA> FALSE <NA> 0 <NA>#> 2 bold TRUE FALSE <NA> FALSE <NA> 0 FF000000#> 3 italic FALSE TRUE <NA> FALSE <NA> 0 FF000000#> 4 underline FALSE FALSE single FALSE <NA> 0 FF000000#> 5 underlin… FALSE FALSE double FALSE <NA> 0 FF000000#> 6 singleac… FALSE FALSE singleAc… FALSE <NA> 0 FF000000#> 7 doubleac… FALSE FALSE doubleAc… FALSE <NA> 0 FF000000#> 8 striketh… FALSE FALSE <NA> TRUE <NA> 0 FF000000#> 9 subscript FALSE FALSE <NA> FALSE subscript 0 FF000000#> 10 superscr… FALSE FALSE <NA> FALSE superscr… 0 FF000000#> 11 red FALSE FALSE <NA> FALSE <NA> 0 FFFF0000#> 12 theme FALSE FALSE <NA> FALSE <NA> 0 FFC0504D#> 13 tint FALSE FALSE <NA> FALSE <NA> 0 FFC0504D#> 14 size FALSE FALSE <NA> FALSE <NA> 0 FFFF0000#> 15 arial FALSE FALSE <NA> FALSE <NA> 0 FFFF0000#> 16 "UTF8Sté… FALSE FALSE <NA> FALSE <NA> 0 FFFF0000#> # … with 6 more variables: color_theme <int>, color_indexed <int>,#> # color_tint <dbl>, font <chr>, family <int>, scheme <chr>
To see all the available kinds of formats, use
Comments are available alongside cell values.
x[!is.na(x$comment), c("address", "comment")]#> # A tibble: 1 x 2#> address comment#> <chr> <chr>#> 1 G11 All women in the crew worked in the victualling department.
Formulas are available, but with a few quirks.
options(width = 120)y <- xlsx_cells(examples, "Sheet1")y[!is.na(y$formula),c("address", "formula", "is_array", "formula_ref", "formula_group","error", "logical", "numeric", "date", "character")]#> # A tibble: 32 x 10#> address formula is_array formula_ref formula_group error logical numeric date character#> <chr> <chr> <lgl> <chr> <int> <chr> <lgl> <dbl> <dttm> <chr>#> 1 A1 1/0 FALSE <NA> NA #DIV/0! NA NA NA <NA>#> 2 A14 1=1 FALSE <NA> NA <NA> TRUE NA NA <NA>#> 3 A15 A4+1 FALSE <NA> NA <NA> NA 1338 NA <NA>#> 4 A16 DATE(2017,1,18) FALSE <NA> NA <NA> NA NA 2017-01-18 00:00:00 <NA>#> 5 A17 "\"Hello, World!… FALSE <NA> NA <NA> NA NA NA Hello, Worl…#> 6 A19 $A$18+1 FALSE <NA> NA <NA> NA 2 NA <NA>#> 7 B19 A18+2 FALSE <NA> NA <NA> NA 3 NA <NA>#> 8 A20 $A$18+1 FALSE A20:A21 0 <NA> NA 2 NA <NA>#> 9 B20 A19+2 FALSE B20:B21 1 <NA> NA 4 NA <NA>#> 10 A21 $A$18+1 FALSE <NA> 0 <NA> NA 2 NA <NA>#> # … with 22 more rows
The top five cells show that the results of formulas are available as
usual in the columns
A21 illustrate how formulas are normalised before
being written to file, using the
columns. When there is a group of cells whose formulas only differ by
cell reference (e.g. “=A1+1”, “=A2+1”, “=A3+1”, etc.), only one formula
in each group is written to the file, so
tidyxl infers what the
formulas in the other cells in the group must be, from their relative
There are two kinds of array formulas: ones that compute over arrays, and ones whose output is an array (of cells).
Both kinds are distinguished in spreadsheet programs by curly braces,
tidyxl, the curly braces are
ommitted (as they are from the file itself), and instead the
column has the value
The first kind (those that compute over arrays) is illustrated by cell
The second kind (those whose value is spread across an array of cells)
is illustrated by cells
A24. The formula is only given in
the top-left cell (
A23), which is also the only cell that describes
the range of cells containing the result, in the
The results themselves are stored in all relevant cells (
A24). Unlike shared formulas, there is no
formula_group to associate
the cells of an array formula’s result. If you need to do identify those
cells, use the cellranger
package and the
A25 contains a formula that refers to another file. The
an index into a table of files. The roadmap for
tidyxl includes de-referencing
xlex() separates formulas into tokens of different types,
and gives their depth within a nested formula. Its name is a bad pun on
‘Excel’ and ‘lexer’. Try the online
demo, or install the more
experimental lexl package to run
It is useful for detecting spreadsheet smells, which are poor practices in spreadsheet design, such as deep nests of functions, or embedding constants in formulas.
x <- xlex("MIN(3,MAX(2,A1))")x#> root#> ¦-- MIN function#> °-- ( fun_open#> ¦-- 3 number#> ¦-- , separator#> ¦-- MAX function#> °-- ( fun_open#> ¦-- 2 number#> ¦-- , separator#> °-- A1 ref#> °-- ) fun_close#> °-- ) fun_close
See the vignette for more examples and details.
Names are imported with
xlex_names(). AKA ‘named formulas’ and
‘defined names’, these are usually used to name particular cells or
ranges, making formulas that refer to them more readable. Ones that
are ranges are identifed by the
is_range column (using
is_range()), making it easier to match the names to the cells returned
xlsx_cells() – e.g. by using the
When the scope of the name is within a particular sheet, rather than global, the sheet name is given.
xlsx_names(examples)#> rId sheet name formula comment hidden is_range#> 1 1 Sheet1 named_local_formula MAX(Sheet1!$A$129:$A$130)+1 <NA> FALSE FALSE#> 2 4 E09904.2 sheet_beyond_chart E09904.2!$A$1,E09904.2!$C$1 <NA> FALSE TRUE#> 3 NA <NA> intersection Sheet1!$B:$B Sheet1!$8:$8 <NA> FALSE TRUE#> 4 NA <NA> named_global_formula Sheet1!$A$129-1 <NA> FALSE FALSE#> 5 NA <NA> named_range Sheet1!$A$129 My comment FALSE TRUE
Data validation rules are imported with
xlsx_validation(). These rules
control what values may be entered into a cell, and are often used to
create a drop-down list in a cell. Read the
vignette for details.
xlsx_validation(examples)#> # A tibble: 15 x 14#> sheet ref type operator formula1 formula2 allow_blank show_input_mess… prompt_title prompt_body show_error_mess…#> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <chr> <lgl>#> 1 Shee… A106 whole between 0 9 TRUE TRUE message tit… message bo… TRUE#> 2 Shee… A108 list <NA> $B$108 <NA> TRUE TRUE <NA> <NA> TRUE#> 3 Shee… A110 date between 2017-01… 2017-01… TRUE TRUE <NA> <NA> TRUE#> 4 Shee… A111 time between 00:00:00 09:00:00 TRUE TRUE <NA> <NA> TRUE#> 5 Shee… A112 text… between 0 9 TRUE TRUE <NA> <NA> TRUE#> 6 Shee… A114 whole notBetw… 0 9 TRUE TRUE <NA> <NA> TRUE#> 7 Shee… A115… whole equal 0 <NA> TRUE TRUE <NA> <NA> TRUE#> 8 Shee… A116 whole notEqual 0 <NA> TRUE TRUE <NA> <NA> TRUE#> 9 Shee… A117 whole greater… 0 <NA> TRUE TRUE <NA> <NA> TRUE#> 10 Shee… A119 whole greater… 0 <NA> TRUE TRUE <NA> <NA> TRUE#> 11 Shee… A120 whole lessTha… 0 <NA> TRUE TRUE <NA> <NA> TRUE#> 12 Shee… A118 whole lessThan 0 <NA> TRUE TRUE <NA> <NA> TRUE#> 13 Shee… A107 deci… notBetw… 0 9 FALSE FALSE <NA> <NA> FALSE#> 14 Shee… A113 cust… <NA> A113<=L… <NA> TRUE TRUE <NA> <NA> TRUE#> 15 Shee… A109 list <NA> $B$108 <NA> TRUE TRUE <NA> <NA> TRUE#> # … with 3 more variables: error_title <chr>, error_body <chr>, error_symbol <chr>
Information in in many spreadsheets cannot be easily imported into R. Why?
Most R packages that import spreadsheets have difficulty unless the layout of the spreadsheet conforms to a strict definition of a ‘table’, e.g.:
These rules are designed to eliminate ambiguity in the interpretation of the information. But most spreadsheeting software relaxes these rules in a trade of ambiguity for expression via other media:
Humans can usually resolve the ambiguities with contextual knowledge, but computers are limited by their ignorance. Programmers are hampered by:
Information is lost when software discards it in order to force the data into tabular form. Sometimes date formatting is retained, but mostly formatting is lost, and position has to be inferred again.
tidyxl addresses the programmer’s problems by not discarding information. It imports the content, position and formatting of cells, leaving it up to the user to associate the different forms of information, and to re-encode them in tabular form without loss. The unpivotr package has been developed to assist with that step.
tidyxl was originally derived from readxl and still contains some of the same code, hence it inherits the GPL-3 licence. readxl is intended for importing tabular data with a single row of column headers, whereas tidyxl is more general, and less magic.
The rsheets project of several R packages is in the early stages of importing spreadsheet information from Excel and Google Sheets into R, manipulating it, and potentially parsing and processing formulas and writing out to spreadsheet files. In particular, jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms.
tidyxl differs from rsheets in scope (tidyxl will never import charts, for example), and implementation (tidyxl is implemented mainly in C++ and is quite fast, only a little slower than readxl). unpivotr is a package related to tidyxl that provides tools for unpivoting complex and non-tabular data layouts using I not AI (intelligence, not artificial intelligence). In this way it corresponds to jailbreaker, but with a different philosophy.
For bugs and/or issues, create a new issue on GitHub For other questions or comments, please subscribe to the tidyxl-devel mailing list. You must be a member to post messages, but anyone can read the archived discussions.
include_blank_cells = FALSEhad a bug that returned blank cells as an empty row in the
_as dates when the underscore is followed by a date-ish character like
include_blank_cells = FALSEin
maybe_xlsx()is provided for checking whether a file might be in the xlsx format. It is impossible to be sure from the magic number alone, because the magic numbers are either common to all zip files, or common to other Microsoft Office files (e.g. .doc, .ppt).
tidy_xlsx(), which has been deprecated.
xlsx_cells()returns a single data frame of all the cells in scope (the whole workbook, or chosen sheets), rather than a list of separate data frames for each sheet.
xlsx_formats()performs orders of magnitude faster.
xlsx_validation()imports validation rules from cells that restrict data input, such as cells that require a selection from a drop-down list. See the vignette
vignette("data-validation-rules", package = "tidyxl").
xlsx_names()imports defined names (aka named ranges/formulas), which can be used to filter for particular ranges of cells by name. Use
is_range()to filter for ones that are named ranges, and then read joining rules to cells for how to join cell ranges to cell addresses. This will become easier in a future release.
is_range()checks whether a formula is simply ranges of cells.
xlex()tokenises formulas. This is useful for detecting spreadsheet smells like embedded constants and deep nesting. There is a demo Shiny app, and a vignette
vignette("smells", package = "tidyxl"). A vector of Excel function names
excel_functionscan be used to separated built-in functions from custom functions. More experimental features will be implemented in the off-CRAN package lexl before becoming part of tidyxl.
xlsx_cells()$character_formattedis a new column for the in-cell formatting of text (#5). This is for when different parts of text in a single cell have been formatted differently from one another.
is_date_format()checks whether a number format string is a date format. This is useful if a cell formula contains a number formatting string (e.g.
TEXT(45678,"yyyy")), and you need to know that the constant 45678 is a date in order to recover it at full resolution (rather than parsing the character output "2025" as a year).
xlsx_color_theme()and it's British alias
xlsx_colour_theme()returns the theme colour palette used in a file. This is useful to monitor use of a corporate standard theme.
xlsx_color_standardand it's British alias
xlsx_colour_standardare data frames of the standard Excel palette (
B2(for more details see issue #7).
tidy_xlsx()has been deprecated in favour of
xlsx_cells(), which returns a data frame of all the cells in the workbook (or in the requested sheets), and
xlsx_formats(), which returns a lookup list of cell formats.
tidy_xlsx()and one of it's replacments
contenthas been replaced by
is_blank, a logical value indicating whether the cell contains data. Please replace
!is_blankto filter out blank cells (ones with formatting but no value).
formula_typehas been replaced by
is_array, a logical value indicating whether the cell's formula is an array formula or not. In Excel array formulas are represented visually by being surrounded by curly braces
tidy_xlsx(), theme colours are given by name rather than by number, e.g.
"[Cyan]0%") are no longer mis-detect as dates (#21).
is_date_format()tests whether a number format is a date format.
xlsx_formats()is now thoroughly tested, and several relatively minor bugs fixed. For example,
"none"when a pattern fill has not been set, and escape-backslashes are consistently omitted from numFmts.
"1"to support files created by the
"Normal") instead of by index integer. All the vectors under
x$formats$styleare named according to the style names.
x$data$sheet$style_format_idhas been renamed to
x$data$sheet$style_formatand its type changed from integer (index into style formats) to character (still an index, but looking up the named vectors by name). There are examples in the README and vignette.
NEWS.mdfile to track changes to the package.