The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and isolate duplicate records. Other janitor functions nicely format the tabulation results. These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel. This package follows the principles of the "tidyverse" and works well with the pipe function %>%. janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
- The New York Times, 2014*
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
The main janitor functions:
The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
You can install:
the most recent officially-released version from CRAN with
the latest development version from GitHub with
A full description of each function, organized by topic, can be found in
janitor’s catalog of functions
you will find functions not mentioned in this README, like
compare_df_cols() which provides a summary of differences in column
names and types when given a set of data.frames.
Below are quick examples of how janitor tools are commonly used.
Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx:
Here’s that data after being read in to R:
library(pacman) # for loading packagesp_load(readxl, janitor, dplyr, here)roster_raw <- read_excel(here("dirty_data.xlsx")) # available atglimpse(roster_raw)#> Observations: 13#> Variables: 11#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...#> $ Certification__1 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...#> $ Certification__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.
Clean it with janitor functions:
roster <- roster_raw %>%clean_names() %>%remove_empty(c("rows", "cols")) %>%mutate(hire_date = excel_numeric_to_date(hire_date),cert = coalesce(certification, certification_1)) %>% # from dplyrselect(-certification, -certification_1) # drop unwanted columnsroster#> # A tibble: 12 x 8#> first_name last_name employee_status subject hire_date percent_allocated full_time cert#> <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr>#> 1 Jason Bourne Teacher PE 2008-08-30 0.75 Yes Physical ed#> 2 Jason Bourne Teacher Drafting 2008-08-30 0.25 Yes Physical ed#> 3 Alicia Keys Teacher Music 2001-08-15 1 Yes Instr. music#> 4 Ada Lovelace Teacher <NA> 1975-05-01 1 Yes PENDING#> 5 Desus Nice Administration Dean 2013-06-06 1 Yes PENDING#> 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.5 Yes Science 6-12#> 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.5 Yes Science 6-12#> 8 James Joyce Teacher English 1990-05-01 0.5 No English 6-12#> 9 Hedy Lamarr Teacher Science 1976-06-08 0.5 No PENDING#> 10 Carlos Boozer Coach Basketball 2015-08-05 NA No Physical ed#> 11 Young Boozer Coach <NA> 1995-01-01 NA No Political sci.#> 12 Micheal Larsen Teacher English 2009-09-15 0.8 No Vocal music
The core janitor cleaning function is
clean_names() - call it whenever
you load data into R.
get_dupes() to identify and examine duplicate records during data
cleaning. Let’s see if any teachers are listed more than once:
roster %>% get_dupes(first_name, last_name)#> # A tibble: 4 x 9#> first_name last_name dupe_count employee_status subject hire_date percent_allocat~ full_time cert#> <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr> <chr>#> 1 Chien-Shiung Wu 2 Teacher Physics 1930-03-20 0.5 Yes Science 6~#> 2 Chien-Shiung Wu 2 Teacher Chemistry 1930-03-20 0.5 Yes Science 6~#> 3 Jason Bourne 2 Teacher PE 2008-08-30 0.75 Yes Physical ~#> 4 Jason Bourne 2 Teacher Drafting 2008-08-30 0.25 Yes Physical ~
Yes, some teachers appear twice. We ought to address this before counting employees.
A variable (or combinations of two or three variables) can be tabulated
tabyl(). The resulting data.frame can be tweaked and formatted
with the suite of
adorn_ functions for quick analysis and printing of
pretty results in a report.
adorn_ functions can be helpful with
tabyl can be called two ways:
roster %>% tabyl(subject, employee_status).
%>%pipe; this allows
tabylto be used in an analysis pipeline
table(), but pipe-able, data.frame-based, and fully featured.
roster %>%tabyl(subject)#> subject n percent valid_percent#> Basketball 1 0.08333333 0.1#> Chemistry 1 0.08333333 0.1#> Dean 1 0.08333333 0.1#> Drafting 1 0.08333333 0.1#> English 2 0.16666667 0.2#> Music 1 0.08333333 0.1#> PE 1 0.08333333 0.1#> Physics 1 0.08333333 0.1#> Science 1 0.08333333 0.1#> <NA> 2 0.16666667 NA
roster %>%filter(hire_date > as.Date("1950-01-01")) %>%tabyl(employee_status, full_time)#> employee_status No Yes#> Administration 0 1#> Coach 2 0#> Teacher 3 4
roster %>%tabyl(full_time, subject, employee_status, show_missing_levels = FALSE)#> $Administration#> full_time Dean#> Yes 1#>#> $Coach#> full_time Basketball NA_#> No 1 1#>#> $Teacher#> full_time Chemistry Drafting English Music PE Physics Science NA_#> No 0 0 2 0 0 0 1 0#> Yes 1 1 0 1 1 1 0 1
adorn_ functions dress up the results of these tabulation calls
for fast, basic reporting. Here are some of the functions that augment a
summary table for reporting:
roster %>%tabyl(employee_status, full_time) %>%adorn_totals("row") %>%adorn_percentages("row") %>%adorn_pct_formatting() %>%adorn_ns() %>%adorn_title("combined")#> employee_status/full_time No Yes#> Administration 0.0% (0) 100.0% (1)#> Coach 100.0% (2) 0.0% (0)#> Teacher 33.3% (3) 66.7% (6)#> Total 41.7% (5) 58.3% (7)
Pipe that right into
knitr::kable() in your RMarkdown report.
These modular adornments can be layered to reduce R’s deficit against Excel and SPSS when it comes to quick, informative counts.
You are welcome to:
make_clean_names()takes a character vector and returns the cleaned text, with the same functionality as the existing
clean_names(), which runs on a data.frame, manipulating its names. (#197, thanks @tazinho and everyone who contributed to the discussion).
This function can be supplied as a value for the
.name_repair argument of
as_tibble() in the
tibble package. For example:
as_tibble(iris, .name_repair = make_clean_names).
The new function
compare_df_cols() compares the names and classes of columns in a set of supplied data.frames or tibbles, reporting on the specific columns that are or are not similar. This is for the common use case where a set of data files should all have the same specifications but, in practice, may not. A companion function
compare_df_cols_same() gives a
TRUE/FALSE result indicating if the columns are the same (and therefore bindable, though FALSE is not definitive that binding will fail).
describe_class()is exported for developers who wish to extend it so that the
compare_df_functions treat their custom classes appropriately.
This feature (#50) took almost 3 years from conception to implementation. Major thanks to @billdenney for making it happen!
A new function
round_to_fraction() allows rounding to a fraction with specified denominator, e.g., to the nearest 1/7 (#235, thanks to @billdenney for suggesting & implementing).
janitor::fisher.test() to enable running these statistical tests from the base
stats package on two-way
tabyl objects. While the package loading message says the base functions are masked, the base tests still run on
table objects (#255, thanks @juba for implementing).
remove_empty() now has a companion function
remove_constant() which removes columns containing only a single unique value, optionally ignoring
NA (#222, thanks to @billdenney for suggesting & implementing).
excel_numeric_to_date() now returns a POSIXct object and includes a time zone. (#225, thanks to @billdenney for the feature.)
clean_names() can now be called on a simple features object from the
sf package. (#247, thanks to @JosiahParry for suggesting & implementing.)
adorn_totals() gains an argument
"name" that allows the user to specify a value other than "Total" to appear as the name of the added row and/or column (#263). Thanks to @StephieLaPugh for suggesting and @daniel-barnett for implementing.
remove_constant() now work with matrices (returning a matrix). (#215) Thanks to @jsta for reporting and @billdenney for patching.
If the third variable in a three-way tabyl is a factor, the resulting list is sorted in order of its levels (#250). Empty factor levels in the 3rd variable are still omitted regardless of the value of
excel_numeric_to_date() no longer gives an overflow error for integer input (for dates since 1968). (#241) Thanks to @hideaki for reporting and @billdenney for patching.
make_clean_names() now support 'none' as a case option, passed through to
snakecase::to_any_case(). (#269) Thanks to @andrewbarros for reporting and patching.
Patches a bug introduced in version 1.1.0 where
excel_numeric_to_date() would fail if given an input vector containing an
NAcorrectly, in version 1.1.0 the function would error if any values of the input vector were
NA. (#220). Thanks @emilelatour for reporting and @billdenney for patching.
This release was requested by CRAN to address some minor package dependency issues. It also contains several updates and additions described below.
The new function
row_to_names() handles the case where a dirty data file is read in with its names stored as a row of the data.frame, rather than in the names. This function sets the names of the data.frame to this row and optionally cleans up the rows above and including where the names were stored. Thanks to @billdenney for writing this feature.
excel_numeric_to_date() can now convert fractions of a day to time, e.g.,
excel_numeric_to_date(43001.01, include_time = TRUE) returns the POSIXlt value
"2017-09-23 00:14:24". Thanks to @billdenney.
As part of
excel_numeric_to_date() now handling times, if a Date-only result is requested (the default behavior of
include_time = FALSE), any fractional part of the date is now removed. The printed date itself is identical, but the internal representation of this object now contains only the integer part of the date. For example, while under both the old and new versions of this function the call
excel_numeric_to_date_old(42001.1) would return the Date object
as.numeric on this Date result would previously return
16432.1, while now it returns
This an improved behavior, as now
excel_numeric_to_date(42001.1, include_time = FALSE) == as.Date("2014-12-28") returns TRUE, while previously it would appear to be equivalent from the printed value but this comparison would return FALSE.
A stable version 1.0.0, with a new
tabyl API and with breaking changes to the output of
This builds on the original functionality of janitor, with similar-but-improved tools and significantly-changed implementation.
tabyl() is now a single function that can count combinations of one, two, or three variables, ala base R's
table(). The resulting
tabyl data.frames can be manipulated and formatted using a family of
adorn_ functions. See the tabyls vignette for more.
The now-redundant legacy functions
adorn_crosstab() have been deprecated, but remain in the package for now. Existing code that relies on the version of
tabyl present in janitor versions <= 0.3.1 will break if the
sort argument was used, as that argument no longer exists in
clean_names() now detects and preserves camelCase inputs, allows multiple options for case outputs of the cleaned names, and preserves whether there's space between letters and numbers. It also transliterates accented letters and turns
These changes may cause old code to break. E.g., a raw column name
variableName would now be converted to
VariableName, etc. depending on your preference), where previously it would have been converted to
To minimize this inconvenience, there's a quick fix for compatibility: you can find-and-replace to insert the argument
case = "old_janitor", preserving the old behavior of
clean_names() as of janitor version 0.3.1 (and thus not have to redo your scripts beyond that.)
No further changes are planned to
clean_names() and its results should be stable from version 1.0.0 onward.
clean_names() transliterates accented letters, e.g.,
cauoe (#120). Thanks to @fernandovmacedo.
clean_names() offers multiple options for variable name styling. In addition to
snake_case output you can select
ALL_CAPS and others. (#131).
clean_names(). And thanks to @maelle for proposing this feature.
Deprecated the functions
remove_empty_cols(), which are replaced by the single function
remove_empty()prints a message if no value is supplied for the
whichargument; to suppress this, supply a value to
which, even if it's the default
adorn_title() function adds the name of the 2nd
tabyl variable (i.e., the name of the column variable). This un-tidies the data.frame but makes the result clearer to readers (#77)
round_half_up()is now exported for public use. It's an exact implementation of http://stackoverflow.com/questions/12688717/round-up-from-5-in-r/12688836#12688836, written by @mrdwab.
tabylobjects now print with row numbers suppressed
clean_names()now retains the character
"number"in the resulting names
adorn_totals("row")handles quirky variable names in 1st column (#118)
get_dupes()returns the correct result when a variable in the input data.frame is already called
This is a bug-fix release with no new functionality or changes. It fixes a bug where
adorn_crosstab() failed if the
tibble package was version > 1.4.
Major changes to janitor are currently in development on GitHub and will be released soon. This is not that next big release.
The primary purpose of this release is to maintain accuracy given breaking changes to the dplyr package, upon which janitor is built, in dplyr version >0.6.0. This update also contains a number of minor improvements.
Critical: if you update the package
dplyr to version >0.6.0, you must update janitor to version 0.3.0 to ensure accurate results from janitor's
tabyl() function. This is due to a change in the behavior of dplyr's
_join functions (discussed in #111).
janitor 0.3.0 is compatible with this new version of dplyr as well as old versions of dplyr back to 0.5.0. That is, updating janitor to 0.3.0 does not necessitate an update to dplyr >0.6.0.
add_totals_colwere combined into a single function,
adorn_totals(). (#57). The
add_totals_functions are now deprecated and should not be used.
adorn_crosstab()is now "dat" instead of "crosstab" (indicating that the function can be called on any data.frame, not just a result of
%>%pipe from magrittr (#107).
Deprecated the following functions:
add_totals_col()- replaced by the single function
ns_to_percents()can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57)
adorn_totals("col")retains factor class in 1st column if 1st column in the input data.frame was a factor
clean_names()now handles leading spaces (#85)
ns_to_percents()work on a 2-column data.frame (#89)
adorn_totals()now works on a grouped tibble (#97)
NA_column in the result of a
crosstab()will appear at the last column position (#109)
crosstab()now appear in the package manual (#65)
crosstab()failed to retain ill-formatted variable names only when using R 3.2.5 for Windows (#76)
add_totals_row()works on two-column data.frame (#69)
use_first_valid_of()returns POSIXct-class result when given POSIXct inputs
Submitted to CRAN!
tabyl()for factor levels that aren't present is now
mtcars %>% tabyl(mpg) %>% tabyl(n)(#54)
get_dupes()now works on variables with spaces in column names (#62)
adorn_crosstab()that formats the results of a
crosstab()for pretty printing. Shows % and N in the same cell, with the % symbol, user-specified rounding (method and number of digits), and the option to include a totals row and/or column. E.g.,
mtcars %>% crosstab(cyl, gear) %>% adorn_crosstab().
crosstab()can be called in a
mtcars %>% crosstab(cyl, gear). Thanks to @chrishaid (#34)
tabyl()can also be called in a
mtcars %>% tabyl(cyl)(#35)
crosstab()returns 0 instead of NA when there are no instances of a variable combination.
tabyl(df$vecname)retains the more-descriptive
$symbol in the column name of the result - if you want a legal R name in the result, call it as
df %>% tabyl(vecname)