The main janitor functions can: perfectly format data.frame column names; provide quick counts of variable combinations (i.e., frequency tables and crosstabs); and isolate duplicate records. Other janitor functions nicely format the tabulation results. These tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel. This package follows the principles of the "tidyverse" and works well with the pipe function %>%. janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
-- "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insight" - The New York Times, 2014
janitor has simple functions for examining and cleaning dirty data. It was built with beginning and intermediate R users in mind and is optimized for user-friendliness. Advanced R users can already do everything covered here, but with janitor they can do it faster and save their thinking for the fun stuff.
The main janitor functions:
The tabulate-and-report functions approximate popular features of SPSS and Microsoft Excel.
You can install:
the most recent officially-released version from CRAN with
the latest development version from GitHub with
Below are quick examples of how janitor tools are commonly used. A full description of each function can be found in janitor's catalog of functions vignette.
Take this roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx:
Here's that data after being read in to R:
library(pacman) # for loading packagesp_load(readxl, janitor, dplyr, here)roster_raw <- read_excel(here("dirty_data.xlsx")) # available atglimpse(roster_raw)#> Observations: 13#> Variables: 11#> $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", N...#> $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lam...#> $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Tea...#> $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English",...#> $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 347...#> $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80#> $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", ...#> $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA#> $ Certification <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science ...#> $ Certification__1 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", N...#> $ Certification__2 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
Excel formatting led to an untitled empty column and 5 empty rows at the bottom of the table (only 12 records have any actual data). Bad column names are preserved.
Clean it with janitor functions:
roster <- roster_raw %>%clean_names() %>%remove_empty(c("rows", "cols")) %>%mutate(hire_date = excel_numeric_to_date(hire_date),cert = coalesce(certification, certification_1)) %>% # from dplyrselect(-certification, -certification_1) # drop unwanted columnsroster#> # A tibble: 12 x 8#> first_name last_name employee_status subject hire_date percent_allocated full_time cert#> <chr> <chr> <chr> <chr> <date> <dbl> <chr> <chr>#> 1 Jason Bourne Teacher PE 2008-08-30 0.750 Yes Physical ed#> 2 Jason Bourne Teacher Drafting 2008-08-30 0.250 Yes Physical ed#> 3 Alicia Keys Teacher Music 2001-08-15 1.00 Yes Instr. music#> 4 Ada Lovelace Teacher <NA> 1975-05-01 1.00 Yes PENDING#> 5 Desus Nice Administration Dean 2013-06-06 1.00 Yes PENDING#> 6 Chien-Shiung Wu Teacher Physics 1930-03-20 0.500 Yes Science 6-12#> 7 Chien-Shiung Wu Teacher Chemistry 1930-03-20 0.500 Yes Science 6-12#> 8 James Joyce Teacher English 1990-05-01 0.500 No English 6-12#> 9 Hedy Lamarr Teacher Science 1976-06-08 0.500 No PENDING#> 10 Carlos Boozer Coach Basketball 2015-08-05 NA No Physical ed#> 11 Young Boozer Coach <NA> 1995-01-01 NA No Political sci.#> 12 Micheal Larsen Teacher English 2009-09-15 0.800 No Vocal music
The core janitor cleaning function is
clean_names() - call it whenever you load data into R.
get_dupes() to identify and examine duplicate records during data cleaning. Let's see if any teachers are listed more than once:
roster %>% get_dupes(first_name, last_name)#> # A tibble: 4 x 9#> first_name last_name dupe_count employee_status subject hire_date percent_allocated full_time cert#> <chr> <chr> <int> <chr> <chr> <date> <dbl> <chr> <chr>#> 1 Chien-Shiung Wu 2 Teacher Physics 1930-03-20 0.500 Yes Science…#> 2 Chien-Shiung Wu 2 Teacher Chemistry 1930-03-20 0.500 Yes Science…#> 3 Jason Bourne 2 Teacher PE 2008-08-30 0.750 Yes Physica…#> 4 Jason Bourne 2 Teacher Drafting 2008-08-30 0.250 Yes Physica…
Yes, some teachers appear twice. We ought to address this before counting employees.
A variable (or combinations of two or three variables) can be tabulated with
tabyl(). The resulting data.frame can be tweaked and formatted with the suite of
adorn_ functions for quick analysis and printing of pretty results in a report.
adorn_ functions can be helpful with non-tabyls, too.
tabyl can be called two ways:
roster %>% tabyl(subject, employee_status).
%>%pipe; this allows
tabylto be used in an analysis pipeline
table(), but pipe-able, data.frame-based, and fully featured.
roster %>%tabyl(subject)#> subject n percent valid_percent#> Basketball 1 0.08333333 0.1#> Chemistry 1 0.08333333 0.1#> Dean 1 0.08333333 0.1#> Drafting 1 0.08333333 0.1#> English 2 0.16666667 0.2#> Music 1 0.08333333 0.1#> PE 1 0.08333333 0.1#> Physics 1 0.08333333 0.1#> Science 1 0.08333333 0.1#> <NA> 2 0.16666667 NA
roster %>%filter(hire_date > as.Date("1950-01-01")) %>%tabyl(employee_status, full_time)#> employee_status No Yes#> Administration 0 1#> Coach 2 0#> Teacher 3 4
roster %>%tabyl(full_time, subject, employee_status, show_missing_levels = FALSE)#> $Administration#> full_time Dean#> Yes 1#>#> $Coach#> full_time Basketball NA_#> No 1 1#>#> $Teacher#> full_time Chemistry Drafting English Music PE Physics Science NA_#> No 0 0 2 0 0 0 1 0#> Yes 1 1 0 1 1 1 0 1
adorn_ functions dress up the results of these tabulation calls for fast, basic reporting. Here are some of the functions that augment a summary table for reporting:
roster %>%tabyl(employee_status, full_time) %>%adorn_totals("row") %>%adorn_percentages("row") %>%adorn_pct_formatting() %>%adorn_ns() %>%adorn_title("combined")#> employee_status/full_time No Yes#> Administration 0.0% (0) 100.0% (1)#> Coach 100.0% (2) 0.0% (0)#> Teacher 33.3% (3) 66.7% (6)#> Total 41.7% (5) 58.3% (7)
Pipe that right into
knitr::kable() in your RMarkdown report.
These modular adornments can be layered to reduce R's deficit against Excel and SPSS when it comes to quick, informative counts.
You are welcome to:
Patches a bug introduced in version 1.1.0 where
excel_numeric_to_date() would fail if given an input vector containing an
NAcorrectly, in version 1.1.0 the function would error if any values of the input vector were
This release was requested by CRAN to address some minor package dependency issues. It also contains several updates and additions described below.
The new function
row_to_names() handles the case where a dirty data file is read in with its names stored as a row of the data.frame, rather than in the names. This function sets the names of the data.frame to this row and optionally cleans up the rows above and including where the names were stored. Thanks to @billdenney for writing this feature.
excel_numeric_to_date() can now convert fractions of a day to time, e.g.,
excel_numeric_to_date(43001.01, include_time = TRUE) returns the POSIXlt value
"2017-09-23 00:14:24". Thanks to @billdenney.
As part of
excel_numeric_to_date() now handling times, if a Date-only result is requested (the default behavior of
include_time = FALSE), any fractional part of the date is now removed. The printed date itself is identical, but the internal representation of this object now contains only the integer part of the date. For example, while under both the old and new versions of this function the call
excel_numeric_to_date_old(42001.1) would return the Date object
as.numeric on this Date result would previously return
16432.1, while now it returns
This an improved behavior, as now
excel_numeric_to_date(42001.1, include_time = FALSE) == as.Date("2014-12-28") returns TRUE, while previously it would appear to be equivalent from the printed value but this comparison would return FALSE.
A stable version 1.0.0, with a new
tabyl API and with breaking changes to the output of
This builds on the original functionality of janitor, with similar-but-improved tools and significantly-changed implementation.
tabyl() is now a single function that can count combinations of one, two, or three variables, ala base R's
table(). The resulting
tabyl data.frames can be manipulated and formatted using a family of
adorn_ functions. See the tabyls vignette for more.
The now-redundant legacy functions
adorn_crosstab() have been deprecated, but remain in the package for now. Existing code that relies on the version of
tabyl present in janitor versions <= 0.3.1 will break if the
sort argument was used, as that argument no longer exists in
clean_names() now detects and preserves camelCase inputs, allows multiple options for case outputs of the cleaned names, and preserves whether there's space between letters and numbers. It also transliterates accented letters and turns
These changes may cause old code to break. E.g., a raw column name
variableName would now be converted to
VariableName, etc. depending on your preference), where previously it would have been converted to
To minimize this inconvenience, there's a quick fix for compatibility: you can find-and-replace to insert the argument
case = "old_janitor", preserving the old behavior of
clean_names() as of janitor version 0.3.1 (and thus not have to redo your scripts beyond that.)
No further changes are planned to
clean_names() and its results should be stable from version 1.0.0 onward.
clean_names() transliterates accented letters, e.g.,
cauoe (#120). Thanks to @fernandovmacedo.
clean_names() offers multiple options for variable name styling. In addition to
snake_case output you can select
ALL_CAPS and others. (#131).
clean_names(). And thanks to @maelle for proposing this feature.
Deprecated the functions
remove_empty_cols(), which are replaced by the single function
remove_empty()prints a message if no value is supplied for the
whichargument; to suppress this, supply a value to
which, even if it's the default
adorn_title() function adds the name of the 2nd
tabyl variable (i.e., the name of the column variable). This un-tidies the data.frame but makes the result clearer to readers (#77)
round_half_up()is now exported for public use. It's an exact implementation of http://stackoverflow.com/questions/12688717/round-up-from-5-in-r/12688836#12688836, written by @mrdwab.
tabylobjects now print with row numbers suppressed
clean_names()now retains the character
"number"in the resulting names
adorn_totals("row")handles quirky variable names in 1st column (#118)
get_dupes()returns the correct result when a variable in the input data.frame is already called
This is a bug-fix release with no new functionality or changes. It fixes a bug where
adorn_crosstab() failed if the
tibble package was version > 1.4.
Major changes to janitor are currently in development on GitHub and will be released soon. This is not that next big release.
The primary purpose of this release is to maintain accuracy given breaking changes to the dplyr package, upon which janitor is built, in dplyr version >0.6.0. This update also contains a number of minor improvements.
Critical: if you update the package
dplyr to version >0.6.0, you must update janitor to version 0.3.0 to ensure accurate results from janitor's
tabyl() function. This is due to a change in the behavior of dplyr's
_join functions (discussed in #111).
janitor 0.3.0 is compatible with this new version of dplyr as well as old versions of dplyr back to 0.5.0. That is, updating janitor to 0.3.0 does not necessitate an update to dplyr >0.6.0.
add_totals_colwere combined into a single function,
adorn_totals(). (#57). The
add_totals_functions are now deprecated and should not be used.
adorn_crosstab()is now "dat" instead of "crosstab" (indicating that the function can be called on any data.frame, not just a result of
%>%pipe from magrittr (#107).
Deprecated the following functions:
add_totals_col()- replaced by the single function
ns_to_percents()can now be called on data.frames that have non-numeric columns beyond the first one (those columns will be ignored) (#57)
adorn_totals("col")retains factor class in 1st column if 1st column in the input data.frame was a factor
clean_names()now handles leading spaces (#85)
ns_to_percents()work on a 2-column data.frame (#89)
adorn_totals()now works on a grouped tibble (#97)
NA_column in the result of a
crosstab()will appear at the last column position (#109)
crosstab()now appear in the package manual (#65)
crosstab()failed to retain ill-formatted variable names only when using R 3.2.5 for Windows (#76)
add_totals_row()works on two-column data.frame (#69)
use_first_valid_of()returns POSIXct-class result when given POSIXct inputs
Submitted to CRAN!
tabyl()for factor levels that aren't present is now
mtcars %>% tabyl(mpg) %>% tabyl(n)(#54)
get_dupes()now works on variables with spaces in column names (#62)
adorn_crosstab()that formats the results of a
crosstab()for pretty printing. Shows % and N in the same cell, with the % symbol, user-specified rounding (method and number of digits), and the option to include a totals row and/or column. E.g.,
mtcars %>% crosstab(cyl, gear) %>% adorn_crosstab().
crosstab()can be called in a
mtcars %>% crosstab(cyl, gear). Thanks to @chrishaid (#34)
tabyl()can also be called in a
mtcars %>% tabyl(cyl)(#35)
crosstab()returns 0 instead of NA when there are no instances of a variable combination.
tabyl(df$vecname)retains the more-descriptive
$symbol in the column name of the result - if you want a legal R name in the result, call it as
df %>% tabyl(vecname)