A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
These all combine naturally with
group_by() which allows you to
perform any operation “by group”. You can learn more about them in
vignette("dplyr"). As well as these single-table verbs, dplyr also
provides a variety of two-table verbs, which you can learn about in
dplyr is designed to abstract over how the data is stored. That means as
well as working with local data frames, you can also work with remote
database tables, using exactly the same R code. Install the dbplyr
package then read
vignette("databases", package = "dbplyr").
If you are new to dplyr, the best place to start is the data transformation chapter in R for Data Science.
# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")
dplyr 0.8.0 will be released on February 1st; you can install the release candidate from GitHub.
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr@rc_0.8.0")
To get a bug fix, or use a feature from the development version, you can install dplyr from GitHub.
library(dplyr)

starwars %>%
  filter(species == "Droid")
#> # A tibble: 5 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 C-3PO    167    75 <NA>       gold       yellow           112 <NA>
#> 2 R2-D2     96    32 <NA>       white, bl… red               33 <NA>
#> 3 R5-D4     97    32 <NA>       white, red red               NA <NA>
#> 4 IG-88    200   140 none       metal      red               15 none
#> 5 BB8       NA    NA none       none       black             NA none
#> # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars %>%
  select(name, ends_with("color"))
#> # A tibble: 87 x 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>
#> 1 Luke Skywalker blond      fair        blue
#> 2 C-3PO          <NA>       gold        yellow
#> 3 R2-D2          <NA>       white, blue red
#> 4 Darth Vader    none       white       yellow
#> 5 Leia Organa    brown      light       brown
#> # … with 82 more rows

starwars %>%
  mutate(name, bmi = mass / ((height / 100) ^ 2)) %>%
  select(name:mass, bmi)
#> # A tibble: 87 x 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172    77  26.0
#> 2 C-3PO             167    75  26.9
#> 3 R2-D2              96    32  34.7
#> 4 Darth Vader       202   136  33.3
#> 5 Leia Organa       150    49  21.8
#> # … with 82 more rows

starwars %>%
  arrange(desc(mass))
#> # A tibble: 87 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
#> 1 Jabb…    175  1358 <NA>       green-tan… orange         600   herma…
#> 2 Grie…    216   159 none       brown, wh… green, y…       NA   male
#> 3 IG-88    200   140 none       metal      red             15   none
#> 4 Dart…    202   136 none       white      yellow          41.9 male
#> 5 Tarf…    234   136 brown      brown      blue            NA   male
#> # … with 82 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1)
#> # A tibble: 9 x 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        5  69.8
#> 2 Gungan       3  74
#> 3 Human       35  82.8
#> 4 Kaminoan     2  88
#> 5 Mirialan     2  53.1
#> # … with 4 more rows
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
The error could not find function "n" or the warning
Calling `n()` without importing or prefixing it is deprecated, use `dplyr::n()`
indicates when functions like
n(), row_number(), ... are not imported or prefixed.
The easiest fix is to import dplyr with
import(dplyr) in your NAMESPACE or
#' @import dplyr in a roxygen comment. Alternatively, such functions can be
imported selectively, like any other function, with
importFrom(dplyr, n) in the NAMESPACE or
#' @importFrom dplyr n in a roxygen comment. The third option is
to prefix them, i.e. use dplyr::n().
If you see
checking S3 generic/method consistency in R CMD check for your
package, note that :
Error: `.data` is a corrupt grouped_df, ... signals code that makes
wrong assumptions about the internals of a grouped data frame.
New selection helper
group_cols(). It can be called in selection contexts such as
select() and matches the grouping variables of grouped tibbles.
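As a minimal sketch of the helper described above (the grouping by cyl and am on mtcars is only an illustrative choice, not taken from the release notes), group_cols() can be used inside select() to pick exactly the grouping variables:

```r
library(dplyr)

# Illustrative data: mtcars grouped by two of its columns
by_cyl_am <- mtcars %>% group_by(cyl, am)

# group_cols() matches the grouping variables, so only cyl and am are kept
grouping_only <- by_cyl_am %>% select(group_cols())
names(grouping_only)
```

The result stays grouped; only the non-grouping columns are dropped.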
last_col() is re-exported from tidyselect (#3584).
group_trim() drops unused levels of factors that are used as grouping variables.
nest_join() creates a list column of the matching rows.
is equivalent to
group_nest() is similar to
tidyr::nest() but focusing on the variables to nest by
instead of the nested columns.
starwars %>%
  group_by(species, homeworld) %>%
  group_nest()

starwars %>%
  group_nest(species, homeworld)
group_split() is similar to
base::split() but operating on existing groups when
applied to a grouped data frame, or subject to the data mask on ungrouped data frames
starwars %>%
  group_by(species, homeworld) %>%
  group_split()

starwars %>%
  group_split(species, homeworld)
group_map() and group_walk() are purrr-like functions to iterate on groups
of a grouped data frame, jointly identified by the data subset (exposed as
.x) and the
data key (a one row tibble, exposed as .y).
group_map() returns a grouped data frame that
combines the results of the function,
group_walk() is only used for side effects and returns
its input invisibly.
mtcars %>%
  group_by(cyl) %>%
  group_map(~ head(.x, 2L))
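For the side-effect variant, a small sketch may help. It assumes mtcars grouped by cyl purely for illustration, and the message printed per group is made up for this example. The lambda reads the group's rows from .x and the one-row key tibble from .y:

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# The lambda is called once per group, only for its side effect (printing);
# .x holds the rows of the group, .y holds the one-row key tibble
result <- by_cyl %>%
  group_walk(~ cat("cyl =", .y$cyl, "has", nrow(.x), "rows\n"))

# group_walk() hands back its input, so the pipeline can continue from it
identical(result, by_cyl)
```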
distinct_prepare(), previously known as
distinct_vars() is exported. This is mostly useful for
alternative backends (e.g.
group_by() gains the
.drop argument. When set to
FALSE the groups are generated
based on factor levels, hence some groups may be empty (#341).
# 3 groups
tibble(
  x = 1:2,
  f = factor(c("a", "b"), levels = c("a", "b", "c"))
) %>%
  group_by(f, .drop = FALSE)

# the order of the grouping variables matters
df <- tibble(
  x = c(1, 2, 1, 2),
  f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)
df %>% group_by(f, x, .drop = FALSE)
df %>% group_by(x, f, .drop = FALSE)
The default behaviour drops the empty groups as in the previous versions.
tibble(
  x = 1:2,
  f = factor(c("a", "b"), levels = c("a", "b", "c"))
) %>%
  group_by(f)
filter() and slice() gain a
.preserve argument to control which groups they should keep. The default
filter(.preserve = FALSE) recalculates the grouping structure based on the resulting data,
otherwise it is kept as is.
df <- tibble(
  x = c(1, 2, 1, 2),
  f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
) %>%
  group_by(x, f, .drop = FALSE)

df %>% filter(x == 1)
df %>% filter(x == 1, .preserve = TRUE)
The notion of lazily grouped data frames has disappeared. All dplyr verbs now immediately recalculate the grouping structure, and respect the levels of factors.
Subsets of columns now properly dispatch to the
[[ method when the column
is an object (a vector with a class) instead of making assumptions on how the
column should be handled. The
[ method must handle integer indices, including NA_integer_, i.e.
x[NA_integer_] should produce a vector of the same class as
x with whatever represents a missing value.
tally() works correctly on non-data frame table sources such as
sample_frac() can use
distinct() respects the order of the variables provided (#3195, @foo-bar-baz-qux)
and handles the 0 rows and 0 columns special case (#2954).
combine() uses tidy dots (#3407).
group_indices() can be used without argument in expressions in verbs (#1185).
with grouped tibbles now informs you that the grouping variables are
ignored. In the case of the
_all() verbs, the message invites you to use
mutate_at(df, vars(-group_cols())) (or the equivalent
instead if you'd like to make it explicit in your code that the operation is
not applied on the grouping variables.
Scoped variants of
arrange() respect the
.by_group argument (#3504).
last() hybrid functions fall back to R evaluation when given no arguments (#3589).
mutate() removes a column when the expression evaluates to
NULL for all groups (#2945).
grouped data frames support
[, drop = TRUE] (#3714).
New low-level constructor
new_grouped_df() and validator
glimpse() prints group information on grouped tibbles (#3384).
Scoped filter variants now support functions and purrr-like lambdas:
mtcars %>% filter_at(vars(hp, vs), ~ . %% 2 == 0)
combine() are questioning (#3494).
funs() is soft-deprecated and will start issuing warnings in a future version.
Scoped variants for
summarise_at() excludes the grouping variables (#3613).
summarise_at() handle utf-8 names (#2967).
R expressions that cannot be handled with native code are now evaluated with unwind-protection when available (on R 3.5 and later). This improves the performance of dplyr on data frames with many groups (and hence many expressions to evaluate). We benchmarked that computing a grouped average is consistently twice as fast with unwind-protection enabled.
Unwind-protection also makes dplyr more robust in corner cases because it ensures the C++ destructors are correctly called in all circumstances (debugger exit, captured condition, restart invocation).
Improved performance for wide tibbles (#3335).
sd() for logical vectors (#3189).
Hybrid version of
sum(na.rm = FALSE) exits early when there are missing values.
This considerably improves performance when there are missing values early in the vector (#3288).
group_by() does not trigger the additional
mutate() on simple uses of the
.data pronoun (#3533).
The grouping metadata of grouped data frames has been reorganized into a single tidy tibble, which can be accessed
with the new
group_data() function. The grouping tibble consists of one column per grouping variable,
followed by a list column of the (1-based) indices of the groups. The new
group_rows() function retrieves
that list of indices (#3489).
# the grouping metadata, as a tibble
group_by(starwars, homeworld) %>%
  group_data()

# the indices
group_by(starwars, homeworld) %>%
  group_data() %>%
  pull(.rows)

group_by(starwars, homeworld) %>%
  group_rows()
Hybrid evaluation has been completely redesigned for better performance and stability.
Add documentation example for moving variable to back in
column wise functions are better documented, in particular explaining when grouping variables are included as part of the selection.
exprs() is no longer exported to avoid conflicts with
The MASS package is explicitly suggested to fix CRAN warnings on R-devel (#3657).
Set operations like
setdiff() reconstruct groups metadata (#3587) and keep the order of the rows (#3839).
Using namespaced calls to
base::unique() from C++ code
to avoid ambiguities when these functions are overridden (#3644).
Fix rchk errors (#3693).
The major change in this version is that dplyr now depends on the selecting
backend of the tidyselect package. If you have been linking to
dplyr::select_helpers documentation topic, you should update the link to
Another change that causes warnings in packages is that dplyr now exports the
exprs() function. This causes a collision with
import functions from dplyr selectively rather than in bulk, or do not import
Biobase::exprs() and refer to it with a namespace qualifier.
distinct(data, "string") now returns a one-row data frame again. (The
previous behavior was to return the data unchanged.)
do() operations with more than one named argument can access
Reindexing grouped data frames (e.g. after
never updates the
"class" attribute. This also avoids unintended updates
to the original object (#3438).
Fixed rare column name clash in
..._join() with non-join
columns of the same name in both tables (#3266).
row_number() ordering to use the locale-dependent
ordering functions in R when dealing with character vectors, rather than
always using the C-locale ordering function in C (#2792, @foo-bar-baz-qux).
Summaries of summaries (such as
summarise(b = sum(a), c = sum(b))) are
now computed using standard evaluation for simplicity and correctness, but
slightly slower (#3233).
summarise() for empty data frames with zero columns (#3071).
syms() are now
syms() construct symbols from strings or character
expr() variants are equivalent to
enquo() but return simple expressions rather than quosures. They support
dplyr now depends on the new tidyselect package to power
pull() and their variants (#2896). Consequently
soft-deprecated and will start issuing warnings in a future version.
Following the switch to tidyselect,
rename() fully support
character vectors. You can now unquote variables like this:
vars <- c("disp", "cyl")
select(mtcars, !! vars)
select(mtcars, -(!! vars))
Note that this only works in selecting functions because in other contexts
strings and character vectors are ambiguous. For instance strings are a valid
input in mutating operations and
mutate(df, "foo") creates a new column by
recycling "foo" to the number of rows.
Support for raw vector columns in
raw support initially) (#1803).
bind_cols() handles unnamed list (#3402).
bind_rows() works around corrupt columns that have the object bit set
while having no class attribute (#3349).
logical() when all inputs are
NULL (or when there
are no inputs) (#3365, @zeehio).
distinct() now supports renaming columns (#3234).
Hybrid evaluation simplifies
foo() (#3309). Hybrid
functions can now be masked by regular R functions to turn off hybrid
evaluation (#3255). The hybrid evaluator finds functions from dplyr even if
dplyr is not attached (#3456).
In mutate() it is now illegal to use
data.frame in the rhs (#3298).
row_number() works on empty subsets (#3454).
vars() now treat
NULL as empty inputs (#3023).
Scoped select and rename functions (
now work with grouped data frames, adapting the grouping as necessary
group_by_at() can group by an existing grouping variable
arrange_at() can use grouping variables (#3332).
slice() no longer enforce tibble classes when input is a simple
data.frame, and ignores 0 (#3297, #3313).
transmute() no longer prints a message when including a group variable.
funs()(#3094) and set operations (e.g.
union()) (#3238, @edublancas).
Better error message if dbplyr is not installed when accessing database backends (#3225).
arrange() fails gracefully on
data.frame columns (#3153).
Corrected error message when calling
cbind() with an object of wrong
Add warning with explanation to
distinct() if any of the selected columns
are of type
list (#3088, @foo-bar-baz-qux), or when used on unknown columns
Show clear error message for bad arguments to
Better error message in
..._join() when joining data frames with duplicate
NA column names. Joining such data frames with a semi- or anti-join
now gives a warning, which may be converted to an error in future versions
Dedicated error message when trying to use columns of the
Period classes (#2568).
.onDetach() hook that allows for plyr to be loaded and attached
without the warning message that says functions in dplyr will be masked,
since dplyr is no longer attached (#3359, @jwnorman).
sample_frac() on grouped data frames are now faster, especially for those with a large number of groups (#3193, @saurfang).
Compute variable names for joins in R (#3430).
Bumped Rcpp dependency to 0.12.15 to avoid imperfect detection of
values in hybrid evaluation fixed in RcppCore/Rcpp#790 (#2919).
Avoid cleaning the data mask, a temporary environment used to evaluate
expressions. If the environment, in which e.g. a
is evaluated, is preserved until after the operation, accessing variables
from that environment now gives a warning but still returns
Fix recent Fedora and ASAN check errors (#3098).
Avoid dependency on Rcpp 0.12.10 (#3106).
Fixed protection error that occurred when creating a character column using grouped
Fixed a rare problem with accessing variable values in
summarise() when all groups have size one (#3050).
distinct() now throws an error when used on unknown columns
Fixed rare out-of-bounds memory write in
slice() when negative indices beyond the number of rows were involved (#3073).
summarise() no longer change the grouped vars of the original data (#3038).
nth(default = var),
first(default = var) and
last(default = var) fall back to standard evaluation in a grouped operation instead of triggering an error (#3045).
case_when() now works if all LHS are atomic (#2909), or when LHS or RHS values are zero-length vectors (#3048).
NA on the LHS (#2927).
Semi- and anti-joins now preserve the order of left-hand-side data frame (#3089).
Improved error message for invalid list arguments to
Grouping by character vectors is now faster (#2204).
Fixed a crash that occurred when an unexpected input was supplied to
call argument of
Use new versions of bindrcpp and glue to avoid protection problems. Avoid wrapping arguments to internal error functions (#2877). Fix two protection mistakes found by rchk (#2868).
Fix C++ error that caused compilation to fail on mac cran (#2862)
Fix undefined behaviour in
assigned instead of
NA_LOGICAL. (#2855, @zeehio)
top_n() now executes operations lazily for compatibility with
database backends (#2848).
Reuse of new variables created in ungrouped
again, regression introduced in dplyr 0.7.0 (#2869).
Quosured symbols do not prevent hybrid handling anymore. This should fix many performance issues introduced with tidyeval (#2822).
Five new datasets provide some interesting built-in datasets to demonstrate dplyr verbs (#2094):
starwars dataset about starwars characters; has list columns
storms has the trajectories of ~200 tropical storms
band_instruments2 has some simple data to demonstrate joins.
add_tally() for adding an
n column within groups
arrange() for grouped data frames gains a
.by_group argument so you
can choose to sort by groups if you want to (defaults to
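The difference can be seen in a short sketch (grouping mtcars by cyl and sorting by mpg are illustrative choices, not part of the release notes): without .by_group the sort ignores the grouping, while .by_group = TRUE sorts by the grouping variable first.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# Sorted by mpg only; the grouping variable is not used for ordering
ignoring_groups <- by_cyl %>% arrange(mpg)

# Sorted by cyl first, then by mpg within each cyl group
within_groups <- by_cyl %>% arrange(mpg, .by_group = TRUE)

head(ignoring_groups$cyl)
head(within_groups$cyl)
```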
pull() generic for extracting a single column either by name or position
(either from the left or the right). Thanks to @paulponcet for the idea (#2054).
This verb is powered with the new
select_var() internal helper,
which is exported as well. It is like
select_vars() but returns a
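A brief sketch of pull() on mtcars (chosen here only as a convenient built-in data frame) shows the three ways of naming the column: by name, by position from the left, and by negative position from the right.

```r
library(dplyr)

by_name  <- pull(mtcars, cyl)  # column selected by name
by_left  <- pull(mtcars, 2)    # second column counting from the left (cyl)
by_right <- pull(mtcars, -1)   # first column counting from the right (carb)
```

Each call returns a plain vector rather than a one-column data frame.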
as_tibble() is re-exported from tibble. This is the recommended way to create
tibbles from existing data frames.
tbl_df() has been softly deprecated.
tribble() is now imported from tibble (#2336, @chrMongeau); this
is now preferred to
dplyr no longer messages that you need dtplyr to work with data.table (#2489).
summarise_each_q() functions have been removed.
failwith(). I'm not even sure why it was here.
summarise_each(), these functions
print a message which will be changed to a warning in the next release.
.env argument to
sample_frac() is defunct,
passing a value to this argument prints a message which will be changed to a
warning in the next release.
This version of dplyr includes some major changes to how database connections work. By and large, you should be able to continue using your existing dplyr database code without modification, but there are two big changes that you should be aware of:
Almost all database related code has been moved out of dplyr and into a
new package, dbplyr. This makes dplyr
simpler, and will make it easier to release fixes for bugs that only affect
src_sqlite() will still
live in dplyr so your existing code continues to work.
It is no longer necessary to create a remote "src". Instead you can work directly with the database connection returned by DBI. This reflects the maturity of the DBI ecosystem. Thanks largely to the work of Kirill Müller (funded by the R Consortium) DBI backends are now much more consistent, comprehensive, and easier to use. That means that there's no longer a need for a layer in between you and DBI.
You can continue to use
src_sqlite(), but I recommend a new style that makes the connection to DBI more clear:
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "mtcars", mtcars)

mtcars2 <- tbl(con, "mtcars")
mtcars2
This is particularly useful if you want to perform non-SELECT queries as you can do whatever you want with
If you've implemented a database backend for dplyr, please read the backend news to see what's changed from your perspective (not much). If you want to ensure your package works with both the current and previous version of dplyr, see
wrap_dbplyr_obj() for helpers.
Internally, column names are always represented as character vectors, and not as language symbols, to avoid encoding problems on Windows (#1950, #2387, #2388).
Error messages and explanations of data frame inequality are now encoded in UTF-8, also on Windows (#2441).
Joins now always reencode character columns to UTF-8 if necessary. This gives a nice speedup, because now pointer comparison can be used instead of string comparison, but relies on a proper encoding tag for all strings (#2514).
Fixed problems when joining factor or character encodings with a mix of native and UTF-8 encoded values (#1885, #2118, #2271, #2451).
group_by() for data frames that have UTF-8 encoded names (#2284, #2382).
group_vars() generic that returns the grouping as character vector, to
avoid the potentially lossy conversion to language symbols. The list returned
group_by_prepare() now has a new
group_names component (#1950, #2384).
transmute() now have scoped variants (verbs suffixed with
these variants apply an operation to a selection of variables.
The scoped verbs taking predicates (
etc) now support S3 objects and lazy tables. S3 objects should
implement methods for
tbl_vars(). For lazy
tables, the first 100 rows are collected and the predicate is
applied on this subset of the data. This is robust for the common
case of checking the type of a column (#2129).
Summarise and mutate colwise functions pass
... on to the manipulation
The performance of colwise verbs like
mutate_all() is now back to
where it was in
funs() has better handling of namespaced functions (#2089).
Fix issue with
summarise_if() when a predicate
function returns a vector of
FALSE (#1989, #2009, #2011).
dplyr has a new approach to non-standard evaluation (NSE) called tidyeval.
It is described in detail in
vignette("programming") but, in brief, gives you
the ability to interpolate values in contexts where dplyr usually works with expressions:
my_var <- quo(homeworld)

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
This means that the underscored version of each main verb is no longer needed, and so these functions have been deprecated (but remain around for backward compatibility).
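One common use of this interpolation is wrapping dplyr verbs in your own functions. The sketch below is only an illustration: mean_by() is a hypothetical helper invented for this example, not a dplyr function. It captures its arguments with enquo() and unquotes them with !!.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): mean of `value` per level of `group`
mean_by <- function(data, group, value) {
  group <- enquo(group)  # capture the expression the caller typed
  value <- enquo(value)
  data %>%
    group_by(!!group) %>%
    summarise(avg = mean(!!value, na.rm = TRUE))
}

res <- mean_by(mtcars, cyl, mpg)
res
```

The caller writes bare column names, just as with the built-in verbs, and no underscored variant is involved.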
sample_frac() now use
tidyeval to capture their arguments by expression. This makes it
possible to use unquoting idioms (see
fixes scoping issues (#2297).
Most verbs taking dots now ignore the last argument if empty. This makes it easier to copy lines of code without having to worry about deleting trailing commas (#1039).
[API] The new
.env environments can be used inside
all verbs that operate on data:
.data$column_name accesses the column
.env$var accesses the external variable
Columns or external variables named
.env are shadowed, use
.env$... to access them. (
.data implements strict
matching also for the
$ operator (#2591).)
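The two pronouns are most useful when a column and an external variable share a name. In this sketch the variable cyl is defined purely for illustration so that it clashes with the cyl column of mtcars:

```r
library(dplyr)

# An external variable that happens to share its name with a column
cyl <- 6

# .data$cyl refers to the column, .env$cyl to the variable defined above
six_cyl <- mtcars %>% filter(.data$cyl == .env$cyl)
nrow(six_cyl)
```

Without the pronouns, both sides of the comparison would resolve to the column.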
global() functions have been removed. They were never
documented officially. Use the new
.env environments instead.
Expressions in verbs are now interpreted correctly in many cases that
failed before (e.g., use of
case_when(), nonstandard evaluation, ...).
These expressions are now evaluated in a specially constructed temporary
environment that retrieves column data on demand with the help of the
bindrcpp package (#2190). This temporary environment poses restrictions on
<- inside verbs. To prevent leaking of broken bindings,
the temporary environment is cleared after the evaluation (#2435).
xxx_join.tbl_df(na_matches = "never") treats all
NA values as
different from each other (and from any other value), so that they never
match. This corresponds to the behavior of joins for database sources,
and of database joins in general. To match
NA values, pass
na_matches = "na" to the join verbs; this is only supported for data frames.
The default is
na_matches = "na", kept for the sake of compatibility
to v0.5.0. It can be tweaked by calling
pkgconfig::set_config("dplyr::na_matches", "na") (#2033).
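To make the two settings concrete, here is a small sketch with two made-up tibbles (x and y and their columns are assumptions for this example) that each contain one NA key:

```r
library(dplyr)

x <- tibble(key = c(1, NA), x_val = c("a", "b"))
y <- tibble(key = c(1, NA), y_val = c("c", "d"))

# "na": the NA key in x matches the NA key in y
matched <- inner_join(x, y, by = "key", na_matches = "na")

# "never": NA keys match nothing, as in database joins
unmatched <- inner_join(x, y, by = "key", na_matches = "never")

nrow(matched)
nrow(unmatched)
```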
common_by() gets a better error message for unexpected inputs (#2091)
Fix groups when joining grouped data frames with duplicate columns (#2330, #2334, @davidkretch).
One of the two join suffixes can now be an empty string, dplyr no longer hangs (#2228, #2445).
Anti- and semi-joins warn if factor levels are inconsistent (#2741).
Warnings about join column inconsistencies now contain the column names (#2728).
For selecting variables, the first selector decides if it's an inclusive
selection (i.e., the initial column list is empty), or an exclusive selection
(i.e., the initial column list contains all columns). This means that
select(mtcars, contains("am"), contains("FOO"), contains("vs")) now returns
vs columns like in dplyr 0.4.3 (#2275, #2289, @r2evans).
Select helpers now throw an error if called when no variables have been set (#2452)
Helper functions in
select() (and related verbs) are now evaluated
in a context where column names do not exist (#2184).
select() (and the internal function
select_vars()) now support
column names in addition to column positions. As a result,
select(mtcars, "cyl") are now allowed.
coalesce() now support splicing of
arguments with rlang's
count() now preserves the grouping of its input (#2021).
distinct() no longer duplicates variables (#2001).
distinct() with a grouped data frame works the same way as
distinct() on an ungrouped data frame, namely it uses all
copy_to() now returns its output invisibly (since you're often just
calling for the side-effect).
lag() throw informative error if used with ts objects (#2219)
mutate() recycles list columns of length 1 (#2171).
mutate() gives better error message when attempting to add a non-vector
column (#2319), or attempting to remove a column with
NULL (#2187, #2439).
summarise() now correctly evaluates newly created factors (#2217), and
can create ordered factors (#2200).
summarise() uses summary variables correctly (#2404, #2453).
summarise() no longer converts character
NA to empty strings (#1839).
all_equal() now reports multiple problems as a character vector (#1819, #2442).
all_equal() checks that factor levels are equal (#2440, #2442).
bind_cols() give an error for database tables (#2373).
bind_rows() works correctly with
NULL arguments and an
(#2056), and also for zero-column data frames (#2175).
combine() are more strict when coercing.
Logical values are no longer coerced to integer and numeric. Date, POSIXct
and other integer or double-based classes are no longer coerced to integer or
double as there is a chance of attributes or information being lost
bind_cols() now calls
tibble::repair_names() to ensure that all
names are unique (#2248).
bind_cols() handles empty argument list (#2048).
bind_cols() better handles
NULL inputs (#2303, #2443).
bind_rows() explicitly rejects columns containing data frames
bind_cols() now accept vectors. They are treated
as rows by the former and columns by the latter. Rows require inner
c(col1 = 1, col2 = 2), while columns require outer
col1 = c(1, 2). Lists are still treated as data frames but
can be spliced explicitly with
bind_rows(!!! x) (#1676).
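The inner versus outer naming rule is easier to see side by side. This is a sketch with made-up values; row_list is a name introduced only for this example:

```r
library(dplyr)

# bind_rows(): each named vector is one row, inner names become the columns
rows <- bind_rows(c(col1 = 1, col2 = 2), c(col1 = 3, col2 = 4))

# bind_cols(): each vector is one column, outer (argument) names become the columns
cols <- bind_cols(col1 = c(1, 3), col2 = c(2, 4))

# A list of row vectors can be spliced explicitly with !!!
row_list <- list(c(col1 = 1, col2 = 2), c(col1 = 3, col2 = 4))
spliced <- bind_rows(!!!row_list)
```

All three calls are intended to build the same two-row, two-column table.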
rbind_all() now call
.Deprecated(), they will be removed
in the next CRAN release. Please use
NA values (#2203, @zeehio)
bind_rows() with character and factor types now always warn
about the coercion to character (#2317, @zeehio)
mutate coerces results from grouped dataframes accepting combinable data
types (such as
numeric). (#1892, @zeehio)
%in% gets new hybrid handler (#126).
between() returns NA if
NA (fixes #2562).
NA values (#2000, @tjmahr).
nth() have better default values for factor,
Dates, POSIXct, and data frame inputs (#2029).
Fixed segmentation faults in hybrid evaluation of
lag(). These functions now always fall back to the R
implementation if called with arguments that the hybrid evaluator cannot
handle (#948, #1980).
n_distinct() gets larger hash tables given slightly better performance (#977).
ntile() are more careful about proper data types of their return values (#2306).
NA when computing group membership (#2564).
lag() enforces integer
n (#2162, @kevinushey).
max() now always return a
numeric and work correctly
in edge cases (empty input, all
NA, ...) (#2305, #2436).
min_rank("string") no longer segfaults in hybrid evaluation (#2279, #2444).
recode() can now recode a factor to other types (#2268)
.dots argument to support passing replacements as list
Many error messages are more helpful by referring to a column name or a position in the argument list (#2448).
is_grouped_df() alias to
tbl_vars() now has a
group_vars argument set to
FALSE, group variables are not returned.
Fixed segmentation fault after calling
rename() on an invalid grouped
data frame (#2031).
rename_vars() gains a
strict argument to control if an
error is thrown when you try and rename a variable that doesn't
Fixed undefined behavior for
slice() on a zero-column data frame (#2490).
Fixed very rare case of false match during join (#2515).
Restricted workaround for
match() to R 3.3.0. (#1858).
dplyr now warns on load when the version of R or Rcpp during installation is different to the currently installed version (#2514).
Fixed improper reuse of attributes when creating a list column in
summarise() always strip the
names attribute from new
or updated columns, even for ungrouped operations (#1689).
Fixed rare error that could lead to a segmentation fault in
all_equal(ignore_col_order = FALSE) (#2502).
The "dim" and "dimnames" attributes are always stripped when copying a vector (#1918, #2049).
rowwise are registered officially as S3 classes.
This makes them easier to use with S4 (#2276, @joranE, #2789).
All operations that return tibbles now include the
This is important for correct printing with tibble 1.3.1 (#2789).
Makeflags uses PKG_CPPFLAGS for defining preprocessor macros.
astyle formatting for C++ code, tested but not changed as part of the tests (#2086, #2103).
Update RStudio project settings to install tests (#1952).
Rcpp::interfaces() to register C callable interfaces, and registering all native exported functions via
useDynLib(.registration = TRUE) (#2146).
Formatting of grouped data frames now works by overriding the
tbl_sum() generic instead of
print(). This means that the output is more consistent with tibble, and that
format() is now supported also for SQL sources (#2781).
arrange() once again ignores grouping (#1206).
distinct() now only keeps the distinct variables. If you want to return
all variables (using the first row for non-distinct values) use
.keep_all = TRUE (#1110). For SQL sources,
.keep_all = FALSE is
GROUP BY, and
.keep_all = TRUE raises an error
(#1937, #1942, @krlmlr). (The default behaviour of using all variables
when none are specified remains - this note only applies if you select
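A short sketch on a made-up tibble (df, g and v are illustrative names) shows the difference between the default and .keep_all = TRUE for local data frames:

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), v = 1:3)

# Only the distinct variable is kept
only_g <- distinct(df, g)

# All variables are kept, using the first row for each distinct value of g
with_all <- distinct(df, g, .keep_all = TRUE)
```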
The select helper functions
ends_with() etc are now
real exported functions. This means that you'll need to import those
functions if you're using them from a package where dplyr is not attached.
dplyr::select(mtcars, starts_with("m")) used to work, but
now you'll need
The long deprecated
%.% have been removed.
id() has been deprecated. Please use
rbind_list() are formally deprecated. Please use
bind_rows() instead (#803).
Outdated benchmarking demos have been removed (#1487).
Code related to starting and signalling clusters has been moved out to multidplyr.
coalesce() finds the first non-missing value from a set of vectors.
(#1666, thanks to @krlmlr for initial implementation).
case_when() is a general vectorised if + else if (#631).
if_else() is a vectorised if statement: it's a stricter (type-safe),
faster, and more predictable version of
ifelse(). In SQL it is
translated to a
na_if() makes it easy to replace a certain value with an
In SQL it is translated to
near(x, y) is a helper for
abs(x - y) < tol (#1607).
recode() is a vectorised equivalent to
union_all() method. Maps to
UNION ALL for SQL sources,
for data frames/tbl_dfs, and
combine() for vectors (#1045).
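As a minimal sketch of the vector helpers listed above (the input vectors `x` and `y` and the sentinel value `-99` are made up for illustration, and a reasonably recent dplyr is assumed):

```r
library(dplyr)

# Hypothetical inputs: -99 plays the role of a "missing" sentinel.
x <- c(1, NA, 3, -99)
y <- c(10, 20, NA, 40)

# coalesce(): first non-missing value at each position.
filled <- coalesce(x, y)

# na_if(): replace a specific value with NA.
cleaned <- na_if(x, -99)

# if_else(): type-strict ifelse() with explicit handling of NA conditions.
size <- if_else(x > 2, "big", "small", missing = "unknown")

# case_when(): vectorised if / else if; the TRUE branch acts as the fallback.
label <- case_when(
  x < 0  ~ "negative",
  x < 2  ~ "low",
  x >= 2 ~ "high",
  TRUE   ~ "unknown"
)

# near(): compare floating point numbers with a tolerance.
close <- near(sqrt(2)^2, 2)
```

Unlike `ifelse()`, `if_else()` requires both branches to have compatible types, which is what makes it more predictable.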
A new family of functions replace
mutate_each() (which will thus be deprecated in a future release).
mutate_all() apply a function to all columns
mutate_at() operate on a subset of
columns. These columns are selected with either a character vector
of column names, a numeric vector of column positions, or a column
select() semantics generated by the new
columns() helper. In addition,
take a predicate function or a logical vector (these verbs currently
require local sources). All these functions can now take ordinary
functions instead of a list of functions generated by
(though this is only useful for local sources). (#1845, @lionel-)
select_if() lets you select columns with a predicate function.
Only compatible with local sources. (#497, #1569, @lionel-)
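A short sketch of the predicate-based helpers described above, using the built-in `iris` data set. It assumes a local data frame; in recent dplyr releases these scoped verbs are superseded by `across()`, but they remain available.

```r
library(dplyr)

# Keep only the columns for which the predicate returns TRUE.
numeric_only <- select_if(iris, is.numeric)

# Apply an ordinary function (no funs() wrapper) to every remaining column.
doubled <- mutate_all(numeric_only, function(v) v * 2)

names(numeric_only)
```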
All data table related code has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package. If both data.table and dplyr are loaded, you'll get a message reminding you to load dtplyr.
Functions related to the creation and coercion of
tbl_dfs, now live in their own package: tibble. See
vignette("tibble") for more details.
[[ methods that never do partial matching (#1504), and throw
an error if the variable does not exist.
all_equal() allows you to compare data frames ignoring row and column order,
and optionally ignoring minor differences in type (e.g. int vs. double)
(#821). The test handles the case where the df has 0 columns (#1506).
The test fails when convert is
FALSE and types don't match (#1484).
all_equal() shows a better error message when comparing raw values
or when types are incompatible and
convert = TRUE (#1820, @krlmlr).
add_row() makes it easy to add a new row to a data frame (#1021)
as_data_frame() is now an S3 generic with methods for lists (the old
as_data_frame()), data frames (trivial), and matrices (with efficient
C++ implementation) (#876). It no longer strips subclasses.
The internals of
as_data_frame() have been aligned,
as_data_frame() will now automatically recycle length-1 vectors.
Both functions give more informative error messages if you attempt to
create an invalid data frame. You can no longer create a data frame with
duplicated names (#820). Both check for
POSIXlt columns, and tell you to
POSIXct instead (#813).
frame_data() properly constructs rectangular tables (#1377, @kevinushey),
and supports list-cols.
glimpse() is now a generic. The default method dispatches to
(#1325). It now (invisibly) returns its first argument (#1570).
lst_() which create lists in the same way that
data_frame_() create data frames (#1290).
print.tbl_df() is considerably faster if you have very wide data frames.
It will now also only list the first 100 additional variables not already
on screen - control this with the new
n_extra parameter to
(#1161). When printing a grouped data frame the number of groups is now
printed with thousands separators (#1398). The type of list columns
is correctly printed (#1379)
setOldClass(c("tbl_df", "tbl", "data.frame")) to help
with S4 dispatch (#969).
tbl_df automatically generates column names (#1606).
as_data_frame.tbl_cube() (#1563, @krlmlr).
tbl_cubes are now constructed correctly from data frames, duplicate
dimension values are detected, missing dimension values are filled
NA. The construction from data frames now guesses the measure
variables by default, and allows specification of dimension and/or
measure variables (#1568, @krlmlr).
Swap order of
met_name arguments in
matrix) for consistency with
as.tbl_cube.data.frame. Also, the
met_name argument to
as.tbl_cube.table now defaults to
"Freq" for consistency with
as.data.frame.table (@krlmlr, #1374).
as_data_frame() on SQL sources now returns all rows (#1752, #1821,
compute() gets new parameters
unique_indexes that make
it easier to add indexes (#1499, @krlmlr).
db_explain() gains a default method for DBIConnections (#1177).
The backend testing system has been improved. This led to the removal of
temp_srcs(). In the unlikely event that you were using this function,
you can instead use
You can now use
full_join() with remote tables (#1172).
src_memdb() is a session-local in-memory SQLite database.
memdb_frame() works like
data_frame(), but creates a new table in
src_sqlite() now uses a stricter quoting character,
`, instead of
". SQLite "helpfully" will convert
"x" into a string if there is
no identifier called x in the current scope (#1426).
src_sqlite() throws errors if you try and use it with window functions
filter.tbl_sql() now puts parens around each argument (#934).
- is better translated (#1002).
escape.POSIXt() method makes it easier to use date times. The date is
rendered in ISO 8601 format in UTC, which should work in most databases
is.na() gets a missing space (#1695).
is.null() get extra parens to make precedence
more clear (#1695).
pmax() are translated to
Work on ungrouped data (#1061).
Warning if order is not set on cumulative window functions.
Multiple partitions or ordering variables in windowed functions no longer generate extra parentheses, so should work for more databases (#1060)
This version includes an almost total rewrite of how dplyr verbs are translated into SQL. Previously, I used a rather ad-hoc approach, which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so in this version I've implemented a much richer internal data model. Now there is a three step process:
When applied to a
tbl_lazy, each dplyr verb captures its inputs
and stores in a
op (short for operation) object.
sql_build() iterates through the operations to build up an
object that represents a SQL query. These objects are convenient for
testing as they are lists, and are backend agnostic.
sql_render() iterates through the queries and generates the SQL,
using generics (like
sql_select()) that can vary based on the
In the short-term, this increased abstraction is likely to lead to some minor performance decreases, but the chance of dplyr generating correct SQL is much much higher. In the long-term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries.
If you have written a dplyr backend, you'll need to make some minor changes to your package:
sql_join() has been considerably simplified - it is now only responsible
for generating the join query, not for generating the intermediate selects
that rename the variable. Similarly for
sql_semi_join(). If you've
provided new methods in your backend, you'll need to rewrite.
select_query() gains a distinct argument which is used for generating
distinct(). It loses the
offset argument which was
never used (and hence never tested).
src_translate_env() has been replaced by
should have methods for the connection object.
There were two other tweaks to the exported API, but these are less likely to affect anyone.
partial_eval() got a new API: now use connection +
variable names, rather than a
tbl. This makes testing considerably easier.
translate_sql_q() has been renamed to
Also note that the SQL generation generics now have a default method, instead of methods for DBIConnection and NULL.
Avoiding segfaults in presence of
raw columns (#1803, #1817, @krlmlr).
arrange() fails gracefully on list columns (#1489) and matrices
(#1870, #1945, @krlmlr).
count() now adds additional grouping variables, rather than overriding
count() can now count a variable
n (#1633). Weighted
The progress bar in
do() is now updated at most 20 times per second,
avoiding unnecessary redraws (#1734, @mkuhn)
distinct() doesn't crash when given a 0-column data frame (#1437).
filter() throws an error if you supply any named arguments. This is usually
filter(df, x = 1) instead of
filter(df, x == 1) (#1529).
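The following sketch illustrates the difference between the two calls. The data frame `df` is made up for the example, and the error is caught with `tryCatch()` so the script keeps running.

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 1))

# A named argument (single `=`) is rejected instead of being silently misread.
named_result <- tryCatch(
  filter(df, x = 1),
  error = function(e) "error"
)

# The intended comparison uses `==`.
matching <- filter(df, x == 1)
```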
summarise() correctly coerces factors with different levels (#1678),
handles min/max of already summarised variable (#1622), and
supports data frames as columns (#1425).
select() now informs you that it adds missing grouping variables
(#1511). It works even if the grouping variable has a non-syntactic name
(#1138). Negating a failed match (e.g.
returns all columns, instead of no columns (#1176)
select() helpers are now exported and have their own
one_of() gives a useful error message if
variables names are not found in data frame (#1407).
The naming behaviour of
mutate_each() has been
tweaked so that you can force inclusion of both the function and the
summarise_each(mtcars, funs(mean = mean), everything())
mutate() handles factors that are all
NA (#1645), or have different
levels in different groups (#1414). It disambiguates
and silently promotes groups that only contain
NA (#1463). It deep copies
data in list columns (#1643), and correctly fails on incompatible columns
mutate() on grouped data no longer drops grouping attributes
rowwise() mutate gives expected results (#1381).
one_of() tolerates unknown variables in
vars, but warns (#1848, @jennybc).
print.grouped_df() passes on
slice() correctly handles grouped attributes (#1405).
ungroup() generic gains
bind_cols() matches the behaviour of
bind_rows() and ignores
inputs (#1148). It also handles
POSIXcts with integer base type (#1402).
bind_rows() handles 0-length named lists (#1515), promotes factors to
characters (#1538), and warns when binding factor and character (#1485).
bind_rows() is more flexible in the way it can accept data frames,
lists, list of data frames, and list of lists (#1389).
POSIXlt columns (#1875, @krlmlr).
bind_rows() infer classes and grouping information
from the first data frame (#1692).
grouped_df() methods that make it harder to
create corrupt data frames (#1385). You should still prefer
Joins now use correct class when joining on
(#1582, @joel23888), and consider time zones (#819). Joins handle a
that is empty (#1496), or has duplicates (#1192). Suffixes grow progressively
to avoid creating repeated column names (#1460). Joins on string columns
should be substantially faster (#1386). Extra attributes are ok if they are
identical (#1636). Joins work correctly when factor levels are not equal
(#1712, #1559). Anti- and semi-joins give correct result when by variable
is a factor (#1571), but warn if factor levels are inconsistent (#2741).
A clear error message is given for joins where an
by contains unavailable columns (#1928, #1932).
Warnings about join column inconsistencies now contain the column names
full_join() gain a
suffix argument which allows you to control what suffix duplicated variable
names receive (#1296).
Set operations (
union() etc) respect coercion rules
setdiff() handles factors with
NA levels (#1526).
There were a number of fixes to enable joining of data frames that don't
have the same encoding of column names (#1513), including working around
bug 16885 regarding
match() in R 3.3.0 (#1806, #1810,
combine() silently drops
NULL inputs (#1596).
cummean() is more stable against floating point errors (#1387).
lag() received a considerable overhaul. They are more
careful about more complicated expressions (#1588), and fall back more
readily to pure R evaluation (#1411). They behave correctly in
(#1434) and handle default values for string columns.
max() handle empty sets (#1481).
n_distinct() uses multiple arguments for data frames (#1084), falls back to R
evaluation when needed (#1657), reverting decision made in (#567).
Passing no arguments gives an error (#1957, #1959, @krlmlr).
nth() now supports negative indices to select from end, e.g.
selects the 2nd value from the end of
top_n() can now also select bottom
n values by passing a negative value
n (#1008, #1352).
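A small sketch of negative positions in `nth()` and negative `n` in `top_n()`. The vector and data frame are invented for illustration; `top_n()` is superseded in recent dplyr releases but is still exported.

```r
library(dplyr)

x <- c(10, 20, 30, 40)

# A negative index counts from the end: -2 is the second-to-last value.
second_last <- nth(x, -2)

scores <- data.frame(id = c("a", "b", "c", "d"), score = c(3, 1, 4, 2))

# A negative n selects the bottom rows by `score` instead of the top rows.
lowest <- top_n(scores, -2, score)
```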
Hybrid evaluation leaves formulas untouched (#1447).
Until now, dplyr's support for non-UTF8 encodings has been rather shaky. This release brings a number of improvements to fix these problems: it's probably not perfect, but should be a lot better than the previous version. This includes fixes to
distinct() (#1179), and joins (#1315).
print.tbl_df() also received a fix for strings with invalid encodings (#851).
frame_data() provides a means for constructing
a simple row-wise language. (#1358, @kevinushey)
all.equal() no longer runs all outputs together (#1130).
as_data_frame() gives better error message with NA column names (#1101).
[.tbl_df is more careful about subsetting column names (#1245).
mutate() work on empty data frames (#1142).
summarise() preserve data frame
meta attributes (#1064).
bind_cols() accept lists (#1104): during initial data
cleaning you no longer need to convert lists to data frames, but can
instead feed them to
bind_rows() gains a
.id argument. When supplied, it creates a
new column that gives the name of each data frame (#1337, @lionel-).
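For instance, with a named list of data frames (the names `first` and `second` and the column name `source` are arbitrary choices for this sketch):

```r
library(dplyr)

a <- data.frame(x = 1:2)
b <- data.frame(x = 3L)

# The list names become the values of the new identifier column.
combined <- bind_rows(list(first = a, second = b), .id = "source")
```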
bind_rows() respects the
ordered attribute of factors (#1112), and
does better at comparing
POSIXcts (#1125). The
tz attribute is ignored
when determining if two
POSIXct vectors are comparable. If the
all inputs is the same, it's used, otherwise it's set to
data_frame() always produces a
tbl_df (#1151, @kevinushey)
filter(x, TRUE, TRUE) now just returns
it doesn't internally modify the first argument (#971), and
it now works with rowwise data (#1099). It once again works with
data tables (#906).
glimpse() also prints out the number of variables in addition to the number
of observations (@ilarischeinin, #988).
Joins handle matrix columns better (#1230), and can join
with heterogeneous representations (some
Dates are integers, while other
are numeric). This also improves
cume_dist() so that missing values no longer
affect denominator (#1132).
print.tbl_df() now displays the class for all variables, not just those
that don't fit on the screen (#1276). It also displays duplicated column
names correctly (#1159).
print.grouped_df() now tells you how many groups there are.
mutate() can set to
NULL the first column (used to segfault, #1329) and
it better protects intermediary results (avoiding random segfaults, #1231).
mutate() on grouped data handles the special case where for the first few
groups, the result consists of a
logical vector with only
NA. This can
happen when the condition of an
ifelse is an all
NA logical vector (#958).
mutate.rowwise_df() handles factors (#886) and correctly handles
0-row inputs (#1300).
n_distinct() gains an
na_rm argument (#1052).
Progress bar used by
do() now respects global option
dplyr.show_progress (default is TRUE) so you can turn it off globally
(@jimhester #1264, #1226).
summarise() handles expressions that return heterogeneous outputs, e.g.
median(), which sometimes returns an integer, and other times a
slice() silently drops columns corresponding to an NA (#1235).
ungroup.rowwise_df() gives a
More explicit duplicated column name error message (#996).
When "," is already being used as the decimal point (
use "." as the thousands separator when printing out formatted numbers
build_sql rather than
Improved handling of
n_distinct(x) is translated to
COUNT(DISTINCT(x)) (@skparkes, #873).
print(n = Inf) now works for remote sources (#1310).
Hybrid evaluation does not take place for objects with a class (#1237).
$ handling (#1134).
Simplified code for
lag() and make sure they work properly on
factors (#955). Both respect the
default argument (#915).
mutate can set to
NULL the first column (used to segfault, #1329).
filter on grouped data handles indices correctly (#880).
sum() issues a warning about integer overflow (#1108).
This is a minor release containing fixes for a number of crashes and issues identified by R CMD CHECK. There is one new "feature": dplyr no longer complains about unrecognised attributes, and instead just copies them over to the output.
lead() for grouped data were confused about indices and therefore
produced wrong results (#925, #937).
lag() once again overrides
instead of just the default method
lag.default(). This is necessary due to
changes in R CMD check. To use the lag function provided by another package,
Fixed a number of memory issues identified by valgrind.
Improved performance when working with large number of columns (#879).
Lists-cols that contain data frames now print a slightly nicer summary (#1147)
Set operations give more useful error message on incompatible data frames (#903).
all.equal() gives the correct result when
all.equal() correctly handles character missing values (#1095).
bind_cols() always produces a
bind_rows() gains a test for a form of data frame corruption (#1074).
summarise() now handles complex columns (#933).
Workaround for using the constructor of
DataFrame on an unprotected object
Improved performance when working with large number of columns (#879).
add_rownames() turns row names into an explicit variable (#639).
as_data_frame() efficiently coerces a list into a data frame (#749).
bind_cols() efficiently bind a list of data frames by
row or column.
combine() applies the same coercion rules to vectors
(it works like
unlist() but is consistent with the
right_join() (include all rows in
y, and matching rows in
full_join() (include all rows in
y) complete the family of
mutating joins (#96).
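A brief sketch of the two joins on made-up tables `x` and `y` that share a single `key` column:

```r
library(dplyr)

x <- data.frame(key = c(1, 2), left_val = c("a", "b"))
y <- data.frame(key = c(2, 3), right_val = c("B", "C"))

# right_join(): every row of y, plus matching columns from x (NA if no match).
right <- right_join(x, y, by = "key")

# full_join(): every row of x and every row of y.
full <- full_join(x, y, by = "key")
```

Here `right` has keys 2 and 3, with a missing `left_val` for key 3, while `full` covers keys 1, 2 and 3.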
group_indices() computes a unique integer id for each group (#771). It
can be called on a grouped_df without any arguments or on a data frame
with same arguments as
vignette("data_frames") describes dplyr functions that make it easier
and faster to create and coerce data frames. It subsumes the old
vignette("two-table") describes how two-table verbs work in dplyr.
tbl_df()) now explicitly
forbid columns that are data frames or matrices (#775). All columns
must be either a 1d atomic vector or a 1d list.
do() uses lazyeval to correctly evaluate its arguments in the correct
environment (#744), and new
do_() is the SE equivalent of
You can modify grouped data in place: this is probably a bad idea but it's
sometimes convenient (#737).
do() on grouped data tables now passes in all
columns (not all columns except grouping vars) (#735, thanks to @kismsu).
do() with database tables no longer potentially includes grouping
variables twice (#673). Finally,
do() gives more consistent outputs when
there are no rows or no groups (#625).
last() preserve factors, dates and times (#509).
Overhaul of single table verbs for data.table backend. They now all use
a consistent (and simpler) code base. This ensures that (e.g.)
now works in all verbs (#579).
*_join(), you can now name only those variables that are different between
the two tables, e.g.
inner_join(x, y, c("a", "b", "c" = "d")) (#682).
If non-join columns are the same, dplyr will add
suffixes to distinguish the source (#655).
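The following sketch shows a partially named `by` vector and the automatic suffixes. Both tables and their column names are invented for the example.

```r
library(dplyr)

x <- data.frame(a = 1:3, b = c("u", "v", "w"), c = c(10, 20, 30),
                val = c("x1", "x2", "x3"))
y <- data.frame(a = 1:3, b = c("u", "v", "z"), d = c(10, 20, 30),
                val = c("y1", "y2", "y3"))

# "a" and "b" have the same name in both tables; x$c is matched against y$d.
joined <- inner_join(x, y, by = c("a", "b", "c" = "d"))

# The shared non-join column `val` is disambiguated with suffixes.
names(joined)
```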
mutate() handles complex vectors (#436) and forbids
(instead of crashing) (#670).
select() now implements a more sophisticated algorithm so if you're
doing multiples includes and excludes with and without names, you're more
likely to get what you expect (#644). You'll also get a better error
message if you supply an input that doesn't resolve to an integer
column position (#643).
Printing has received a number of small tweaks. All
print() methods
invisibly return their input so you can interleave
print() statements into a
pipeline to see interim results.
print() will print column names of 0 row data
frames (#652), and will never print more than 20 rows (i.e.
options(dplyr.print_max) is now 20), not 100 (#710). Row names are now
never printed since no dplyr method is guaranteed to preserve them (#669).
glimpse() prints the number of observations (#692)
type_sum() gains a data frame method.
summarise() handles list output columns (#832)
slice() works for data tables (#717). Documentation clarifies that
slice can't work with relational databases, and the examples show
how to achieve the same results using
dplyr now requires RSQLite >= 1.0. This shouldn't affect your code in any way (except that RSQLite now doesn't need to be attached) but does simplify the internals (#622).
Functions that need to combine multiple results into a single column
summarise()) are more careful about
Joining factors with the same levels in the same order preserves the original levels (#675). Joining factors with non-identical levels generates a warning and coerces to character (#684). Joining a character to a factor (or vice versa) generates a warning and coerces to character. Avoid these warnings by ensuring your data is compatible before joining.
rbind_list() will throw an error if you attempt to combine an integer and
rbind()ing a column full of
NAs is allowed and just
collects the appropriate missing value for the column type being collected
summarise() is more careful about
NA, e.g. the decision on the result
type will be delayed until the first non NA value is returned (#599).
It will complain about loss of precision coercions, which can happen for
expressions that return integers for some groups and doubles for others
A number of functions gained new or improved hybrid handlers:
%in% (#126). That means
when you use these functions in a dplyr verb, we handle them in C++, rather
than calling back to R, and hence improving performance.
min_rank() correctly handles
NaN values (#726). Hybrid
nth() falls back to R evaluation when
n is not
a length one integer or numeric, e.g. when it's an expression (#734).
percent_rank() now preserve NAs (#774)
filter returns its input when it has no rows or no columns (#782).
Join functions keep attributes (e.g. time zone information) from the
left argument for
Date objects (#819), and
only warn once about each incompatibility (#798).
[.tbl_df correctly computes row names for 0-column data frames, avoiding
problems with xtable (#656).
[.grouped_df will silently drop grouping
if you don't include the grouping columns (#733).
data_frame() now acts correctly if the first argument is a vector to be
recycled. (#680 thanks @jimhester)
filter.data.table() works if the table has a variable called "V1" (#615).
*_join() keeps columns in original order (#684).
Joining a factor to a character vector doesn't segfault (#688).
*_join functions can now deal with multiple encodings (#769),
and correctly name results (#855).
*_join.data.table() works when data.table isn't attached (#786).
group_by() on a data table preserves original order of the rows (#623).
group_by() supports variables with more than 39 characters thanks to
a fix in lazyeval (#705). It gives a meaningful error message when a variable
is not found in the data frame (#716).
vars to be a list of symbols (#665).
min(.,na.rm = TRUE) works with
Dates built on numeric vectors (#755).
rename_() generic gets missing
.dots argument (#708).
cume_dist() handle data frames with 0 rows (#762). They all preserve
missing values (#774).
row_number() doesn't segfault when giving an external
variable with the wrong number of variables (#781).
group_indices handles the edge case when there are no variables (#867).
NAs introduced by coercion to integer range on 32-bit Windows (#2708).
between() vector function efficiently determines if numeric values fall
in a range, and is translated to special form for SQL (#503).
count() makes it even easier to do (weighted) counts (#358).
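A minimal sketch of both functions on local data (the data frame `df` and its columns `g` and `w` are hypothetical):

```r
library(dplyr)

# between(): TRUE where the value lies in the closed range [2, 8].
in_range <- between(c(1, 5, 10), 2, 8)

df <- data.frame(g = c("a", "a", "b"), w = c(1, 2, 5))

# Plain count: number of rows per group.
plain <- count(df, g)

# Weighted count: sum of `w` per group.
weighted <- count(df, g, wt = w)
```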
data_frame() by @kevinushey is a nicer way of creating data frames.
It never coerces column types (no more
stringsAsFactors = FALSE!),
never munges column names, and never adds row names. You can use previously
defined columns to compute new columns (#376).
distinct() returns distinct (unique) rows of a tbl (#97). Supply
additional variables to return the first row for each unique combination
setdiff() now have methods
for data frames, data tables and SQL database tables (#93). They pass their
arguments down to the base functions, which will ensure they raise errors if
you pass in too many arguments.
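As a sketch, the set operations applied to two small, made-up data frames with the same columns:

```r
library(dplyr)

a <- data.frame(x = 1:3)
b <- data.frame(x = 2:4)

common  <- intersect(a, b)  # rows present in both a and b
either  <- union(a, b)      # distinct rows from a and b combined
only_a  <- setdiff(a, b)    # rows in a that are not in b
```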
now allow you to join on different variables in
y tables by
supplying a named vector to
by. For example,
by = c("a" = "b") joins
n_groups() function tells you how many groups are in a tbl. It returns
1 for ungrouped data. (#477)
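For example, using the built-in `mtcars` data, which has three distinct `cyl` values:

```r
library(dplyr)

ungrouped_count <- n_groups(mtcars)
grouped_count   <- n_groups(group_by(mtcars, cyl))
```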
transmute() works like
mutate() but drops all variables that you didn't
explicitly refer to (#302).
rename() makes it easy to rename variables - it works similarly to
select() but it preserves columns that you didn't otherwise touch.
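A quick sketch contrasting the two verbs on a made-up data frame:

```r
library(dplyr)

df <- data.frame(x = 1:3, y = c(2, 4, 6))

# transmute(): keeps only the variables created in the call.
ratios <- transmute(df, ratio = y / x)

# rename(): changes a name and keeps every other column untouched.
renamed <- rename(df, total = y)
```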
slice() allows you to select rows by position (#226). It includes
positive integers, drops negative integers and you can use expressions like
You can now program with dplyr - every function that does non-standard
evaluation (NSE) has a standard evaluation (SE) version ending in
This is powered by the new lazyeval package which provides all the tools
needed to implement NSE consistently and correctly.
vignette("nse") for full details.
regroup() is deprecated. Please use the more flexible
mutate_each_q() are deprecated. Please use
funs_q has been replaced with
%.% has been deprecated: please use
filter.numeric() removed. Need to figure out how to reimplement with
new lazy eval system.
Progress refclass is no longer exported to avoid conflicts with shiny.
src_monetdb() is now implemented in MonetDB.R, not dplyr.
explain_sql() and matching global options
dplyr.explain_sql have been removed. Instead use
Main verbs now have individual documentation pages (#519).
%>% is simply re-exported from magrittr, instead of creating a local copy
(#496, thanks to @jimhester)
Examples now use
nycflights13 instead of
hflights because the variables
have better names and there are a few interlinked tables (#562).
nycflights13 are (once again) suggested packages. This means many examples
will not work unless you explicitly install them with
install.packages(c("Lahman", "nycflights13")) (#508). dplyr now depends on
Lahman 3.0.1. A number of examples have been updated to reflect modified
field names (#586).
do() now displays the progress bar only when used in interactive prompts
and not when knitting (#428, @jimhester).
glimpse() now prints a trailing new line (#590).
group_by() has more consistent behaviour when grouping by constants:
it creates a new column with that value (#410). It renames grouping
variables (#410). The first argument is now
.data so you can create
new groups with name x (#534).
Now instead of overriding
lag(), dplyr overrides
which should avoid clobbering lag methods added by other packages.
mutate(data, a = NULL) removes the variable
a from the returned
trunc_mat() and hence
print.tbl_df() and friends get a
to control the default output width. Set
options(dplyr.width = Inf) to
always show all columns (#589).
one_of() selector: this allows you to select variables
provided by a character vector (#396). It fails immediately if you give an
empty pattern to
(#481, @leondutoit). Fixed buglet in
select() so that you can now create
Switched from RC to R6.
top_n() work consistently: neither accidentally
evaluates the
wt param. (#426, @mnel)
rename handles grouped data (#640).
Correct SQL generation for
paste() when used with the collapse parameter
targeting a Postgres database. (@rbdixon, #1357)
The db backend system has been completely overhauled in order to make
it possible to add backends in other packages, and to support a much
wider range of databases. See
vignette("new-sql-backend") for instructions
on how to create your own (#568).
src_mysql() gains a method for
mutate() creates a new variable that uses a window function,
automatically wrap the result in a subquery (#484).
Correct SQL generation for
order_by() now works in conjunction with window functions in databases
that support them.
All verbs now understand how to work with
difftime() (#390) and
AsIs (#453) objects. They all check that colnames are unique (#483), and
are more robust when columns are not present (#348, #569, #600).
Hybrid evaluation bugs fixed:
Call substitution stopped too early when a sub expression contained a
cumall() properly handle
nth() now correctly preserve the class when using dates, times and
no longer substitutes within
order_by() needs to do
its own NSE (#169).
[.tbl_df always returns a tbl_df (i.e.
drop = FALSE is the default)
[.grouped_df preserves important output attributes (#398).
arrange() keeps the grouping structure of grouped data (#491, #605),
and preserves input classes (#563).
contains() accidentally matched regular expressions, now it passes
fixed = TRUE to
filter() asserts all variables are white listed (#566).
mutate() makes a
rowwise_df when given a
tbl_df objects instead of raw
select() doesn't match any variables, it returns a 0-column data frame,
instead of the original (#498). It no longer fails when some columns
are not named (#492)
sample_frac() methods for data.frames exported.
A grouped data frame may have 0 groups (#486). Grouped df objects
gain some basic validity checking, which should prevent some crashes
related to corrupt
grouped_df objects made by
More coherence when joining columns of compatible but different types, e.g. when joining a character vector and a factor (#455), or a numeric and integer (#450)
mutate() works on zero-row grouped data frames, and
with list columns (#555).
LazySubset was confused about input data size (#452).
n_distinct() is stricter about its inputs: it requires one symbol
which must be from the data frame (#567).
rbind_*() handle data frames with 0 rows (#597). They fill character
vector columns with
NA instead of blanks (#595). They work with
list columns (#463).
Improved handling of encoding for column names (#636).
Improved handling of hybrid evaluation re $ and @ (#645).
Fix major omission in
grouped_dt() methods - I was
accidentally doing a deep copy on every result :(
group_by() now retain over-allocation when working with
data.tables (#475, @arunsrinivasan).
Joining two data.tables now correctly dispatches to data table methods, and the result is a data table (#470)
summarise.tbl_cube() works with single grouping variable (#480).
dplyr now imports
%>% from magrittr (#330). I recommend that you use this instead of
%.% because it is easier to type (since you can hold down the shift key) and is more flexible. With
%>%, you can control which argument on the RHS receives the LHS by using the pronoun
.. This makes
%>% more useful with base R functions because they don't always take the data frame as the first argument. For example you could pipe
mtcars %>% xtabs( ~ cyl + vs, data = .)
Thanks to @smbache for the excellent magrittr package. dplyr only provides
%>% from magrittr, but it contains many other useful functions. To use them, load
library(magrittr). For more details, see
%.% will be deprecated in a future version of dplyr, but it won't happen for a while. I've also deprecated
chain() to encourage a single style of dplyr usage: please use
do() has been completely overhauled. There are now two ways to use it, either with multiple named arguments or a single unnamed argument.
do() is equivalent to
plyr::dlply, except it always returns a data frame.
If you use named arguments, each argument becomes a list-variable in the output. A list-variable can contain any arbitrary R object so it's particularly well suited for storing models.
library(dplyr)
models <- mtcars %>% group_by(cyl) %>% do(lm = lm(mpg ~ wt, data = .))
models %>% summarise(rsq = summary(lm)$r.squared)
If you use an unnamed argument, the result should be a data frame. This allows you to apply arbitrary functions to each group.
mtcars %>% group_by(cyl) %>% do(head(., 1))
Note the use of the
. pronoun to refer to the data in the current group.
do() also has an automatic progress bar. It appears if the computation takes longer than 5 seconds and lets you know (approximately) how much longer the job will take to complete.
dplyr 0.2 adds three new verbs:
glimpse() makes it possible to see all the columns in a tbl,
displaying as much data for each variable as can be fit on a single line.
sample_n() randomly samples a fixed number of rows from a tbl;
sample_frac() randomly samples a fixed fraction of rows. Only works
for local data frames and data tables (#202).
mutate_each() make it easy to apply one or more
functions to multiple columns in a tbl (#178).
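As a rough illustration of the new verbs, the sketch below calls glimpse(), sample_n() and sample_frac() on the built-in mtcars data. It assumes an installed dplyr version that still exports these functions; the set.seed() call and the variable names ten_rows and quarter are only there for the example.

```r
library(dplyr)

# Compact overview: one line per column, showing as many values as fit.
glimpse(mtcars)

set.seed(1)  # assumption: fixed seed only so the random samples are reproducible

# Fixed number of rows.
ten_rows <- sample_n(mtcars, 10)

# Fixed fraction of rows (mtcars has 32 rows, so 25% is 8 rows).
quarter <- sample_frac(mtcars, 0.25)

nrow(ten_rows)
nrow(quarter)
```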
If you load plyr after dplyr, you'll get a message suggesting that you load plyr first (#347).
as.tbl_cube() gains a method for matrices (#359, @paulstaab)
temporary argument so you can control whether the
results are temporary or permanent (#382, @cpsievert)
group_by() now defaults to
add = FALSE so that it sets the grouping
variables rather than adding to the existing list. I think this is how
most people expected
group_by to work anyway, so it's unlikely to
cause problems (#385).
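A minimal sketch of this default, assuming a dplyr version that provides group_vars() for inspecting the grouping (it was added in a later release than the one described here). The argument that switches to adding groups is deliberately not used, since its name has changed across versions.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Grouping an already grouped tbl sets the grouping variables
# rather than appending to the existing ones.
by_gear <- group_by(by_cyl, gear)

group_vars(by_cyl)
group_vars(by_gear)
```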
Support for MonetDB tables with
(#8, thanks to @hannesmuehleisen).
memory vignette which discusses how dplyr minimises memory usage
for local data frames (#198).
new-sql-backend vignette which discusses how to add a new
SQL backend/source to dplyr.
changes() output more clearly distinguishes which columns were added or
explain() is now generic.
dplyr is more careful when setting the keys of data tables, so it never accidentally modifies an object that it doesn't own. It also avoids unnecessary key setting which negatively affected performance. (#193, #255).
print() methods for
n argument to
control the number of rows printed (#362). They also work better when you have
columns containing lists of complex objects.
row_number() can be called without arguments, in which case it returns
the same as
"comment" attribute is allowed (white listed) as well as names (#346).
hybrid versions of
na.rm argument (#168). This should yield substantial
performance improvements for those functions.
Special case for call to
arrange() on a grouped data frame with no arguments. (#369)
Code adapted to Rcpp > 0.11.1
DataDots class protects against missing variables in verbs (#314),
including the case where
... is missing. (#338)
all.equal.data.frame from base is no longer bypassed. We now have
all.equal.tbl_dt methods (#332).
arrange() correctly handles NA in numeric vectors (#331) and 0 row
data frames (#289).
copy_to.src_mysql() now works on Windows (#323).
*_join() doesn't reorder column names (#324).
rbind_all() is stricter and only accepts a list of data frames (#288).
rbind_* propagates time zone information for
POSIXct columns (#298).
rbind_* is less strict about type promotion. The numeric
collection of integer and logical vectors. The integer
Collecter also collects
logical values (#321).
sum correctly handles integer (under/over)flow (#308).
summarise() checks consistency of outputs (#300) and drops
attribute of output columns (#357).
join functions throw error instead of crashing when there are no common variables between the data frames, and also give a better error message when only one data frame has a by variable (#371).
n rows instead of
n - 1 (@leondutoit, #367).
SQL translation always evaluates subsetting operators (
select() now renames variables in remote sql tbls (#317) and
implicitly adds grouping variables (#170).
grouped_df_impl function errors if there are no variables to group by (#398).
n_distinct did not treat NA correctly in the numeric case (#384).
Some compiler warnings triggered by -Wall or -pedantic have been eliminated.
group_by only creates one group for NA (#401).
Hybrid evaluator did not evaluate expression in correct environment (#403).
select() actually renames columns in a data table (#284).
rbind_list() now handles missing values in factors (#279).
SQL joins now work better if names are duplicated in both x and y tables (#310).
Builds against Rcpp 0.11.1
select() correctly works with the vars attribute (#309).
Internal code is stricter when deciding if a data frame is grouped (#308): this avoids a number of situations which previously caused problems.
More data frame joins work with missing values in keys (#306).
select() is substantially more powerful. You can use named arguments to
rename existing variables, and new functions
num_range() to select variables based on
their names. It now also makes a shallow copy, substantially reducing its
memory impact (#158, #172, #192, #232).
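The renaming and name-based selection described above might look like the following sketch. The data frame df and the new column name miles_per_gallon are made up for the example; only select() and num_range() come from dplyr.

```r
library(dplyr)

# Named arguments rename while selecting: new_name = old_name.
renamed <- select(mtcars, miles_per_gallon = mpg, cyl)
names(renamed)

# num_range() matches a prefix followed by numbers from the given range.
df <- data.frame(x1 = 1, x2 = 2, x3 = 3, y = 4)  # hypothetical data
picked <- select(df, num_range("x", 1:2))
names(picked)
```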
summarize() added as alias for
summarise() for people from countries
that don't spell things correctly ;) (#245)
filter() now fails when given anything other than a logical vector, and
correctly handles missing values (#249).
stats::filter() so you can continue to use
filter() function with
numeric inputs (#264).
summarise() correctly uses newly created variables (#259).
mutate() correctly propagates attributes (#265) and
correctly mutates the same variable repeatedly (#243).
lag() preserve attributes, so they now work with
dates, times and factors (#166).
n() never accepts arguments (#223).
row_number() gives correct results (#227).
rbind_all() silently ignores data frames with 0 rows or 0 columns (#274).
group_by() orders the result (#242). It also checks that columns
are of supported types (#233, #276).
The hybrid evaluator did not handle some expressions correctly; for example, in
if(n() > 5) 1 else 2 the subexpression
n() was not
substituted correctly. It also correctly processes
arrange() checks that all columns are of supported types (#266). It also
handles list columns (#282).
Working towards Solaris compatibility.
Benchmarking vignette temporarily disabled due to microbenchmark problems reported by BDR.
changes() functions which provide more information
about how data frames are stored in memory so that you can see what
sort argument to sort output so highest counts
come first (#173).
as.data.frame.tbl_df() now only
make shallow copies of their inputs (#191).
benchmark-baseball vignette now contains fairer (including grouping
times) comparisons with
filter() (#221) and
summarise() (#194) correctly propagate attributes.
summarise() throws an error when asked to summarise an unknown variable
instead of crashing (#208).
group_by() handles factors with missing values (#183).
filter() handles scalar results (#217) and better handles scoping, e.g.
filter(., variable) where
variable is defined in the function that calls
filter. It also handles
F as aliases to
if there are no
F variables in the data or in the scope.
select.grouped_df fails when the grouping variables are not included
in the selected variables (#170)
all.equal.data.frame() handles a corner case where the data frame has
NULL names (#217)
mutate() gives informative error message on unsupported types (#179)
dplyr source package no longer includes pandas benchmark, reducing download size from 2.8 MB to 0.5 MB.