A tool for exploring correlations. It makes it possible to easily perform routine tasks when exploring correlation matrices such as ignoring the diagonal, focusing on the correlations of certain variables against others, or rearranging and visualizing the matrix in terms of the strength of the correlations.
corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse. This, along with the primary corrr functions, is represented below:
You can install the released version from CRAN:
install.packages("corrr")
or the development version from GitHub:
install.packages("devtools") # run this line if devtools is not installed
devtools::install_github("drsimonj/corrr")
Using corrr typically starts with correlate(), which acts like the base correlation function cor(). It differs by defaulting to pairwise deletion, and returning a correlation data frame (cor_df) of the following structure:
A tbl with an additional class, cor_df.
A "rowname" column holding the variable names.
Diagonal values set to missing (NA) so they can be ignored.

The corrr API is designed with data pipelines in mind (e.g., to use %>% from the magrittr package). After correlate(), the primary corrr functions take a cor_df as their first argument, and return a cor_df or tbl (or output like a plot). These functions serve one of three purposes:
Internal changes (cor_df out):
shave() the upper or lower triangle (set to NA).
rearrange() the columns and rows based on correlation strengths.

Reshape structure (tbl or cor_df out):
focus() on select columns and rows.
stretch() into a long format.

Output/visualisations (console/plot out):
fashion() the correlations for pretty printing.
rplot() the correlations with shapes in place of the values.
network_plot() the correlations in a network.

library(MASS)
library(corrr)
set.seed(1)

# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))

# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA

# Correlate
x <- correlate(d)
class(x)
#> [1] "cor_df"     "tbl_df"     "tbl"        "data.frame"
x
#> # A tibble: 6 x 7
#>   rowname        v1       v2       v3        v4      v5      v6
#>   <chr>       <dbl>    <dbl>    <dbl>     <dbl>   <dbl>   <dbl>
#> 1 v1      NA         0.710    0.709    0.000195 0.0214  -0.0435
#> 2 v2       0.710    NA        0.697   -0.0133   0.00928 -0.0338
#> 3 v3       0.709     0.697   NA       -0.0253   0.00109 -0.0201
#> 4 v4       0.000195 -0.0133  -0.0253  NA        0.421    0.442
#> 5 v5       0.0214    0.00928  0.00109  0.421   NA        0.425
#> 6 v6      -0.0435   -0.0338  -0.0201   0.442    0.425   NA
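To make the comparison between correlate() and cor() concrete, here is a minimal sketch. It uses the built-in airquality data, chosen here only because it contains missing values, and assumes that as_matrix() accepts diagonal = 1 and that correlate() accepts quiet = TRUE, as the release notes further down describe.

```r
library(corrr)

# Base R: pairwise deletion has to be requested explicitly
base_cor <- cor(airquality, use = "pairwise.complete.obs")

# corrr: pairwise deletion is the default; quiet = TRUE suppresses the message
cdf <- correlate(airquality, quiet = TRUE)

# Convert the cor_df back to a matrix, restoring 1s on the diagonal
m <- as_matrix(cdf, diagonal = 1)

# Compare the values only (dimnames dropped on both sides)
same_values <- isTRUE(all.equal(unname(m), unname(base_cor)))
same_values
```

If the two approaches agree, same_values should be TRUE; the practical difference is the shape of the result (a data frame with NA on the diagonal rather than a matrix).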
As a tbl, we can use functions from data frame packages like dplyr, tidyr, ggplot2:

library(dplyr)

# Filter rows by correlation size
x %>% filter(v1 > .6)
#> # A tibble: 2 x 7
#>   rowname    v1     v2     v3      v4      v5      v6
#>   <chr>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
#> 1 v2      0.710 NA      0.697 -0.0133 0.00928 -0.0338
#> 2 v3      0.709  0.697 NA     -0.0253 0.00109 -0.0201
corrr functions work in pipelines (cor_df in; cor_df or tbl out):

x <- datasets::mtcars %>%
  correlate() %>%                      # Create correlation data frame (cor_df)
  focus(-cyl, -vs, mirror = TRUE) %>%  # Focus on cor_df without 'cyl' and 'vs'
  rearrange() %>%                      # rearrange by correlations
  shave()                              # Shave off the upper triangle for a clean result
#>
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'

fashion(x)
#>   rowname   am drat gear   wt disp  mpg   hp qsec carb
#> 1      am
#> 2    drat  .71
#> 3    gear  .79  .70
#> 4      wt -.69 -.71 -.58
#> 5    disp -.59 -.71 -.56  .89
#> 6     mpg  .60  .68  .48 -.87 -.85
#> 7      hp -.24 -.45 -.13  .66  .79 -.78
#> 8    qsec -.23  .09 -.21 -.17 -.43  .42 -.71
#> 9    carb  .06 -.09  .27  .43  .39 -.55  .75 -.66

rplot(x)
datasets::airquality %>%
  correlate() %>%
  network_plot(min_cor = .2)
#>
#> Correlation method: 'pearson'
#> Missing treated using: 'pairwise.complete.obs'
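The examples above do not exercise stretch(), so here is a small sketch of the reshaping functions on their own. It assumes the na.rm argument of stretch() and the quiet argument of correlate() behave as the release notes describe; the object names are illustrative.

```r
library(corrr)

cdf <- correlate(mtcars, quiet = TRUE)

# focus() keeps the selected variables as columns and the remaining ones as rows
focused <- focus(cdf, mpg, hp)
focused

# shave() blanks the upper triangle; stretch() then gives one row per
# remaining pair once the NA entries are dropped
long <- stretch(shave(cdf), na.rm = TRUE)
head(long)
nrow(long)
```

Because the diagonal and upper triangle are NA after shave(), the long table should hold one row per unique pair of variables, which is convenient for sorting or filtering correlations with dplyr.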
Improves support for tbl_sql() objects
Switches correlation calculation for tbl_spark() tables to sparklyr::ml_corr()
Adds package level doc (@jsta, #66)
Fixes typo on error message (@jsta)
Removes Database vignette. Plan to re-add later on (#76)
Minor updates to Using corrr vignette
Fixes test and CRAN issues by removing Ops.cor_df().
Designates Edgar Ruiz as the new package maintainer
The diagonal argument of as_matrix and as_matrix.cor_df is now an optional argument rather than set to 1 by default (#52)
as_cordf will coerce lists or matrices into correlation data frames if possible.
focus_if enables conditional variable selection.
Arithmetic operators (e.g., + or -) can be used with correlation data frames.
Plotting functions (rplot and network_plot) will attempt to coerce objects to a correlation data frame (via as_cordf) if needed, making it possible to directly use these functions with other square-matrix-like objects.
repel option added to network_plot (default = TRUE).
curved option added to network_plot (default = TRUE).
correlate() now prints a message about the method and use parameters. Can be silenced with quiet = TRUE.
correlate() now supports data frames with a SQL back-end (tbl_sql).
With legend = TRUE (now the default setting), rplot and network_plot generate a single, unlabelled legend referring to the size of the correlations.
correlate() is now an S3 method so that it can adapt to x's object type.
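As a rough illustration of as_cordf() and focus_if() from the notes above, the sketch below coerces a base R correlation matrix and then keeps only the columns containing at least one strong correlation. The 0.8 threshold and the helper name has_strong are arbitrary choices made for this example, not part of corrr.

```r
library(corrr)

# A plain correlation matrix from base R
m <- cor(mtcars)

# Coerce it to a correlation data frame (the diagonal becomes NA)
cdf <- as_cordf(m)
class(cdf)

# Hypothetical predicate: TRUE when a column has any correlation above .8
has_strong <- function(col) any(col > .8, na.rm = TRUE)

# Keep only the columns that satisfy the predicate
strong <- focus_if(cdf, has_strong)
strong
```

The same coercion is what lets rplot() and network_plot() accept square-matrix-like objects directly, according to the notes.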
During the development of this version, ggplot2 v2.2.0 was released. Many changes in the plotting functions have been made to handle new features in the updated version of ggplot2.
Improvements to the package folder structure
fashion() can keep leading zeros with the new argument leading_zeros = TRUE.
New arguments for network_plot() and rplot():
  legend to display a legend mapping correlations to size and colour.
  colours (or colors) to change colours in plot.
network_plot() no longer plots wrong colours if only positive correlations are included.
network_plot() changed to match rplot().
network_plot() the correlations.
focus_() for standard evaluation version of focus().
fashion() will now attempt to work on any object (not just cor_df), making it useful for printing any data frame, matrix, vector, etc.
print_cor argument added to rplot() to overlay the correlations as text.
na_omit argument in stretch() changed to na.rm to match gather_().
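Since fashion() is described above as working on objects other than cor_df, a short sketch of that use may help. The input values are made up for illustration, and the exact formatting of the output may differ between corrr versions.

```r
library(corrr)

# A plain numeric vector: formatted for printing
v <- fashion(c(0.123, -0.5, 0.75))
v

# Keep the leading zeros and use a single decimal place
fashion(c(0.123, -0.5, 0.75), decimals = 1, leading_zeros = TRUE)

# A base R correlation matrix also works
fashion(cor(mtcars[, 1:3]))
```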