DataExplorer automates the data exploration process for data analysis and model building, so that users can focus on understanding data and extracting insights. The package automatically scans through each variable and performs data profiling. Typical graphical techniques are applied to both discrete and continuous features.
Exploratory Data Analysis (EDA) is the initial and an important phase of data analysis. In this phase, analysts and modelers get a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. This R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.
The package can be installed directly from CRAN.
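A minimal sketch of the CRAN route, using base R's `install.packages`. The `requireNamespace` guard and the explicit `repos` mirror are additions made here for illustration (so that the call also works in a non-interactive session); they are assumptions, not part of the package's own instructions.

```r
# Install DataExplorer from CRAN only if it is not already available.
# The mirror URL is an assumption; any CRAN mirror can be substituted.
if (!requireNamespace("DataExplorer", quietly = TRUE)) {
  install.packages("DataExplorer", repos = "https://cloud.r-project.org")
}
```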
However, the latest stable version (if any) can be found on GitHub and installed using:
if (!require(remotes)) install.packages("remotes")
remotes::install_github("boxuancui/DataExplorer")
To install the latest development version, install the develop branch:
if (!require(remotes)) install.packages("remotes")
remotes::install_github("boxuancui/DataExplorer", ref = "develop")
The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals and vignettes for more information.
To get a report for the airquality dataset:
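A minimal sketch of that call, mirroring the diamonds example and relying on `create_report` with its default arguments. It assumes the defaults write the report to `report.html` in the current working directory, and that `rmarkdown` and pandoc are available for rendering.

```r
library(DataExplorer)

# Profile the built-in airquality data frame and render an HTML report.
# Assumption: with default arguments the output file is "report.html"
# in the current working directory.
create_report(airquality)
```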
To get a report for the diamonds dataset with response variable price:
library(DataExplorer)
library(ggplot2)
create_report(diamonds, y = "price")
You may also run all the plotting functions individually for your analysis, e.g.,
library(DataExplorer)
library(ggplot2)

## View basic description for airquality data
introduce(airquality)
plot_intro(airquality)

## View missing value distribution for airquality data
plot_missing(airquality)

## View distribution of all discrete variables
plot_bar(diamonds)
plot_bar(diamonds, with = "price")

## View distribution of all continuous variables
plot_histogram(diamonds)
plot_density(diamonds)

## View quantile-quantile plot of all continuous variables
plot_qq(diamonds)
plot_qq(diamonds, by = "price")

## View overall correlation heatmap
plot_correlation(diamonds)

## View bivariate continuous distribution based on `price`
plot_boxplot(diamonds, by = "price")

## Scatterplot `price` with all other features
plot_scatterplot(diamonds, by = "price")

## Visualize principal component analysis
plot_prcomp(diamonds, maxcat = 5L)
To make quick updates to your data:
library(DataExplorer)
library(ggplot2)

## Group bottom 20% `clarity` by frequency
group_category(diamonds, feature = "clarity", threshold = 0.2, update = TRUE)

## Group bottom 20% `clarity` by `price`
group_category(diamonds, feature = "clarity", threshold = 0.2, measure = "price", update = TRUE)

## Dummify diamonds dataset
dummify(diamonds)
dummify(diamonds, select = "cut")

## Set values for missing observations
df <- data.frame("a" = rnorm(260), "b" = rep(letters, 10))
df[sample.int(260, 50), ] <- NA
set_missing(df, list(0L, "unknown"))

## Drop columns
drop_columns(diamonds, 8:10)
drop_columns(diamonds, "clarity")
See article wiki page.
dummify now works on selected columns.
plot_* are now invisibly returned. As a result, extracted
plot_missing for missing value profiles.
scale. = TRUE to
create_report failure due to zero complete rows.
plot_str when plotting data.frame with more than 100 columns.
create_report failure (specifically from
plot_prcomp to visualize principal component analysis.
plot_correlation as a new function.
introduce for basic metadata.
create_report can now be customized.
plot_bar now supports optional measures (in addition to categorical frequency) using argument
plot_str bug for not supporting S4 objects.
plot_density not working with column names containing spaces.
plot_scatterplot to visualize the relationship of one feature against all others.
plot_boxplot to visualize continuous distributions broken down by another feature.
.Deprecated mode. List of name changes in alphabetical order:
plot_correlation(..., type = "continuous")
plot_correlation(..., type = "discrete")
CorrelationDiscrete into one function, and added an option to view the correlation of all features at once.
quiet is not supplied. In addition, the report directory is printed through
SetNaTo to discrete features.
SetNaTo to quickly reset missing numerical values.
DropVar to quickly drop variables by either name or column position.
CorrelationDiscrete now displays all factor levels instead of the full-rank matrix from
update = TRUE will only work with input data as data.table. However, it is still possible to view the frequency distribution with any input data class, as long as update = FALSE.
GenerateReport now handles data without discrete or continuous features.
NA values will be ignored in
GenerateReport function due to package renaming.
GenerateReport will now print the directory of the report to console.
CollapseCategory to collapse sparse categories for discrete features.
CorrelationDiscrete for not plotting non-factor class.