Preliminary Visualisation of Data

Create preliminary exploratory data visualisations of an entire dataset to identify problems or unexpected features using 'ggplot2'.



How to install

 
devtools::install_github("njtierney/visdat")

What does visdat do?

Initially inspired by csv-fingerprint, visdat helps you visualise a dataframe and "get a look at the data" by displaying the variable classes in a dataframe as a plot with vis_dat, and by giving a brief look into missing data patterns with vis_miss.

The name visdat was chosen because I think it could be integrated with testdat in the future. The idea is that you first visualise your data (visdat), then run tests from testdat to fix the problems you find.

There are two main commands in the visdat package:

  • vis_dat() visualises a dataframe showing you what the classes of the columns are, and also displaying the missing data.

  • vis_miss() visualises just the missing data, and allows for missingness to be clustered and columns rearranged. vis_miss() is similar to missing.pattern.plot from the mi package. Unfortunately missing.pattern.plot is no longer in the mi package (as of 14/02/2016).

You can read more about visdat in the vignette, "using visdat".

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Examples

Using vis_dat()

Let's see what's inside the airquality dataset from base R, which contains information about daily air quality measurements in New York from May to September 1973. More information about the dataset can be found with ?airquality.

 
library(visdat)
 
vis_dat(airquality)

The plot above tells us that R reads this dataset as having numeric and integer values, with some missing data in Ozone and Solar.R. The classes are represented on the legend, and missing data represented by grey. The column/variable names are listed on the x axis.
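What the plot shows can be cross-checked numerically. The following is a minimal sketch that uses only base R (none of these calls are part of visdat):

```r
# Class of each column, as R reads it
col_classes <- vapply(airquality, function(col) class(col)[1], character(1))
col_classes

# Number of missing values in each column
na_counts <- colSums(is.na(airquality))
na_counts
```

Columns with a non-zero count here are the ones that show grey cells in the plot.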

By default, vis_dat sorts the columns according to the type of the data in the vectors. You can turn this off by setting sort_type = FALSE.

 
vis_dat(airquality, 
        sort_type = FALSE)

With many kinds of data

To demonstrate what visdat looks like when you have different kinds of data, we can look at the dataset typical_data, provided within visdat, and created with the excellent wakefield package.

 
vis_dat(typical_data)

We can also look at even wider data, using typical_data_large.

 
vis_dat(typical_data_large)

Using vis_miss()

We can explore the missing data further using vis_miss().

 
vis_miss(airquality)

The percentages of missing/complete in vis_miss are accurate to 1 decimal place.
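To see where such percentages come from, the overall proportion of missing cells can be computed by hand. This is a base R sketch of the arithmetic, not visdat's own code:

```r
# Share of all cells in airquality that are missing, as a percentage
pct_missing <- round(mean(is.na(airquality)) * 100, 1)

# The remainder is the share of complete cells
pct_complete <- round(100 - mean(is.na(airquality)) * 100, 1)

c(missing = pct_missing, complete = pct_complete)
```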

You can cluster the missingness by setting cluster = TRUE.

 
vis_miss(airquality, 
         cluster = TRUE)
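To give a rough idea of what clustering on missingness means, the sketch below reorders rows with hierarchical clustering on a missingness indicator matrix. It uses only base R and the stats package. The default distance and linkage are assumptions made for illustration and are not necessarily what vis_miss uses internally.

```r
# 1 where a cell is missing, 0 where it is present
miss_matrix <- is.na(airquality) * 1

# Hierarchical clustering of rows on their missingness pattern
row_order <- stats::hclust(stats::dist(miss_matrix))$order

# Rows with similar missingness patterns now sit next to each other
airquality_clustered <- airquality[row_order, ]
head(airquality_clustered)
```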

The columns can also be arranged from most to least missingness by setting sort_miss = TRUE.

 
vis_miss(airquality,
         sort_miss = TRUE)
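The same ordering can be reproduced outside the plot. A minimal base R sketch, which assumes nothing about how vis_miss implements it:

```r
# Count missing values per column
na_counts <- colSums(is.na(airquality))

# Order columns from most to least missing
col_order <- order(na_counts, decreasing = TRUE)
airquality_sorted <- airquality[, col_order]

names(airquality_sorted)
```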

vis_miss indicates when there is a very small amount of missing data (less than 0.1% missingness).

 
test_miss_df <- data.frame(x1 = 1:10000,
                           x2 = rep("A", 10000),
                           x3 = c(rep(1L, 9999), NA))
 
vis_miss(test_miss_df)
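The arithmetic behind this case is easy to check: one missing cell out of 30,000 is far below the 0.1% threshold. A base R sketch that rebuilds the same data frame so it can be run on its own:

```r
test_miss_df <- data.frame(x1 = 1:10000,
                           x2 = rep("A", 10000),
                           x3 = c(rep(1L, 9999), NA))

# Percentage of all cells that are missing
pct_missing <- mean(is.na(test_miss_df)) * 100
pct_missing

# TRUE: there is some missing data, but less than 0.1%
pct_missing > 0 && pct_missing < 0.1
```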

vis_miss will also indicate when there is no missing data at all.

 
vis_miss(mtcars)

Thank yous

Thank you to Ivan Hanigan, who first suggested this in a comment after I made a blog post about an initial prototype, ggplot_missing, and to Jenny Bryan, whose tweet got me thinking about vis_dat and whose code contributions removed a lot of errors.

Thank you to Hadley Wickham for suggesting the use of the internals of readr to make vis_guess work.

Thank you to Miles McBain for his suggestions on how to improve vis_guess. This resulted in making it at least 2-3 times faster.

Thanks also to Carson Sievert for writing the code that combined plotly with visdat, and to Noam Ross for suggesting this in the first place.

News

visdat 0.1.0 (2017/07/03)

  • lightweight CRAN submission - will only contain functions vis_dat and vis_miss

visdat 0.0.7.9100 (2017/07/03)

New Features

  • add_vis_dat_pal() (internal) to add a palette for vis_dat and vis_guess
  • vis_guess now gets a palette argument like vis_dat
  • Added prototype/placeholder functions for plotly vis_*_ly interactive graphs:
    • vis_guess_ly()
    • vis_dat_ly()
    • vis_compare_ly()
    These simply wrap plotly::ggplotly(vis_*(data)). In the future they will be written in plotly so that they can be generated much faster.
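
As noted above, these wrappers pass a vis_* plot to plotly::ggplotly(). A minimal sketch of doing the same by hand, assuming both visdat and plotly are installed (the guard simply skips the example when they are not):

```r
if (requireNamespace("visdat", quietly = TRUE) &&
    requireNamespace("plotly", quietly = TRUE)) {
  # Build the static ggplot2 figure first
  static_plot <- visdat::vis_dat(airquality)

  # Convert it to an interactive plotly widget
  interactive_plot <- plotly::ggplotly(static_plot)
  interactive_plot
}
```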

Minor Improvements

  • corrected testing for vis_* family
  • added .svg graphics for correct vdiffr testing
  • improved hover print method for plotly.

visdat 0.0.6.9000 (2017/02/26)

New Features

  • axes in vis_ family are now flipped by default
  • vis_miss now shows the % missingness in a column, can be disabled by setting show_perc_col argument to FALSE
  • removed flip argument, as this should be the default

Minor Improvements

  • added internal functions to improve extensibility and debugging - vis_create_, vis_gather_ and vis_extract_value_.
  • suppress unneeded warnings arising from compiling factors

visdat 0.0.5.9000 (2017/01/09)

Minor Improvements

  • Added testing for visualisations with vdiffr. Code coverage is now at 99%
  • Fixed up suggestions from goodpractice::gp()
  • Submitted to rOpenSci onboarding
  • paper.md written and submitted to JOSS

visdat 0.0.4.9999 (2017/01/08)

New Feature

  • Added feature flip = TRUE, to vis_dat and vis_miss. This flips the x axis and the ordering of the rows. This more closely resembles a dataframe.
  • vis_miss_ly is a new function that uses plotly to plot missing data, like vis_miss, but interactive, without the need to call plotly::ggplotly on it. It's fast, but at the moment it needs a bit of love on the legend front to maintain the style and features (clustering, etc) of current vis_miss.
  • vis_miss now gains a show_perc argument, which displays the % of missing and complete data. This is switched on by default and addresses issue #19.

New Feature (under development)

  • vis_compare is a new function that allows you to compare two dataframes of the same dimension. It gives a fairly ugly warning if they are not of the same dimension.
  • vis_dat gains a "palette" argument in line with issue 26, with colours drawn from http://colorbrewer2.org/. There are currently three options: "default", "qual", and "cb_safe". "default" provides the ggplot defaults, "qual" uses some colours that are unfriendly to colour blindness, and "cb_safe" provides some colours friendly for colour blindness.
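
A short usage sketch of the palette argument described above, assuming visdat is installed. The option name "cb_safe" is taken from the entry itself:

```r
if (requireNamespace("visdat", quietly = TRUE)) {
  # Use the colour-blind-friendly palette instead of the ggplot defaults
  palette_plot <- visdat::vis_dat(airquality, palette = "cb_safe")
  palette_plot
}
```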

Minor Improvements

  • All lines are < 80 characters long
  • removed all instances of 1:nrow(x) and replaced with seq_along(nrow(x)).
  • Updated documentation, improved legend and colours for vis_miss_ly.
  • removed export for vis_dat_ly, as it currently does not work.
  • Removed a lot of unnecessary @importFrom tags, included magrittr in this, and added magrittr to Imports
  • Changes ALL CAPS Headers in news to Title Case
  • Made it clear that vis_guess() and vis_compare are very beta
  • updated documentation in README and vis_dat(), vis_miss(), vis_compare(), and vis_guess()
  • updated pkgdown docs
  • updated DESCRIPTION URL and bug report
  • Changed the default colours of vis_compare to be different to the ggplot2 standards.
  • vis_miss legend labels are created using the internal function miss_guide_label. miss_guide_label will check if data is 100% missing or 100% present and display this in the figure. Additionally, if there is less than 0.1% missing data, "<0.1% missingness" will also be displayed. This sort of gets around issue #18 for the moment.
  • tests have been added for the miss_guide_label legend labels function.
  • Changed legend label for vis_miss, vis_dat, and vis_guess.
  • updated README
  • Added vignette folder (but no vignettes added yet)
  • Added appveyor-CI and travis-CI, addressing issues #22 and #23
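
The legend label rules described for miss_guide_label above can be sketched in a few lines of base R. The function name label_missing_sketch and the exact label wording are hypothetical. This is an illustration of the described rules, not the package's internal code.

```r
# Hypothetical helper: NOT visdat's miss_guide_label, only the rules as described
label_missing_sketch <- function(x) {
  pct_missing <- mean(is.na(x)) * 100
  if (pct_missing == 0) {
    "No missing values"              # data is 100% present
  } else if (pct_missing == 100) {
    "All values missing"             # data is 100% missing
  } else if (pct_missing < 0.1) {
    "<0.1% missingness"              # some missing data, but very little
  } else {
    paste0(round(pct_missing, 1), "% missingness")
  }
}

label_missing_sketch(mtcars)
label_missing_sketch(airquality)
label_missing_sketch(data.frame(x = c(NA, rep(1, 9999))))
```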

Bug Fixes

  • Update vis_dat() to use purrr::dmap(fingerprint) instead of mutate_each_(). This solves issue #3 where vis_dat couldn't take variables with spaces in their name.
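
A small sketch of the case this fix addresses: a data frame whose column names contain spaces. The data frame spaced_df and its column names are made up for illustration, and the vis_dat call assumes visdat is installed.

```r
# check.names = FALSE keeps the spaces in the column names
spaced_df <- data.frame("air temp" = c(20.5, NA, 22.1),
                        "site id" = c("a", "b", NA),
                        check.names = FALSE)
names(spaced_df)

if (requireNamespace("visdat", quietly = TRUE)) {
  visdat::vis_dat(spaced_df)
}
```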

visdat 0.0.3.9000


New Features

  • Interactivity with plotly::ggplotly! Functions vis_guess(), vis_dat(), and vis_miss() were updated so that you can make them all interactive using the latest dev version of plotly from Carson Sievert.

visdat 0.0.2.9000


New Features

  • Introducing vis_guess(), a function that uses the unexported function collectorGuess from readr.
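
As a rough illustration of the kind of type guessing involved, the sketch below calls readr's exported guess_parser() on individual values. It assumes readr is installed, and it deliberately uses the exported function rather than the unexported collectorGuess named above, which may not be available in current readr versions.

```r
if (requireNamespace("readr", quietly = TRUE)) {
  values <- c("1", "2.5", "2017-07-03", "TRUE", "abc")

  # Ask readr which parser it would choose for each value on its own
  guesses <- vapply(values, readr::guess_parser, character(1))
  guesses
}
```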

visdat 0.0.1.9000


New Features

  • vis_miss() and vis_dat actually run


install.packages("visdat")

0.5.1 by Nicholas Tierney, 17 days ago


http://visdat.njtierney.com/, https://github.com/ropensci/visdat


Report a bug at https://github.com/ropensci/visdat/issues


Browse source code at https://github.com/cran/visdat


Authors: Nicholas Tierney [aut, cre] (<https://orcid.org/0000-0003-1460-8722>), Sean Hughes [rev] (<https://orcid.org/0000-0002-9409-9405>, Sean Hughes reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Mara Averick [rev] (Mara Averick reviewed the package for rOpenSci, see https://github.com/ropensci/onboarding/issues/87), Stuart Lee [ctb], Earo Wang [ctb]




MIT + file LICENSE license


Imports ggplot2, tidyr, dplyr, purrr, readr, plotly, magrittr, stats, tibble, rlang

Suggests testthat, knitr, rmarkdown, vdiffr, gdtools


Imported by PCRedux, naniar.

