Missing values are ubiquitous in data and need to be explored and handled in the initial stages of analysis. 'naniar' provides data structures and functions that facilitate the plotting of missing values and examination of imputations. This allows missing data dependencies to be explored with minimal deviation from the common work patterns of 'ggplot2' and tidy data.
naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data. Among other things, it provides numerical summaries of missing data for variables (such as miss_var_run) and for cases.
For more details on the workflow and theory underpinning naniar, read the vignette Getting started with naniar.
For a short primer on the data visualisation available in naniar, read the vignette Gallery of Missing Data Visualisations.
Currently naniar is only available on GitHub.
Visualising missing data might sound a little strange - how do you visualise something that is not there? One approach to visualising missing data comes from ggobi and manet, where we replace "NA" values with values 10% lower than the minimum value in that variable. This is provided with the geom_miss_point() ggplot2 geom, which we can illustrate by exploring the relationship between Ozone and Solar radiation from the airquality dataset.
library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values (geom_point).
ggplot2 does not display these missing values: it drops the affected rows and only issues a warning about them.
We can instead use geom_miss_point() to display the missing data:
library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()
geom_miss_point() has shifted the missing values to now be 10% below the minimum value. The missing values are a different colour so that missingness becomes pre-attentive. As it is a ggplot2 geom, it supports faceting and other ggplot2 features.
p1 <- ggplot(data = airquality,
             aes(x = Ozone,
                 y = Solar.R)) +
  geom_miss_point() +
  facet_wrap(~Month, ncol = 2) +
  theme(legend.position = "bottom")

p1
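The shifting idea behind geom_miss_point() can be sketched in base R. This is only an illustration: shift_missing() is a hypothetical helper, not a naniar function, and it assumes that "10% below the minimum" means 10% of the observed range below the observed minimum. The offset naniar actually uses may differ.

```r
# Hypothetical helper, not part of naniar: replace each NA with a value
# placed 10% of the observed range below the observed minimum.
shift_missing <- function(x, prop_below = 0.1) {
  x_min <- min(x, na.rm = TRUE)
  x_range <- diff(range(x, na.rm = TRUE))
  shifted <- x
  shifted[is.na(x)] <- x_min - prop_below * x_range
  shifted
}

ozone_shifted <- shift_missing(airquality$Ozone)
solar_shifted <- shift_missing(airquality$Solar.R)

# Record missingness before it is overwritten, so it could drive the colour
# of the points in a plot.
any_missing <- is.na(airquality$Ozone) | is.na(airquality$Solar.R)

# The shifted values sit below the observed range, so they stay visible
# on a scatterplot instead of being dropped.
range(airquality$Ozone, na.rm = TRUE)
range(ozone_shifted)
sum(any_missing)
```

The rows flagged in any_missing are the ones that a plain geom_point() would drop with a warning.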
naniar provides a data structure for working with missing data, the shadow matrix (Swayne and Buja, 1998). The shadow matrix has the same dimensions as the data and consists of binary indicators of the missingness of data values, where missing is represented as "NA" and not missing is represented as "!NA". Variable names are kept the same, with the suffix "_NA" added to each.
head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

as_shadow(airquality)
#> # A tibble: 153 x 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>    <fctr>   <fctr>     <fctr>  <fctr>  <fctr>   <fctr>
#>  1 !NA      !NA        !NA     !NA     !NA      !NA
#>  2 !NA      !NA        !NA     !NA     !NA      !NA
#>  3 !NA      !NA        !NA     !NA     !NA      !NA
#>  4 !NA      !NA        !NA     !NA     !NA      !NA
#>  5 NA       NA         !NA     !NA     !NA      !NA
#>  6 !NA      NA         !NA     !NA     !NA      !NA
#>  7 !NA      !NA        !NA     !NA     !NA      !NA
#>  8 !NA      !NA        !NA     !NA     !NA      !NA
#>  9 !NA      !NA        !NA     !NA     !NA      !NA
#> 10 NA       !NA        !NA     !NA     !NA      !NA
#> # ... with 143 more rows
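To make the structure concrete, here is a minimal base R sketch of the shadow matrix idea. make_shadow() is a hypothetical helper written for this illustration; it is not how naniar implements as_shadow(), and it returns a plain data frame rather than a tibble.

```r
# Hypothetical helper, not part of naniar: build a shadow of a data frame.
make_shadow <- function(data) {
  shadow <- lapply(data, function(column) {
    # Two-level factor: "!NA" for observed values, "NA" for missing ones.
    factor(ifelse(is.na(column), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  # Keep the original variable names and add the "_NA" suffix.
  names(shadow) <- paste0(names(data), "_NA")
  as.data.frame(shadow)
}

shadow <- make_shadow(airquality)
dim(shadow)
head(shadow)
```

Because the shadow has the same number of rows as the data, it can be bound column-wise to the original data frame, which is the idea behind bind_shadow().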
Using the shadow matrix helps you keep track of where missing values are in your dataset and makes it easy to do visualisations where you split by missingness:
airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) +
  geom_density()
And even visualise imputations:
airquality %>%
  bind_shadow() %>%
  simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) +
  geom_point()
#> Warning: Removed 7 rows containing missing values (geom_point).
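The reason the shadow column matters for imputation can be sketched without naniar or simputation. The base R code below is an assumption-laden stand-in for the pipeline above: it fits an ordinary linear model with lm() and fills in Ozone wherever the predictors are available, which is only roughly analogous to what simputation::impute_lm() does.

```r
aq <- airquality

# Record missingness before imputing; afterwards the NAs are gone and this
# column is the only trace of which values were originally missing.
aq$Ozone_NA <- factor(ifelse(is.na(aq$Ozone), "NA", "!NA"),
                      levels = c("!NA", "NA"))

# Fit on the complete rows (lm() drops rows with missing values by default).
fit <- lm(Ozone ~ Temp + Solar.R, data = aq)

# Impute only where Ozone is missing and the predictors are observed.
to_fill <- is.na(aq$Ozone) & !is.na(aq$Solar.R) & !is.na(aq$Temp)
aq$Ozone[to_fill] <- predict(fit, newdata = aq[to_fill, ])

# Counts of originally observed versus originally missing Ozone values.
table(aq$Ozone_NA)

# Ozone values that could not be imputed because Solar.R is also missing.
sum(is.na(aq$Ozone))
```

Colouring a scatterplot by Ozone_NA then separates imputed points from observed ones, as in the ggplot2 call above.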
naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.
naniar also provides handy visualisations for each variable:
Or the number of missings in a given variable at a repeating span:
gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)
You can read about all of the visualisations in naniar in the vignette Gallery of missing data visualisations using naniar.
naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:
n_miss(airquality)
#> [1] 44
n_complete(airquality)
#> [1] 874
prop_miss(airquality)
#> [1] 0.04793028
prop_complete(airquality)
#> [1] 0.9520697
pct_miss(airquality)
#> [1] 4.793028
pct_complete(airquality)
#> [1] 95.20697
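For readers who want to see what these helpers compute, the same quantities can be derived in base R. This is a sketch of the arithmetic only; the variable names are made up for the example and it makes no claim about how naniar implements the helpers.

```r
# is.na() on a data frame returns a logical matrix with one cell per value.
missing_mask <- is.na(airquality)

n_total <- length(missing_mask)            # 153 rows * 6 columns
n_missing <- sum(missing_mask)             # number of missing values
n_complete_values <- n_total - n_missing   # number of observed values

prop_missing <- n_missing / n_total
prop_complete_values <- n_complete_values / n_total
pct_missing <- 100 * prop_missing

c(n_missing = n_missing,
  n_complete = n_complete_values,
  prop_missing = prop_missing,
  pct_missing = pct_missing)
```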
naniar provides numerical summaries of missing data that follow a consistent naming rule: they all begin with miss_. Summaries focussing on variables, or on a single selected variable, start with miss_var_, and summaries for cases (the rows of the data, in their initial collected order) start with miss_case_. All of these functions that return dataframes also work with dplyr's group_by().
For example, we can look at the number and percent of missings in each variable and case with miss_var_summary() and miss_case_summary(), which both return output ordered by the number of missing values.
miss_var_summary(airquality)
#> # A tibble: 6 x 3
#>   variable n_missing   percent
#>   <chr>        <int>     <dbl>
#> 1 Ozone           37 24.183007
#> 2 Solar.R          7  4.575163
#> 3 Wind             0  0.000000
#> 4 Temp             0  0.000000
#> 5 Month            0  0.000000
#> 6 Day              0  0.000000

miss_case_summary(airquality)
#> # A tibble: 153 x 3
#>     case n_missing  percent
#>    <int>     <int>    <dbl>
#>  1     5         2 33.33333
#>  2    27         2 33.33333
#>  3     6         1 16.66667
#>  4    10         1 16.66667
#>  5    11         1 16.66667
#>  6    25         1 16.66667
#>  7    26         1 16.66667
#>  8    32         1 16.66667
#>  9    33         1 16.66667
#> 10    34         1 16.66667
#> # ... with 143 more rows
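The two summaries above amount to column-wise and row-wise counts of missing values. A base R sketch of the same tables follows; it assumes nothing about naniar's internals, and the object names are chosen for this example only.

```r
# Per-variable summary: count NAs down each column.
var_n_missing <- colSums(is.na(airquality))
var_summary <- data.frame(
  variable = names(var_n_missing),
  n_missing = as.integer(var_n_missing),
  percent = 100 * as.numeric(var_n_missing) / nrow(airquality)
)
# Order by the number of missing values, largest first.
var_summary <- var_summary[order(-var_summary$n_missing), ]

# Per-case summary: count NAs across each row.
case_n_missing <- rowSums(is.na(airquality))
case_summary <- data.frame(
  case = seq_len(nrow(airquality)),
  n_missing = as.integer(case_n_missing),
  percent = 100 * as.numeric(case_n_missing) / ncol(airquality)
)
case_summary <- case_summary[order(-case_summary$n_missing), ]

var_summary
head(case_summary)
```

The percentage for a variable is taken over the number of rows, while the percentage for a case is taken over the number of columns.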
You could also use group_by() to work out the number of missings in each variable across the levels of a grouping variable.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

airquality %>%
  group_by(Month) %>%
  miss_var_summary()
#> # A tibble: 25 x 4
#>    Month variable n_missing  percent
#>    <int> <chr>        <int>    <dbl>
#>  1     5 Ozone            5 16.12903
#>  2     5 Solar.R          4 12.90323
#>  3     5 Wind             0  0.00000
#>  4     5 Temp             0  0.00000
#>  5     5 Day              0  0.00000
#>  6     6 Ozone           21 70.00000
#>  7     6 Solar.R          0  0.00000
#>  8     6 Wind             0  0.00000
#>  9     6 Temp             0  0.00000
#> 10     6 Day              0  0.00000
#> # ... with 15 more rows
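As a cross-check on the grouped output, the per-month counts for a single variable can be reproduced with base R's tapply(). This sketch covers only Ozone and is an illustration of the grouping idea, not of how naniar handles grouped data frames.

```r
# Number of missing Ozone values within each month.
ozone_missing_by_month <- tapply(is.na(airquality$Ozone),
                                 airquality$Month,
                                 sum)

# Percentage of missing Ozone values within each month: the mean of a
# logical vector is the proportion of TRUE values.
ozone_percent_by_month <- 100 * tapply(is.na(airquality$Ozone),
                                       airquality$Month,
                                       mean)

ozone_missing_by_month
ozone_percent_by_month
```

Note that the percentage is computed relative to the number of rows in each group, not the whole dataset.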
You can read more about all of these functions in the vignette "Getting Started with naniar".
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
geom_miss_* family to include categorical variables, Bivariate plots: scatterplots, density overlays
Firstly, thanks to Di Cook for giving the initial inspiration for the package and laying down the rich theory and literature that the work in naniar is built upon. Naming credit (once again!) goes to Miles McBain. Among various other things, Miles also worked out how to overload the missing data and make it work as a geom. Thanks also to Colin Fay for helping me understand tidy evaluation and for features such as miss_*_cumsum, and more.
naniar was previously named ggmissing and initially provided a ggplot geom and some other visualisations. ggmissing was changed to naniar to reflect the fact that this package is going to be bigger in scope, and is not just related to ggplot2. Specifically, the package is designed to provide a suite of tools for generating visualisations of missing values and imputations, and for manipulating and summarising missing data.
Well, I think it is useful to think of missing values in data being like this other dimension, perhaps like C.S. Lewis's Narnia - a different world, hidden away. You go inside, and sometimes it seems like you've spent no time in there but time has passed very quickly, or the opposite. Also, NAniar = na in r, and if you so desire, naniar may sound like "noneoya" in an nz/aussie accent. Full credit to @MilesMcbain for the name, and @Hadley for the rearranged spelling.
naniar onto CRAN, updates to naniar will happen reasonably regularly after this, approximately every 1-2 months
group_by is now respected by the following functions:
label_miss to be more consistent with the rest of naniar
miss_df_pct - this was literally the same as
show_pct argument to show the percentage of missing values (Thanks Jennifer for the helpful feedback! :))
miss_case_summary now have consistent output (one was ordered by n_missing, not the other).
x (as advised by Hadley)
replace_to_na is a complement to tidyr::replace_na and replaces a specified value from a variable to NA.
gg_miss_fct returns a heatmap of the number of missings per variable for each level of a factor. This feature was very kindly contributed by Colin Fay.
gg_miss_ functions now return a ggplot object, which behaves as such.
gg_miss_ basic themes can be overridden with ggplot functions. This fix was very kindly contributed by Colin Fay.
add_* functions handle bare unquoted names where appropriate, as per #61
geom_miss_point(), to keep consistent with the rest of the functions in
tao as per #59
added github issue / contribution / pull request guides
ts generic functions are now gg_miss_span and work on data.frame's, as opposed to just
add_shadow_shift() adds a column of shadow_shifted values to the current dataframe, adding "_shift" as a suffix
cast_shadow() - acts like bind_shadow() but allows for specifying which columns to add
shadow_shift now has a method for factors - powered by
gg_missing_* is changed to gg_miss_* to fit with other syntax
shadow_cat, as they are no longer needed, and have been superseded by
pedestrian - contains hourly counts of pedestrians
miss_ts_run(): return the number of missings / complete in a single run
miss_ts_summary(): return the number of missings in a given time period
gg_miss_ts(): plot the number of missings in a given time period
narnia - I had to explain the spelling a few times when I was introducing the package and I realised that I should change the name. Fortunately it isn't on CRAN yet.
prop_miss and the complement
n_miss returns the number of missing values, prop_miss returns the proportion of missing values. Likewise, prop_complete returns the proportion of complete values.
The left hand side functions have been made defunct in favour of the right hand side.
miss_* = I want to explore missing values
miss_case_* = I want to explore missing cases
miss_case_pct = I want to find the percentage of cases containing a missing value
miss_case_summary = I want to find the number / percentage of missings in each case
miss_case_table = I want a tabulation of the number / percentage of cases missing
This is more consistent and easier to reason with.
Thus, I have renamed the following functions:
These will be made defunct in the next release, 0.0.6.9000 ("The Wood Between Worlds").
n_complete is a complement to n_miss, and counts the number of complete values in a vector, matrix, or dataframe.
shadow_shift now handles cases where there is only 1 complete value in a vector.
After a burst of effort on this package I have done some refactoring and thought hard about where this package is going to go. This meant that I had to make the decision to rename the package from ggmissing to naniar. The name may strike you as strange but it reflects the fact that there are many changes happening, and that we will be working on creating a nice utopia (like Narnia by CS Lewis) that makes it easier to work with missing data.
add_prop_miss are helpers that add columns to a dataframe containing the number and proportion of missing values. An example has been provided to use decision trees to explore missing data structure as in Tierney et al
geom_miss_point() now supports transparency, thanks to @seasmith (Luke Smith)
more shadows. These are mainly around gather_shadow, which are helper functions to assist with creating
geom_missing_point() broke after the new release of ggplot2 2.2.0, but this is now fixed by ensuring that it inherits from GeomPoint, rather than just a new Geom. Thanks to Mitchell O'hara-Wild for his help with this.
missing data summaries
table_missing_case also now return more sensible numbers and variable names. It is possible these function names will change in the future, as these are kind of verbose.
semantic versioning was incorrectly entered in the DESCRIPTION file as 0.2.9000, so I changed it to 0.0.2.9000, and then to 0.0.3.9000 to indicate the new changes. Hopefully this won't come back to bite me later. I think I accidentally did this with visdat at some point as well. Live and learn.
gathered related functions into single R files rather than leaving them in their own.
correctly imported the %>% operator from magrittr, and removed a lot of chaff around @importFrom - really don't need to use @importFrom that often.
geom_missing_point() now works in a way that we expect! Thanks to Miles McBain for working out how to get this to work.
percent_missing_df - returns the percentage of missing data for a data.frame
percent_missing_var - the percentage of variables that contain missing values
percent_missing_case - the percentage of cases that contain missing values.
table_missing_var - table of missing information for variables
table_missing_case - table of missing information for cases
summary_missing_var - summary of missing information for variables (counts, percentages)
summary_missing_case - summary of missing information for cases (counts, percentages)