Data and Variable Transformation Functions

Collection of miscellaneous utility functions, supporting data transformation tasks like recoding, dichotomizing or grouping variables, setting and replacing missing values. The data transformation functions also support labelled data, and all integrate seamlessly into a 'tidyverse'-workflow.


The functions of sjmisc are designed to work together seamlessly with other packages from the tidyverse, like dplyr. For instance, you can use the functions from sjmisc both within a pipe-workflow to manipulate data frames and inside mutate() to create new variables. See vignette("design_philosophy", "sjmisc") for more details.
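
A minimal sketch of these two usage patterns follows. The data frame df and its columns are made up for illustration, and dicho() is just one of the transformation functions that could be used here; the new column name age_d assumes the default suffix of dicho().

```r
library(dplyr)
library(sjmisc)

# made-up example data, not part of the package
df <- data.frame(age = c(23, 35, 47, 52, 68), score = c(1, 2, 3, 4, 5))

# pipe workflow: a data frame goes in, a data frame comes out;
# with append = TRUE the dichotomized variable is added as a new column
piped <- df %>% dicho(age, dich.by = 40, append = TRUE)

# inside mutate(): a vector goes in, a vector comes out
mutated <- df %>% mutate(age_split = dicho(age, dich.by = 40))

names(piped)
names(mutated)
```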

Installation

Latest development build

To install the latest development snapshot (see latest changes below), type the following commands into the R console:

library(devtools)
devtools::install_github("strengejacke/sjmisc")

Please note the package dependencies when installing from GitHub. The GitHub version of this package may depend on latest GitHub versions of my other packages, so you may need to install those first, if you encounter any problems. Here's the order for installing packages from GitHub:

sjlabelled, sjmisc, sjstats, ggeffects, sjPlot
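
If you need to install the whole chain, the order above could be scripted roughly as follows. This is only a sketch: the helper install_in_order() is hypothetical, and it assumes that all five packages live under the same GitHub account as sjmisc, which this page only confirms for sjmisc itself.

```r
# installation order as listed above
pkgs <- c("sjlabelled", "sjmisc", "sjstats", "ggeffects", "sjPlot")

# assumption: every package is hosted under the "strengejacke" account
repos <- paste0("strengejacke/", pkgs)

# hypothetical helper: installs the repositories one after another,
# so each package finds its dependencies already in place
install_in_order <- function(repos) {
  for (repo in repos) {
    devtools::install_github(repo)
  }
  invisible(repos)
}

# install_in_order(repos)  # uncomment to actually run the installation
```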

Official, stable release

To install the latest stable release from CRAN, type the following command into the R console:

install.packages("sjmisc")

References, documentation and examples

A cheatsheet can be downloaded from here (PDF) or from the RStudio cheatsheet collection.

For more examples, see package vignettes (browseVignettes("sjmisc")).

Citation

In case you want / have to cite my package, please use citation('sjmisc') for citation information.

News

sjmisc 2.7.0

General

  • Breaking changes: The append-argument in recode and transformation functions like rec(), dicho(), split_var(), group_var(), center(), std(), recode_to(), row_sums(), row_count(), col_count() and row_means() now defaults to TRUE.
  • The print()-method for descr() now accepts a digits-argument, to specify the rounding of the output.
  • Cross references from dplyr::select_helpers were updated to tidyselect::select_helpers.
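
To illustrate what the append-argument controls, here is a small sketch with rec() on a made-up data frame. The argument is passed explicitly in both calls so that the example behaves the same regardless of the installed version; the column name x_r assumes the default suffix of rec().

```r
library(sjmisc)

# made-up example data
df <- data.frame(x = c(1, 2, 3, 4), y = c(4, 3, 2, 1))

# append = TRUE (the default as of 2.7.0): original columns plus the recoded one
with_append <- rec(df, x, rec = "1,2=0; 3,4=1", append = TRUE)

# append = FALSE: only the recoded variable is returned
only_new <- rec(df, x, rec = "1,2=0; 3,4=1", append = FALSE)

names(with_append)
names(only_new)
```

Code that relied on the old behaviour (only the transformed variables being returned) may need an explicit append = FALSE.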

New functions

  • is_whole() as counterpart to is_float().

Changes to functions

  • frq() now prints variable names for non-labelled data, adds variable names in braces for labelled data and omits the label column for non-labelled data.
  • frq() now prints mean and standard deviation in the header line of the output.
  • frq() now gets an auto.grp-argument to automatically group variables with many unique values.
  • frq() now gets a show.strings-argument to omit string variables (character vectors) from being printed as frequency table.
  • frq() now gets a grp.strings-argument to group similar string values in the frequency table.
  • frq() gets an out-argument, to print output to console, or as HTML table in the viewer or web browser.
  • descr() gets an out-argument, to print output to console, or as HTML table in the viewer or web browser.
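
As a rough sketch of two of the new frq() arguments, the following uses a made-up data frame with one numeric and one character column. It assumes that out = "txt" is the value for console output and that show.strings = FALSE drops character columns before the tables are built.

```r
library(sjmisc)

# made-up example data with one string variable
df <- data.frame(
  grade = c(1, 2, 2, 3, 3, 3),
  name = c("a", "b", "b", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# frequency tables for all variables
tab_all <- frq(df, out = "txt")

# same call, but string variables are omitted
tab_num <- frq(df, show.strings = FALSE, out = "txt")

print(tab_num)
```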

Bug fixes

  • is_empty() returned TRUE for single vectors with NA being the first element.
  • Fixed an issue where, due to a bug introduced during code cleanup, remove_empty_rows() no longer removed empty rows, but removed columns instead.
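
For reference, the intended behaviour of remove_empty_rows() and its column counterpart can be sketched on a made-up data frame in which row 2 and column b contain only NA values:

```r
library(sjmisc)

# made-up example data: row 2 and column "b" are completely missing
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, NA),
  c = c("x", NA, "z"),
  stringsAsFactors = FALSE
)

# drops observations that consist of NA values only
no_empty_rows <- remove_empty_rows(df)

# drops variables that consist of NA values only
no_empty_cols <- remove_empty_cols(df)

dim(no_empty_rows)
names(no_empty_cols)
```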

sjmisc 2.6.3

General

  • Revised examples that used removed methods from other packages.
  • Use select-helpers from package tidyselect, instead of dplyr.
  • Beautiful colored output for frq(), descr() and flat_table().

Changes to functions

  • rec() now also recodes doubles with floating points, if a range of values is specified.
  • std() and center() now use include.fac = FALSE as default option.
  • std() gets a robust-argument, to divide variables either by standard deviation, or - in case of asymmetrically distributed variables - median absolute deviation or Gini's mean difference.
  • frq() now shows total and valid N in output.

Bug fixes

  • center(), std(), dicho(), split_var() and group_var() did not work correctly for grouped data frames.
  • frq() did not print multiple variables when applied on grouped data frames.

sjmisc 2.6.2

Changes to functions

  • Arguments as.df and as.varlab in function find_var() are now deprecated. Please use out instead.
  • rotate_df() preserves attributes.
  • is_float() is now exported as function.

Bug fixes

  • Fixed bug for to_label(), when x was a character vector and argument drop.levels was TRUE.

sjmisc 2.6.1

General

  • Fixed issue with latest tidyr-update on CRAN.

Bug fixes

  • frq() did not correctly calculate valid and cumulative percentages when using weights.

sjmisc 2.6.0

General

  • All labelled-data functions were removed and are now in package sjlabelled.

New functions

  • remove_var() as pipe-friendly function to remove variables from data frames.
  • var_type() as pipe-friendly function to determine the type of variables.
  • all_na() to check whether a vector only consists of NA values.
  • rotate_df() to rotate data frames (switch columns and rows).
  • shorten_string(), to shorten strings to a certain maximum number of chars.

Changes to functions

  • Following functions now also work on grouped data frames: dicho(), split_var(), group_var(), std() and center().
  • Argument groupcount in split_var(), group_var() and group_labels() is now named n.
  • Argument groupsize in group_var() and group_labels() is now named size.
  • frq() gets a revised print-method, which does not print the result to console when captured in an object (i.e., x <- frq(x) no longer prints the result).
  • frq() no longer prints (redundant) labels for factors w/o value label attributes.
  • frq() adds information about the variable type in the table caption (only for variables with variable labels).
  • frq() adds information about groups when printing grouped, non-labelled variables.
  • descr() now also prints information about the variable type.
  • to_character() now preserves variable labels.

sjmisc 2.5.0

General

  • sjmisc now uses dplyr's tidyeval-approach to evaluate arguments. This means that the select-helper-functions (like one_of() or contains()) no longer need to be prefixed with a ~ when used as argument within sjmisc-functions.
  • All labelled-data functions are now deprecated and will become defunct in future package versions. The labelled-data functions have been moved into a separate package, sjlabelled.
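
A short sketch of the tidyeval-style selection described above: a select-helper is passed directly, without a leading ~. The data frame and its item columns are made up, and starts_with() is taken from dplyr here.

```r
library(dplyr)
library(sjmisc)

# made-up example data
df <- data.frame(
  item1 = c(1, 2, 2, 1),
  item2 = c(2, 2, 1, 1),
  id = 1:4
)

# recode all variables whose names start with "item";
# the select-helper needs no "~" prefix
recoded <- rec(df, starts_with("item"), rec = "1=0; 2=1", append = FALSE)

names(recoded)
```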

New functions

  • row_count() to count specific values in a data frame per observation.
  • col_count() to count specific values in a data frame per variable.
  • str_start() and str_end() to find starting and end indices of patterns inside strings.

Changes to functions

  • The output for frq() now always includes a NA-row, but no longer prints a value for the NA-row.
  • merge_imputations() gets a summary-argument to plot a graphical summary of the quality of the merging process.

Bug fixes

  • add_columns() and replace_columns() crashed R when no data frame was specified in ...-ellipses argument.
  • descr() and frq() used wrong variable labels when processing grouped data frames in specific situations where the grouping variable had no sequential values.
  • descr() did not work for large data frames, because internally psych::describe() then switched to fast mode by default (removing columns from the output).

sjmisc 2.4.0

General

  • Argument value in set_na() is deprecated. Please use na instead.
  • Argument recodes in rec() is deprecated. Please use rec instead.
  • Argument lab in set_label() is deprecated. Please use label instead.
  • Argument value in add_labels() and replace_labels() is deprecated. Please use labels instead.
  • Argument value in ref_lvl() is deprecated. Please use lvl instead.
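
For example, a call to set_na() with the renamed argument might look like this (the vector and the choice of 9 as missing code are made up for illustration):

```r
library(sjmisc)

# made-up example: 9 is used as a code for "missing"
x <- c(1, 2, 9, 3, 9)

# use na instead of the deprecated value-argument
cleaned <- set_na(x, na = 9)

cleaned
```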

New functions

  • row_sums() as wrapper of rowSums() to compute row sums, but works within pipe-workflow and with select-helpers for variables, and always returns a tibble.
  • row_means() as wrapper of sjstats::mean_n() to compute row means, but works within pipe-workflow and with select-helpers for variables, and always returns a tibble.
  • %nin% as complement to %in%.
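
A brief sketch of %nin% on a made-up character vector; it is meant to read as the negation of %in%:

```r
library(sjmisc)

# made-up example data
x <- c("a", "b", "c", "d")
excluded <- c("b", "d")

# TRUE for every element of x that is not found in excluded
not_in <- x %nin% excluded

# keep only the elements that are not excluded
kept <- x[not_in]
kept
```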

Changes to functions

  • Functions rec(), dicho(), center(), std(), recode_to() and group_var() get an append-argument, to optionally return the original data including the transformed variables as new columns.
  • var_labels() and var_rename() now give a warning if specified variables to rename or relabel do not exist in the data frame. Non-matching variables are ignored.
  • If model.term does not exist in models, spread_coef() now prints the name of non-existing coefficients.
  • find_var() gets a fuzzy-argument to enable fuzzy-matching for search pattern.

Bug fixes

  • remove_empty_cols() returned an empty data frame, when input data frame had no empty columns.
  • remove_empty_rows() returned an empty data frame, when input data frame had no empty rows.
  • add_columns() and replace_columns() in some cases coerced data frames of class data.frame with only one column into a vector, which gave an error when binding columns.
  • Argument part.dist.match in str_pos() caused an error when being larger than 0.

sjmisc 2.3.1

General

  • Re-exports magrittr::%>% (Bob Rudis said more packages should do so).

New functions

  • replace_columns() to replace variables in one data frame with variables from other data frames.

Changes to functions

  • descr() gets a max.length-argument to shorten variable labels in the output to a specific number of chars.
  • descr() now also reports the percentage of missing values.
  • set_na() no longer gives a warning when trying to replace values with NA for vectors that completely contained NAs.

Bug fixes

  • descr() now also works on single vectors as data argument.
  • Fixed bugs with write_*()-functions.

sjmisc 2.3.0

General

  • Added package-vignettes.
  • Functions were largely revised to work seamlessly within the tidyverse. This may break existing code, but in the long run, consistent api-design makes working with the package more intuitive. See vignette("design_philosophy", "sjmisc") for more details.
  • as_labelled() only converts vectors into labelled-class if vector has label attributes. This ensures that data can be properly saved into other formats, e.g. with write_spss().
  • The write_*()-functions have been revised and should now save data frame with any common classes of vectors (labelled, factor, numeric, atomic...).

New functions

  • center() and std() are moving from package sjstats to sjmisc.
  • add_columns() to bind columns of first data frame at the end of all data frames.

Changes to functions

  • frq() now has the same argument-structure as flat_table().
  • Following functions now follow a consistent tidyverse-approach, with the data being the first argument, followed by variable names: add_labels(), replace_labels(), remove_labels(), count_na(), rec(), dicho(), split_var(), drop_labels(), fill_labels(), group_var(), group_labels(), ref_lvl(), recode_to(), replace_na(), set_na() and set_labels().
  • get_values() now sorts returned values by default, to be consistent with get_labels().
  • spread_coef() gets arguments se and p.val, to define whether standard errors and p-values should be included in the return value or not.

Bug fixes

  • merge_df() did not copy label attributes for data frame with identical variables (that were row-bound).
  • to_value() did not work for character vectors of class labelled, with empty values (which typically have no value label).

sjmisc 2.2.1

Bug fixes

  • The sort.frq-argument did not work in frq().

sjmisc 2.2.0

New functions

  • zap_inf() to "clean" vectors from NaN and infinite values.
  • descr() to provide basic descriptive statistics (similar to describe() in the psych-package), but including variable labels and usable in pipe-workflows. Also works with grouped data frames.
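
As a small illustration of zap_inf(), the following applies it to a made-up numeric vector. The sketch assumes that NaN and infinite values are replaced by NA while all other values are left untouched.

```r
library(sjmisc)

# made-up example data containing NaN and infinite values
x <- c(1, NaN, Inf, -Inf, 4)

# "clean" the vector: NaN and infinite values become NA
cleaned <- zap_inf(x)

cleaned
```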

Changes to functions

  • rec(), split_var() and dicho() get an argument suffix, to append a suffix to variable (column) names, if applied on a data frame.
  • Value labels in rec() can now directly be assigned inside the recodes-syntax (see 'Details' in ?rec).
  • find_var() gets a as.df-argument, to return a data frame with matching variables, instead of their column indices only.
  • find_var() gets a as.varlab-argument, to return a "summary" data frame with column number, variable name and variable label.
  • flat_table() now also accepts grouped data frames.
  • flat_table() gets a show.values-argument, to add values to associated labels in output.
  • frq() now also accepts grouped data frames.
  • frq() gets a weight.by-argument to weight frequencies.
  • set_na() can now also find values by their value labels and replace them with NA.
  • set_na() now removes unused value labels from values that have been replaced with NA.
  • The as.tag-argument in set_na() now defaults to FALSE.
  • get_labels() now always returns labels in sorted order of the associated values.
  • get_labels() gets a drop.unused-argument, to automatically drop labels from values that don't occur in the vector.
  • For a named vector as labels-argument, set_labels() now always sorts labels in sorted order of the associated values.
  • is_empty() gets a first.only-argument, to evaluate either first or all elements of a character vector.

Bug fixes

  • set_na() did not work on vectors of class Date when argument as.tag = TRUE.
  • flat_table() did not show values that had no value labels. Now all categories are shown in the frequency table.
  • rec() did not properly copy labels of tagged NA values when not all recoded values appeared in the vector.
  • frq() did not show correct values, when value labels of a vector were not sorted according to their values.
  • set_labels() did not set labels properly for ordered factors.
  • remove_labels() returned NA-values for value labels (instead of no value labels) when the last value label of a vector was removed.

sjmisc 2.1.0

New functions

  • find_var() to find variables in data frames by name or label.
  • var_labels() as "tidyversed" alternative to set_label() to set variable labels.
  • var_rename() to rename variables.

Changes to functions

  • Following functions now get an ellipses-argument ..., to apply function only to selected variables, but return the complete data frame (thus, overwriting existing variables in a data frame, if requested): to_factor(), to_value(), to_label(), to_character(), to_dummy(), zap_labels(), zap_unlabelled(), zap_na_tags().

Bug fixes

  • Fixed bug with copying attributes from tibbles for merge_df().
  • Fixed wrong argument-description in docs of frq().

sjmisc 2.0.1

General

  • Removed package coin from Imports.

New functions

  • count_na() to print a frequency table of tagged NA values.

Changes to functions

  • set_na() gets a drop.levels argument to keep or drop factor levels of values that have been replaced with NA.
  • set_na() gets a as.tag argument to set NA values as regular or tagged NA.

sjmisc 2.0.0

General

  • sjmisc now supports tagged NA values, a new structure for labelled missing values introduced by the haven-package. This means that functions or arguments that are no longer useful, have been removed while other functions dealing with NA values have been largely revised.
  • All statistical functions have been removed and are now in a separate package, sjstats.
  • Removed some S3-methods for labelled-class, as these are now provided by the haven-package.
  • Functions no longer check input for type matrix, to avoid conflicts with scaled vectors (that were recognized as matrix and hence treated as data frame).
  • table(*, exclude = NULL) was changed to table(*, useNA = "always"), because of planned changes in upcoming R version 3.4.
  • More functions (like trim() or frq()) now also have data frame- or list-methods.

New functions

  • zap_na_tags() to turn tagged NA values into regular NA values.
  • spread_coef() to spread coefficients of multiple fitted models in nested data frames into columns.
  • merge_imputations() to find the most likely imputed value for a missing value.
  • flat_table() to print flat (proportional) tables of labelled variables.
  • Added to_character() method.
  • big_mark() to format large numbers with big marks.
  • empty_cols() and empty_rows() to find variables or observations with exclusively NA values in a data frame.
  • remove_empty_cols() and remove_empty_rows() to remove variables or observations with exclusively NA values from a data frame.

Changes to functions

  • str_contains() gets a switch argument to switch the role of x and pattern.
  • word_wrap() coerces vectors to character if necessary.
  • to_label() gets a var.label and drop.levels argument, and now preserves variable labels by default.
  • Argument def.value in get_label() now also applies to data frame arguments.
  • If factor levels are numeric and factor has value labels, these are used in to_value() by default.
  • to_factor() no longer generates NA or NaN-levels when converting input into factors.

Bug fixes

  • rec() did not recode values, when these were the first element of a multi-line string of the recodes argument.
  • is_empty() returned NA instead of TRUE for empty character vectors.
  • Fixed bug with erroneous assignment of value labels to subset data when using copy_labels() (#20)
