Tools for Easily Combining and Cleaning Data Sets

Tools for combining and cleaning data sets, particularly with grouped and time series data.


Christopher Gandrud

Version 0.2.21

Please report any bugs or suggestions at: https://github.com/christophergandrud/DataCombine/issues.

DataCombine is a set of miscellaneous tools intended to make combining data sets--especially time-series cross-section data--easier. The package is continually being developed as I turn lines of code that I frequently use into single functions. It currently includes the following functions:

  • CasesTable: reports cases after listwise deletion of missing values for time-series cross-sectional data.

  • change: calculates the absolute, percentage, and proportion change from a specified lag, including within groups.

  • CountSpell: function that returns a variable counting the spell number for an observation. Works with grouped data.

  • dMerge: merges two data frames and reports, drops, or keeps only duplicates.

  • DropNA: drops rows from a data frame when they have missing (NA) values on a given variable(s).

  • FillDown: fills in missing (NA) values with the previous non-missing value.

  • FillIn: fills in missing values of a variable from one data frame with the values from another variable.

  • FindDups: finds duplicated values in a data frame and subsets it to either include or exclude them.

  • FindReplace: replaces multiple patterns found in a character string column of a data frame.

  • grepl.sub: subsets a data frame if a specified pattern is found in a character string.

  • InsertRow: allows the user to insert a row into a data frame. Largely implements Ari B. Friedman's function.

  • MoveFront: moves variables to the front of a data frame. This can be useful if you have a data frame with many variables and want to move a variable or variables to the front.

  • NaVar: creates new variable(s) indicating if there are missing values in other variable(s).

  • shift: creates lag and lead variables, including for time-series cross-sectional data. The shifted variable is returned as a new vector. This function is largely based on TszKin Julian's shift function.

  • slide: creates lag and lead variables, including for time-series cross-sectional data. The slid variables are added to the original data frame. This expands the capabilities of shift.

  • slideMA: creates a moving average for a period before or after each time point for a given variable.

  • SpreadDummy: spreads a dummy variable (1's and 0's) over a specified time period and for specified groups.

  • StartEnd: finds the starting and ending time points of a spell, including for time-series cross-sectional data.

  • rmExcept: removes all objects from a workspace except those specified by the user.

  • TimeExpand: expands a data set so that it includes an observation for each time point in a sequence. Works with grouped data.

  • TimeFill: creates a continuous Unit-Time-Dummy data frame from a data frame with Unit-Start-End times.

  • VarDrop: drops one or more variables from a data frame.
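To make the lag and fill-down descriptions above concrete, here is a minimal base-R sketch of the two operations on toy time-series cross-section data. It does not call DataCombine and is not the package's implementation: the data frame and the helpers lag_one and fill_down are hypothetical names used only for illustration, and filling within each unit is a choice made for this example.

```r
# Toy time-series cross-section data: two units observed over three years.
df <- data.frame(
  unit = c("A", "A", "A", "B", "B", "B"),
  year = c(2001, 2002, 2003, 2001, 2002, 2003),
  x    = c(10, NA, 30, 5, 6, NA)
)

# Order by unit and time first, so "previous row" means "previous year".
df <- df[order(df$unit, df$year), ]

# One-period lag within each unit (hypothetical helper, not a package function).
# The first year of every unit has no earlier value, so it becomes NA.
lag_one <- function(v) c(NA, v[-length(v)])
df$x_lag1 <- ave(df$x, df$unit, FUN = lag_one)
# x_lag1: NA 10 NA NA 5 6

# Carry the previous non-missing value forward (hypothetical helper).
# Leading NAs stay NA because there is nothing earlier to copy.
fill_down <- function(v) {
  idx <- cumsum(!is.na(v))        # position of the latest non-missing value
  c(NA, v[!is.na(v)])[idx + 1]
}
df$x_filled <- ave(df$x, df$unit, FUN = fill_down)
# x_filled: 10 10 30 5 6 6
```

Computing the lag per unit is what keeps the last value of unit A from leaking into the first row of unit B, which is the main hazard when lagging grouped data by hand.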

I will continue to add to the package as I build data sets and run across other pesky tasks I do repeatedly that would be simpler if they were completed by a single function.

DataCombine is on CRAN.

You can also install the most recent stable version with install_github from the devtools package:

devtools::install_github('christophergandrud/DataCombine')

News

## Version 0.2.21

  • CasesTable function added to report cases after listwise deletion of missing values for time-series cross-sectional data.

  • Fixed a bug where CountSpell would fail if all SpellVar values were the same.

  • Improved documentation.

## Version 0.2.20

  • Internal changes to enable compatibility with dplyr version 0.4.4.

  • The Var argument in dMerge is deprecated. Use by instead to match merge syntax more closely.

  • DropNA drops all NAs from the supplied data frame if Var is missing.

## Version 0.2.19

  • Fixed a bug in dMerge that overwrote the user's suffixes specification. Thanks to @hplieninger for spotting this.

  • FillIn intelligently fails if Var1 and Var2 do not exist in their respective data.frames.

  • SpreadDummy now starts by coercing the data to a data frame, so that it works more smoothly with dplyr data tables.
  • change calculates the absolute, percent, and proportion changes from a specified lag, including within groups. It replaces PercChange, which had the same syntax but did not find absolute changes.
  • FindDups code and documentation improvements.
  • Added FindDups for finding duplicated values in a data frame and subsetting it to either include or exclude them.

## Version 0.2.14

  • grepl.sub now uses dot notation to pass arguments to grepl.
  • Improved error handling and documentation in CountSpell.

  • Improved error handling in FillDown for missing variables.

  • Arguments for slide can now be passed through StartEnd.

  • slide convert with tbl_df.

  • Fixed a bug in slide when using factor class variables as GroupVar.

  • Minor change to PercChange allowing arguments to be passed to slide.

  • Minor internal code improvements.

  • FillDown now accepts just vectors. This is useful when used with dplyr's group_by and mutate/summarize functions.

  • Added TimeVar argument to slide. The argument is optional. When it is specified, the data are ordered by Var-TimeVar before sliding.

  • Internal code improvements.

  • Added function TimeExpand. This expands a data set so that it includes an observation for each time point in a sequence. Works with grouped data.

  • Error message added to slide when using grouped data.

  • Added dplyr version dependency >= 0.3.

  • Minor FillIn change checking for missing rather than NULL for Var2.

  • Fixed a bug in slide when there are invalid sliding values.

  • Fixed a misspelling in the CountSpell documentation. Thanks to Sascha Schuster.
  • Minor error handling improvements to grepl.sub.
  • Added FillDown a function that fills in missing (NA) values with the previous non-missing value.

  • Internal code formatting improvements.

  • keepInvalid argument added to slide. This allows the user to specify whether or not to keep observations for groups for which no valid lag/lead can be created due to an insufficient number of time points. If TRUE then these groups are returned to the bottom of the data frame and NA is given for their new lag/lead variable value.

  • Minor internal improvement where slide (and associated functions) return only data frame class objects.

  • Minor documentation improvements to slideMA.

  • Internal improvements to MoveFront.

  • Added CountSpell, a function that returns a variable counting the spell number for an observation. Works with grouped data.

  • Added an OnlyStart argument to StartEnd to return only a new Spell_Start variable.

  • Documentation and internal code improvements.

  • FillIn internal code improvements.

  • !!! grepl.sub argument patterns changed to pattern. !!!

  • InsertRow function allows the user to insert a row into a data frame. Largely implements: http://stackoverflow.com/a/11562428.

  • Minor documentation improvements.

  • Minor internal code cleaning.

  • No longer depends on the forecast package.

  • slide and slideMA now rely on dplyr rather than plyr for speed improvements.

  • Minor StartEnd bug fix.

  • StartEnd function added. Finds the starting and ending time points of a spell.

  • Small improvement to slide to drop the temporary 'fake' variable before returning the data frame.

  • Small improvement to FillIn so that it doesn't attempt to find correlations if Var2 is not numeric.

  • Error message improvements.

  • Added SpreadDummy to spread a dummy variable (1's and 0's) over a specified time period and for specified groups.

  • Improved error handling in slide for when the number of rows in a group is smaller than the argument set with slideBy.

  • Minor argument default changes in FindReplace.

  • Added slideMA for creating a moving average for a period before or after each time point for a given variable.

  • Message improvements to shift.

  • Added NaVar for creating new variable(s) indicating if there are missing values in other variable(s).

  • Minor internal improvements to rmExcept.

  • Added TimeFill for creating a continuous Unit-Time-Dummy data frame from a data frame with Unit-Start-End times.

  • Added message option to DropNA. Allows the user to turn off the drop message.

  • Added VarDrop for dropping one or more variables from a data frame.

  • Added dMerge, which merges two data frames and reports, drops, or keeps only duplicates.

  • Improved messages for slide.

  • Added exact, ignore.case, and fixed arguments to MoveFront for more flexible variable name matching. Thanks to Felix Haas for the suggestion.

  • Minor documentation improvements.

  • Added PercChange for calculating the percentage change from a specified lag, including within groups.

  • Added exact argument to FindReplace to only replace exact pattern matches. Also added argument vector to return a vector of the replacement variable rather than the whole data frame.

  • Improved warnings for FindReplace.

  • Added FindReplace function to replace multiple patterns found in a character string column of a data frame.

  • Minor documentation improvements.

  • Added grepl.sub function to subset a data frame if a specified pattern is found in a character string.

  • Added the utility rmExcept for removing all objects from a workspace except for those specified by the user.

  • Speed improvements made to slide. Thanks to briatte.

  • FillIn now allows you to drop variables from D2 in the resulting data set.

  • MoveFront can now move more than one variable to the front of a data frame.

  • Minor bug fixes in FillIn.

  • Bug fix for slide argument NewVar.

  • Documentation improvements.

  • Added slide and shift functions for lagging and leading variables. shift is largely based on TszKin Julian's shift function: http://ctszkin.com/2012/03/11/generating-a-laglead-variables/.

  • Added DropNA to drop observations in a data frame with missing values for given variables.

  • Stop message for MoveFront when a variable does not exist. Other minor documentation changes.

  • Minor update to FillIn for data.table 1.8.8.

  • First version including the FillIn and MoveFront commands.

install.packages("DataCombine")

0.2.21 by Christopher Gandrud, a year ago


http://CRAN.R-project.org/package=DataCombine


Browse source code at https://github.com/cran/DataCombine


Authors: Christopher Gandrud [aut, cre]


GPL (>= 3) license


Imports data.table, dplyr

Suggests devtools, testthat


Imported by psData.

Suggested by dynsim.

