A collection of machine learning helper functions, particularly assisting in the Exploratory Data Analysis phase. Makes heavy use of the 'data.table' package for optimal speed and memory efficiency. Highlights include a versatile bin_data() function, sparsify() for converting a data.table to sparse matrix format with one-hot encoding, fast evaluation metrics, and empirical_cdf() for calculating empirical Multivariate Cumulative Distribution Functions.
Exploratory and diagnostic machine learning tools for R
The goal of this package is multifold:
install.packages("mltools")
install.packages("devtools")
devtools::install_github("ben519/mltools")
Predict whether or not someone is an alien.
library(data.table)
library(mltools)

# Copy the toy datasets since they are locked from being modified
train <- copy(alien.train)
test <- copy(alien.test)

train
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien
1:     green     300 type1 type1  type4    TRUE
2:     white      95 type1 type2  type4   FALSE
3:     brown     105 type2 type6 type11   FALSE
4:     white     250 type4 type5  type2    TRUE
5:      blue     115 type2 type7 type11    TRUE
6:     white      85 type4 type5  type2   FALSE
7:     green     130 type1 type2  type4    TRUE
8:     white     115 type1 type1  type4   FALSE

test
   SkinColor IQScore  Cat1  Cat2  Cat3
1:     white      79 type4 type5 type2
2:     green     100 type4 type5 type2
3:     brown     125 type3 type9 type7
4:     white      90 type1 type8 type4
5:       red     115 type1 type2 type4
# Combine train (excluding IsAlien) and test
alien.all <- rbind(train[, !"IsAlien", with=FALSE], test)

#--------------------------------------------------
## Check for correlated and hierarchical fields

gini_impurities(alien.all, wide=TRUE)  # weighted conditional gini impurities
        Var1      Cat1      Cat2      Cat3 SkinColor
1:      Cat1 0.0000000 0.3589744 0.0000000 0.4743590
2:      Cat2 0.0000000 0.0000000 0.0000000 0.3461538
3:      Cat3 0.0000000 0.3589744 0.0000000 0.4743590
4: SkinColor 0.4102564 0.5384615 0.4102564 0.0000000

# (Cat1, Cat3) = (Cat3, Cat1) = 0 => Cat1 and Cat3 perfectly correspond to each other
# (Cat1, Cat2) > 0 and (Cat2, Cat1) = 0 => Cat1-Cat2 exhibit a parent-child relationship.
# You can guess Cat1 by knowing Cat2, but not vice-versa.

#--------------------------------------------------
## Check relationship between IQScore and IsAlien by binning IQScore into groups

train[, BinIQScore := bin_data(IQScore, bins=seq(0, 300, by=50))]
   IQScore BinIQScore
1:     300 [250, 300]
2:      95  [50, 100)
3:     105 [100, 150)
4:     250 [250, 300]
5:     115 [100, 150)
6:      85  [50, 100)
7:     130 [100, 150)
8:     115 [100, 150)

train[, list(Samples=.N, IQScore=mean(IQScore)), keyby=BinIQScore]
   BinIQScore Samples IQScore
1:  [50, 100)       2   90.00
2: [100, 150)       4  116.25
3: [250, 300]       2  275.00

# Remove column BinIQScore
train[, BinIQScore := NULL]

#--------------------------------------------------
## Check skewness of fields

skewness(alien.all)
$SkinColor
   SkinColor Count       Pcnt
1:     white     6 0.46153846
2:     green     3 0.23076923
3:     brown     2 0.15384615
4:      blue     1 0.07692308
5:       red     1 0.07692308

$Cat1
    Cat1 Count       Pcnt
1: type1     6 0.46153846
2: type4     4 0.30769231
3: type2     2 0.15384615
4: type3     1 0.07692308
...
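To make the gini_impurities() table above easier to read, here is a minimal base R sketch that recomputes two of its entries by hand. It is not the package's implementation. The helpers gini() and weighted_gini() are hypothetical names introduced for illustration, and the Cat1/Cat2 vectors are typed in from the 13 rows of alien.all shown above. The assumption is that the entry in row Var1 and column Var2 is the Gini impurity of Var2 within each group of Var1, weighted by group size.

```r
# Cat1 and Cat2 for the 13 rows of alien.all (8 train rows, then 5 test rows)
cat1 <- c("type1", "type1", "type2", "type4", "type2", "type4", "type1", "type1",
          "type4", "type4", "type3", "type1", "type1")
cat2 <- c("type1", "type2", "type6", "type5", "type7", "type5", "type2", "type1",
          "type5", "type5", "type9", "type8", "type2")

# Gini impurity of a single categorical vector (hypothetical helper)
gini <- function(x) {
  p <- as.numeric(table(x)) / length(x)
  1 - sum(p^2)
}

# Impurity of `target` within each level of `group`, weighted by group size
# (hypothetical helper, not part of mltools)
weighted_gini <- function(group, target) {
  per_group <- tapply(target, group, gini)    # impurity inside each group
  sizes     <- tapply(target, group, length)  # number of rows in each group
  sum(as.numeric(per_group) * as.numeric(sizes)) / length(target)
}

gini_cat2_given_cat1 <- weighted_gini(cat1, cat2)  # row Cat1, column Cat2
gini_cat1_given_cat2 <- weighted_gini(cat2, cat1)  # row Cat2, column Cat1

print(gini_cat2_given_cat1)  # 14/39, about 0.3589744
print(gini_cat1_given_cat2)  # 0
```

A value of 0 means every group contains a single target value, which is why knowing Cat2 is enough to determine Cat1 in this dataset while the reverse does not hold.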
set.seed(711)

#--------------------------------------------------
## Set SkinColor as a factor, such that it has the same levels in train and test
## Set low frequency skin colors (1 or fewer occurrences) as "_other_"

skincolors <- list(train$SkinColor, test$SkinColor)
skincolors <- set_factor(skincolors, aggregationThreshold=1)
train[, SkinColor := skincolors[[1]] ]  # update train with the new values
test[, SkinColor := skincolors[[2]] ]  # update test with the new values

# Repeat the process above for other categorical fields (without setting low freq. values as "_other_")
for(col in c("Cat1", "Cat2", "Cat3")){
  vals <- list(train[[col]], test[[col]])
  vals <- set_factor(vals)
  set(train, j=col, value=vals[[1]])
  set(test, j=col, value=vals[[2]])
}

#--------------------------------------------------
## Randomly split the training data into 2 equally sized datasets

# Partition train into two folds, stratified by IsAlien
train[, FoldID := folds(IsAlien, nfolds=2, stratified=TRUE, seed=2016)]

cvtrain <- train[FoldID==1, !"FoldID"]
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien
1:     green     300 type1 type1  type4    TRUE
2:     brown     105 type2 type6 type11   FALSE
3:     green     130 type1 type2  type4    TRUE
4:     white     115 type1 type1  type4   FALSE

cvtest <- train[FoldID==2, !"FoldID"]
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien
1:     white      95 type1 type2  type4   FALSE
2:     white     250 type4 type5  type2    TRUE
3:   _other_     115 type2 type7 type11    TRUE
4:     white      85 type4 type5  type2   FALSE

#--------------------------------------------------
## Convert cvtrain and cvtest to sparse matrices
## Note that unordered factors are one-hot-encoded

library(Matrix)

cvtrain.sparse <- sparsify(cvtrain)
4 x 21 sparse Matrix of class "dgCMatrix"
     SkinColor__other_ SkinColor_brown SkinColor_green SkinColor_white IQScore Cat1_type1 ...
[1,]                 .               .               1               .     300          1
[2,]                 .               1               .               .     105          .
[3,]                 .               .               1               .     130          1
[4,]                 .               .               .               1     115          1

cvtest.sparse <- sparsify(cvtest)
4 x 21 sparse Matrix of class "dgCMatrix"
     SkinColor__other_ SkinColor_brown SkinColor_green SkinColor_white IQScore Cat1_type1 ...
[1,]                 .               .               .               1      95          1
[2,]                 .               .               .               1     250          .
[3,]                 1               .               .               .     115          .
[4,]                 .               .               .               1      85          .
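The reason the factor levels are aligned across train and test before calling sparsify() is that one-hot encoding creates one column per level, so both matrices only share a column layout if they share levels. The following is a minimal sketch of that idea using the Matrix package directly. It does not show how sparsify() works internally, and one_hot_sparse() is a hypothetical helper introduced here. The SkinColor values are copied from cvtrain and cvtest above.

```r
library(Matrix)

# SkinColor values copied from cvtrain and cvtest above
train_col <- c("green", "brown", "green", "white")
test_col  <- c("white", "white", "_other_", "white")

# Levels shared by both datasets, so the encoded matrices get identical columns
lvls <- c("_other_", "brown", "green", "white")

# One-hot encode a character vector into a sparse matrix
# (hypothetical helper, not part of mltools)
one_hot_sparse <- function(x, lvls, prefix) {
  f <- factor(x, levels = lvls)
  sparseMatrix(
    i = seq_along(f),          # one row per observation
    j = as.integer(f),         # column index is the level code
    x = rep(1, length(f)),     # a single 1 per row
    dims = c(length(f), length(lvls)),
    dimnames = list(NULL, paste0(prefix, "_", lvls))
  )
}

train_sparse <- one_hot_sparse(train_col, lvls, "SkinColor")
test_sparse  <- one_hot_sparse(test_col, lvls, "SkinColor")

print(train_sparse)
print(test_sparse)
```

Because both calls use the same lvls vector, a level that is absent from one dataset (for example "_other_" in the training fold) still gets an all-zero column there, which keeps the two matrices compatible for model fitting and prediction.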
#--------------------------------------------------
## Naive model that guesses someone is an alien if their IQScore is > 130

cvtest[, Prediction := ifelse(IQScore > 130, TRUE, FALSE)]

#--------------------------------------------------
## Evaluate predictions

# Area Under the ROC Curve (AUC ROC)
auc_roc(preds=cvtest$Prediction, actuals=cvtest$IsAlien)
0.75

# Individual scores to determine which predictions were good/bad (see help(roc_scores) for details)
cvtest[, ROCScore := roc_scores(preds=Prediction, actuals=IsAlien)]
cvtest[order(ROCScore)]
   SkinColor IQScore  Cat1  Cat2   Cat3 IsAlien Prediction  ROCScore
1:     white      95 type1 type2  type4   FALSE      FALSE 0.0000000
2:     white     250 type4 type5  type2    TRUE       TRUE 0.0000000
3:     white      85 type4 type5  type2   FALSE      FALSE 0.0000000
4:   _other_     115 type2 type7 type11    TRUE      FALSE 0.1666667
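As a sanity check on the 0.75 reported above, the AUC can be recomputed in base R with the rank-based (Mann-Whitney) formula, where tied predictions receive average ranks. This is a sketch for illustration and not necessarily how auc_roc() is implemented. auc_by_ranks() is a hypothetical helper, and the preds and actuals vectors are copied from the cvtest rows above.

```r
# Predictions and labels for the four cvtest rows (IQScore 95, 250, 115, 85)
preds   <- c(FALSE, TRUE, FALSE, FALSE)
actuals <- c(FALSE, TRUE, TRUE, FALSE)

# AUC as the probability that a random positive outranks a random negative,
# counting ties as one half (hypothetical helper, not part of mltools)
auc_by_ranks <- function(preds, actuals) {
  r     <- rank(as.numeric(preds))  # ties receive their average rank
  n_pos <- sum(actuals)
  n_neg <- sum(!actuals)
  (sum(r[actuals]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_by_ranks(preds, actuals)
print(auc)  # 0.75
```

Of the four positive-negative pairs, two are ranked correctly and two are ties, giving (2 + 0.5 * 2) / 4 = 0.75. The ties come from the alien with IQScore 115, the same row that gets the only nonzero ROCScore.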
If you'd like to contact me regarding bugs, questions, or general consulting, feel free to drop me a line - [email protected]
date_factor(dateVec, ...): fixed bug in "character string is not in a standard unambiguous format" produced by some date values
folds(x, ...): x can now be a positive integer specifying the number of fold IDs to generate
date_factor(dateVec, ...): the argument fullyears has been dropped and replaced by the more flexible pair of arguments minDate and maxDate for determining resulting vector levels. Additionally, a bug regarding type=yearquarters has been fixed.
bin_data(): added some input validation
bin_data() occurring when x is integer type and bins includes Inf or -Inf
date_factor() occurring when type = "yearquarter" and fullyears = FALSE
empirical_cdf()
date_factor() for converting dates to a grouped ordered factor (e.g. months, yearmonths, yearquarters, etc.)
x in folds(x, stratified=TRUE, ...)
Initial Release