This package contains various tools for working with and evaluating cross-validated area under the ROC curve (AUC) estimators. The primary functions of the package are `ci.cvAUC` and `ci.pooled.cvAUC`, which report cross-validated AUC and compute confidence intervals for cross-validated AUC estimates based on influence curves for i.i.d. and pooled repeated measures data, respectively. One benefit to using influence curve based confidence intervals is that they require much less computation time than bootstrapping methods. The utility functions, `AUC` and `cvAUC`, are simple wrappers for functions from the ROCR package.
The `cvAUC` R package provides a computationally efficient means of estimating confidence intervals (or variance) of cross-validated Area Under the ROC Curve (AUC) estimates.
In binary classification problems, the AUC is commonly used to evaluate the performance of a prediction model. Often, it is combined with cross-validation in order to assess how the results will generalize to an independent data set. In order to evaluate the quality of an estimate for cross-validated AUC, we obtain an estimate of its variance.
For massive data sets, the process of generating a single performance estimate can be computationally expensive. Additionally, when using a complex prediction method, the process of cross-validating a predictive model on even a relatively small data set can still require a large amount of computation time. Thus, in many practical settings, the bootstrap is a computationally intractable approach to variance estimation. As an alternative to the bootstrap, a computationally efficient influence curve based approach to obtaining a variance estimate for cross-validated AUC can be used.
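To make the computational contrast concrete, here is a minimal sketch of the one-pass influence curve interval on simulated data. The data-generating choices below (sample size, score distribution, fold assignment) are illustrative assumptions, not part of the package:

```r
library(cvAUC)

set.seed(1)
n <- 1000
labels <- rbinom(n, size = 1, prob = 0.5)
# Scores that are informative but noisy (illustrative only)
predictions <- labels + rnorm(n)
folds <- sample(rep(1:10, length.out = n))

# A single pass over the data yields the CV AUC, its SE, and a 95% CI;
# a bootstrap would instead repeat the entire cross-validation loop many times
res <- ci.cvAUC(predictions = predictions, labels = labels,
                folds = folds, confidence = 0.95)
res$cvAUC
res$ci
```

Because the variance estimate comes from the influence curve of the already-computed cross-validated predictions, no model refitting is required.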
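The `AUC` and `cvAUC` utility wrappers can be exercised directly; here is a minimal sketch on simulated scores (all values below are made up for demonstration):

```r
library(cvAUC)

set.seed(1)
labels <- rbinom(500, size = 1, prob = 0.5)
# Higher scores for positive cases (illustrative only)
predictions <- labels + rnorm(500)

# Simple AUC, no cross-validation
auc <- AUC(predictions = predictions, labels = labels)

# Fold-wise and averaged cross-validated AUC, given a fold assignment
folds <- sample(rep(1:10, length.out = 500))
cv <- cvAUC(predictions = predictions, labels = labels, folds = folds)
auc
cv$cvAUC
```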
Erin LeDell, Maya L. Petersen & Mark J. van der Laan, "Computationally Efficient Confidence Intervals for Cross-validated Area Under the ROC Curve Estimates." (In Review)
You can install:

- the latest released version from CRAN with

  ```r
  install.packages("cvAUC")
  ```

- the latest development version from GitHub with

  ```r
  if (packageVersion("devtools") < 1.6) {
    install.packages("devtools")
  }
  devtools::install_github("ledell/cvAUC")
  ```
Here is a quick demo of how you can use the package. In this example we do the following:

1. Create a set of cross-validation `folds` (stratified by outcome). Below, the function that creates the folds is called `.cvFolds`.
2. For each fold v, fit the model on the training folds ({1,...,10}\v) and then, using this saved fit, generate predicted values for the observations in the v^th validation fold. The `.doFit` function below does this procedure. In this example, we use the Random Forest algorithm.
3. Store the cross-validated predicted values in a single vector, `predictions`.
4. Use the `ci.cvAUC` function to calculate CV AUC and to generate a 95% confidence interval for this CV AUC estimate.

First, we define a few utility functions:
```r
.cvFolds <- function(Y, V) {
  # Create CV folds (stratify by outcome)
  Y0 <- split(sample(which(Y == 0)), rep(1:V, length = length(which(Y == 0))))
  Y1 <- split(sample(which(Y == 1)), rep(1:V, length = length(which(Y == 1))))
  folds <- vector("list", length = V)
  for (v in seq(V)) {
    folds[[v]] <- c(Y0[[v]], Y1[[v]])
  }
  return(folds)
}

.doFit <- function(v, folds, train) {
  # Train & test a model; return predicted values on test samples
  set.seed(v)
  ycol <- which(names(train) == y)
  params <- list(x = train[-folds[[v]], -ycol],
                 y = as.factor(train[-folds[[v]], ycol]),
                 xtest = train[folds[[v]], -ycol])
  fit <- do.call(randomForest, params)
  pred <- fit$test$votes[, 2]
  return(pred)
}
```
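As a quick sanity check on the fold helper above (a sketch; `.cvFolds` is the function just defined), the folds should each contain roughly the same proportion of positive labels:

```r
set.seed(1)
Y <- rbinom(100, size = 1, prob = 0.3)
folds <- .cvFolds(Y = Y, V = 10)

# Proportion of positives in each fold; stratification keeps these similar
sapply(folds, function(idx) mean(Y[idx]))
```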
This function will execute the example:
```r
iid_example <- function(train, y = "V1", V = 10, seed = 1) {
  # Create folds
  set.seed(seed)
  folds <- .cvFolds(Y = train[, c(y)], V = V)
  # Generate CV predicted values
  cl <- makeCluster(detectCores())
  registerDoParallel(cl)
  # .export makes y visible to .doFit on the worker processes
  predictions <- foreach(v = 1:V, .combine = "c",
                         .packages = c("randomForest"),
                         .export = "y") %dopar% .doFit(v, folds, train)
  stopCluster(cl)
  predictions[unlist(folds)] <- predictions  # Re-order into original row order
  # Get CV AUC and 95% confidence interval
  runtime <- system.time(res <- ci.cvAUC(predictions = predictions,
                                         labels = train[, c(y)],
                                         folds = folds,
                                         confidence = 0.95))
  print(runtime)
  return(res)
}
```
Load a sample binary outcome training set into R:
```r
train_csv <- "http://www.stat.berkeley.edu/~ledell/data/higgs_10k.csv"
train <- read.table(train_csv, sep = ",")
```
Run the example:
```r
library(randomForest)
library(doParallel)
library(cvAUC)

res <- iid_example(train = train, y = "V1", V = 10, seed = 1)
print(res)
# $cvAUC
# [1] 0.7813759
#
# $se
# [1] 0.004534395
#
# $ci
# [1] 0.7724886 0.7902631
#
# $confidence
# [1] 0.95
```
For the example above (10,000 observations), it took ~0.2 seconds to calculate the cross-validated AUC and the influence curve based confidence interval. This was benchmarked on a 2.3 GHz Intel Core i7 processor using `cvAUC` package version 1.1.0.
For bigger (i.i.d.) training sets, here are a few rough benchmarks:
- Now converts `labels` or `predictions` to a vector if it is a 1-column data.frame. Otherwise, it would fail the check that `length(unique(labels)) == 2` and `length(predictions) == length(labels)`.
- Re-wrote the `ci.cvAUC` and `ci.pooled.cvAUC` functions to improve runtime performance by many orders of magnitude. The new implementation uses the `data.table` package and relies on `data.table` sorting.
- Known issue: `R CMD CHECK` will produce a note mentioning "No visible binding for global variable..." for several lines in the `data.table`-related code. This is nothing to worry about. Read more here: https://stackoverflow.com/questions/8096313/no-visible-binding-for-global-variable-note-in-r-cmd-check
- Added a `README.md` file for GitHub. This package is now on GitHub at: https://github.com/ledell/cvAUC
- Updated the `AUC` function to be able to use the `label.ordering` argument, similar to `cvAUC`.
- Updated the `ci.cvAUC` and `ci.pooled.cvAUC` functions, as well as `.process_input`.
- Added the `AUC` utility function for simple AUC calculation (no cross-validation).
- Added an example of the `ci.pooled.cvAUC` function to documentation.
- Removed the `covProb.sim` simulation function. This simulation should no longer be used.
- Added the `ROCR` package name to the `ROCR::performance` and `ROCR::prediction` functions inside the `AUC` and `cvAUC` functions.
- Removed `require(ROCR)` from functions since ROCR is a required dependency.