A general framework for constructing partial dependence (i.e., marginal effect) plots from various types of machine learning models in R.

Complex nonparametric models---like neural networks, random forests, and support vector machines---are more common than ever in predictive analytics, especially when dealing with large observational databases that don't adhere to the strict assumptions imposed by traditional statistical techniques (e.g., multiple linear regression, which assumes linearity, homoscedasticity, and normality). Unfortunately, it can be challenging to understand the results of such models and explain them to management. Partial dependence plots offer a simple solution: they are low-dimensional graphical renderings of the prediction function $\widehat{f}\left(\boldsymbol{x}\right)$ that make the relationship between the outcome and the predictors of interest easier to understand. These plots are especially useful for explaining the output of black box models. The `pdp` package offers a general framework for constructing partial dependence plots for various types of fitted models in R.

A detailed introduction to `pdp` has been published in The R Journal: "pdp: An R Package for Constructing Partial Dependence Plots", https://journal.r-project.org/archive/2017/RJ-2017-016/index.html. You can track development at https://github.com/bgreenwell/pdp. To report bugs or issues, contact the main author directly or submit them to https://github.com/bgreenwell/pdp/issues.

As of right now, `pdp` exports four functions:

- `partial` - compute partial dependence functions (i.e., objects of class `"partial"`) from various fitted model objects;
- `plotPartial` - plot partial dependence functions (i.e., objects of class `"partial"`) using `lattice` graphics;
- `autoplot` - plot partial dependence functions (i.e., objects of class `"partial"`) using `ggplot2` graphics;
- `topPredictors` - extract the most "important" predictors from various types of fitted models.

The `pdp` package is currently listed on CRAN and can easily be installed:

```r
install.packages("pdp")

# Alternatively, install the development version from GitHub
devtools::install_github("bgreenwell/pdp")
```

As a first example, we'll fit a random forest to the famous Boston housing data included with the package (see `?boston` for details). In fact, the original motivation for this package was to be able to compute two-predictor partial dependence plots from random forest models in R.

```r
# Fit a random forest to the Boston housing data
library(randomForest)  # install.packages("randomForest")
library(gridExtra)     # for grid.arrange()
data(boston)   # load the Boston housing data
set.seed(101)  # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston)

# Partial dependence of cmedv on lstat and rm
library(pdp)
pd <- partial(boston.rf, pred.var = c("lstat", "rm"), chull = TRUE)
head(pd)  # print first 6 rows
#>     lstat      rm     yhat
#> 1  7.5284 3.66538 24.13683
#> 2  8.2532 3.66538 23.24916
#> 3  8.9780 3.66538 23.13119
#> 4  9.7028 3.66538 22.13531
#> 5 10.4276 3.66538 20.62331
#> 6 11.1524 3.66538 20.51258

# Lattice version
p1 <- plotPartial(pd, main = "lattice version")

# ggplot2 version
library(ggplot2)
p2 <- autoplot(pd, contour = TRUE, main = "ggplot2 version",
               legend.title = "Partial\ndependence")

# Show both plots in one figure
grid.arrange(p1, p2, ncol = 2)
```
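`partial` can also produce individual conditional expectation (ICE) curves and centered ICE (c-ICE) curves via its `ice` and `center` arguments, which show how the prediction for each individual observation changes as a predictor varies. A minimal sketch, continuing with the `boston.rf` model fit above (the `alpha` transparency setting is just a plotting suggestion):

```r
# ICE and centered ICE (c-ICE) curves of cmedv on lstat, one curve per
# observation; center = TRUE pins every curve at zero on the left edge
ice.lstat <- partial(boston.rf, pred.var = "lstat", ice = TRUE)
c.ice.lstat <- partial(boston.rf, pred.var = "lstat", ice = TRUE,
                       center = TRUE)

# Show both plots in one figure (gridExtra and ggplot2 were loaded above)
grid.arrange(plotPartial(ice.lstat, alpha = 0.3),
             autoplot(c.ice.lstat, alpha = 0.3),
             ncol = 2)
```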

As a second example, we'll fit a classification model---a support vector machine (SVM)---to the Pima Indians diabetes data included with the package (see `?pima` for details). Note that for some fitted model objects (e.g., `"ksvm"` objects) it is necessary to supply the original training data via the `train` argument in the call to `partial`.

```r
# Fit an SVM to the Pima Indians diabetes data
library(kernlab)  # install.packages("kernlab")
data(pima)  # load the Pima Indians diabetes data
pima.svm <- ksvm(diabetes ~ ., data = pima, type = "C-svc", kernel = "rbfdot",
                 C = 0.5, prob.model = TRUE)

# Partial dependence of diabetes test result on glucose (default is logit scale)
pd.glucose <- partial(pima.svm, pred.var = "glucose", train = pima)

# Partial dependence of diabetes test result on glucose (probability scale)
pd.glucose.prob <- partial(pima.svm, pred.var = "glucose", prob = TRUE,
                           train = pima)

# Show both plots in one figure
grid.arrange(autoplot(pd.glucose, main = "Logit scale"),
             autoplot(pd.glucose.prob, main = "Probability scale"),
             ncol = 2)
```
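The fourth exported function, `topPredictors`, has no example above. A sketch of how it might be combined with `partial` (reusing the `boston.rf` random forest from the first example; the choice of `n = 4` and the plotting layout are just illustrations):

```r
# Extract the names of the four most "important" predictors from the random
# forest fit earlier, then plot the partial dependence of cmedv on each
top4 <- topPredictors(boston.rf, n = 4)
pds <- lapply(top4, function(x) partial(boston.rf, pred.var = x))
grid.arrange(grobs = lapply(pds, autoplot), ncol = 2)
```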

- Properly registered native routines and disabled symbol search.
- Fixed a bug for `gbm` models using the multinomial distribution.
- Refactored code to improve structure.
- `partial` gained three new options: `inv.link` (experimental), `ice`, and `center`. The latter two have to do with constructing individual conditional expectation (ICE) curves and centered ICE (c-ICE) curves. The `inv.link` option is for transforming predictions from models that can use non-Gaussian distributions (e.g., `glm`, `gbm`, and `xgboost`). Note that these options were added for convenience and the same results (plus much more) can still be obtained using the flexible `pred.fun` argument. (#36)
- `plotPartial` gained five new options: `center`, `plot.pdp`, `pdp.col`, `pdp.lwd`, and `pdp.lty`; see `?plotPartial` for details.
- Fixed the default y-axis label for `autoplot` with two numeric predictors. (#48)
- Added a `CITATION` file.
- Better support for neural networks from the `nnet` package.
- Fixed a bug for `nnet::multinom` models with a binary response.

- Fixed minor pandoc conversion issue with `README.md`.
- Added subdirectory called `tools` to hold figures for `README.md`.

- Registered native routines and disabled symbol search.

- Added support for `MASS::lda`, `MASS::qda`, and `mda::mars`.
- New arguments `quantiles`, `probs`, and `trim.outliers` in `partial`. These arguments make it easier to construct PDPs over the relevant range of a numeric predictor without having to specify `pred.grid`, especially when outliers are present in the predictors (which can distort the plotted relationship).
- The `train` argument can now accept matrices; in particular, objects of class `"matrix"` or `"dgCMatrix"`. This is useful, for example, when working with XGBoost models (i.e., objects of class `"xgb.Booster"`).
- New logical argument `prob` indicating whether or not partial dependence values for classification problems should be returned on the original probability scale, rather than the centered logit; details for the centered logit can be found on page 370 of the second edition of *The Elements of Statistical Learning*.
- Fixed some typos in `NEWS.md`.
- New function `autoplot` for automatically creating `ggplot2` graphics from `"partial"` objects.
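For example, the `quantiles`/`probs` and `trim.outliers` options described above might be used along these lines (a sketch only, assuming a `boston.rf` random forest like the one in the README examples):

```r
# Evaluate the partial dependence of cmedv on lstat only over lstat values
# within +/- 1.5 IQRs of the middle 50% of the data, ignoring outliers
pd.trim <- partial(boston.rf, pred.var = "lstat", trim.outliers = TRUE)

# Alternatively, evaluate lstat at its 5th through 95th sample percentiles
pd.quant <- partial(boston.rf, pred.var = "lstat", quantiles = TRUE,
                    probs = seq(0.05, 0.95, by = 0.05))
```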

- `partial` is now much faster with `"gbm"` objects due to a call to `gbm::plot.gbm` whenever `pred.grid` is not explicitly given by the user. (`gbm::plot.gbm` exploits a computational shortcut that does not involve any passes over the training data.)
- New (experimental) function `topPredictors` for extracting the names of the most "important" predictors. This should make it one step easier (in most cases) to construct PDPs for the most "important" features in a fitted model.
- A new argument, `pred.fun`, allows the user to supply their own prediction function. Hence, it is possible to obtain PDPs based on the median, rather than the mean. It is also possible to obtain PDPs for classification problems on the probability scale. See `?partial` for examples.
- Minor bug fixes and documentation tweaks.
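A sketch of the `pred.fun` argument in action, computing a median-based PDP (again assuming a `boston.rf` random forest like the one in the README examples):

```r
# pred.fun receives the fitted model and a copy of the training data with
# lstat fixed at each grid value; returning a single number per grid value
# yields a median-based (rather than mean-based) partial dependence function
med.fun <- function(object, newdata) {
  median(predict(object, newdata = newdata))
}
pd.med <- partial(boston.rf, pred.var = "lstat", pred.fun = med.fun)
```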

- The `...` argument in the call to `partial` now refers to additional arguments to be passed on to `stats::predict` rather than `plyr::aaply`. For example, using `partial` with `"gbm"` objects will require specification of `n.trees`, which can now simply be passed to `partial` via the `...` argument.
- Added the following arguments to `partial`: `progress` (`plyr`-based progress bars), `parallel` (`plyr`/`foreach`-based parallel execution), and `paropts` (list of additional arguments passed on to `foreach` when `parallel = TRUE`).
- Various bug fixes.
- `partial` now throws an informative error message when the `pred.grid` argument refers to predictors not in the original training data.
- The column name for the predicted value has been changed from `"y"` to `"yhat"`.

- `randomForest` is no longer imported.
- Added support for the `caret` package (i.e., objects of class `"train"`).
- Added example data sets: `boston` (corrected Boston housing data) and `pima` (corrected Pima Indians diabetes data).
- Fixed an error that sometimes occurred when `chull = TRUE`, causing the convex hull to not be computed.
- Refactored `plotPartial` to be more modular.
- Added `gbm` support for most non-`"binomial"` families.

- `randomForest` is now imported.
- Added examples.

- Fixed a non-canonical CRAN URL in the README file.

- `partial` now makes sure each column of `pred.grid` has the correct class, levels, etc.
- `partial` gained a new option, `levelplot`, which defaults to `TRUE`. The original option, `contour`, has changed and now specifies whether or not to add contour lines whenever `levelplot = TRUE`.

- Fixed a number of URLs.
- More thorough documentation.

- Fixed a couple of URLs and typos.
- Added more thorough documentation.
- Added support for C5.0, Cubist, nonlinear least squares, and XGBoost models.

- Initial release.