Task view: Multivariate Statistics

Last updated on 2018-07-21 by Paul Hewson

Base R contains most of the functionality needed for classical multivariate analysis. A large number of packages on CRAN extend this methodology; a brief overview is given below. Application-specific uses of multivariate statistics are described in the relevant task views: for example, whilst principal components are listed here, ordination is covered in the Environmetrics task view. Further information on supervised classification can be found in the MachineLearning task view, and on unsupervised classification in the Cluster task view.

The packages in this view can be roughly structured into the following topics. If you think that some package is missing from the list, please let me know.

Visualising multivariate data

  • Graphical Procedures: A range of base graphics functions (e.g. pairs() and coplot()) and lattice functions (e.g. xyplot() and splom()) are useful for visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities. scatterplotMatrix() in car provides enhanced pairwise scatterplots. Beyond this, scatterplot3d provides 3-dimensional scatterplots, and aplpack provides bagplots and spin3R(), a function for rotating 3d clouds. misc3d, which depends on rgl, provides animated functions within R useful for visualising densities. YaleToolkit provides a range of visualisation techniques for multivariate data. More specialised multivariate plots include the following: faces() in aplpack provides Chernoff's faces; parcoord() from MASS provides parallel coordinate plots; stars() in graphics provides a choice of star, radar and cobweb plots. mstree() in ade4 and spantree() in vegan provide minimum spanning tree functionality. calibrate supports biplot and scatterplot axis labelling. geometry, which provides an interface to the qhull library, gives the indices of the points on the convex hull via convhulln(). ellipse draws confidence ellipses for pairs of parameters and provides plotcorr(), a visual display of a correlation matrix. denpro provides level set trees for multivariate visualisation. Mosaic plots are available via mosaicplot() in graphics and mosaic() in vcd, which also contains other visualisation techniques for multivariate categorical data. gclus provides a number of cluster-specific graphical enhancements for scatterplots and parallel coordinate plots. See the links for a reference to GGobi. rggobi interfaces with GGobi. xgobi interfaces to the XGobi and XGvis programs, which allow linked, dynamic multivariate plots as well as projection pursuit. Finally, iplots allows particularly powerful dynamic interactive graphics, of which interactive parallel coordinate plots and mosaic plots may be of great interest. seriation provides seriation methods, which can reorder matrices and dendrograms.
  • Data Preprocessing: summarize() and summary.formula() in Hmisc assist with descriptive functions; from the same package varclus() offers variable clustering, while dataRep() and find.matches() assist in exploring a given dataset in terms of representativeness and finding matches. Whilst dist() in stats and daisy() in cluster provide a wide range of distance measures, proxy provides a framework for more distance measures, including measures between matrices. simba provides functions for dealing with presence/absence data, including similarity matrices and reshaping.
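
As a minimal sketch of the base tools named above, the following draws a scatterplot matrix with pairs() and a star plot with stars(), then computes pairwise distances with dist(). The built-in iris data is used purely for illustration and is not a dataset named in this view.

```r
# Illustrative only: iris is a built-in dataset chosen for this sketch.
x <- iris[, 1:4]

# Pairwise scatterplot matrix, coloured by species
pairs(x, col = as.integer(iris$Species))

# Star plot of the first 12 observations
stars(x[1:12, ])

# Euclidean distances between all pairs of observations
d <- dist(x, method = "euclidean")
n_pairs <- length(d)  # one entry per unordered pair of rows
```

The object returned by dist() stores only the lower triangle, so it can be passed directly to functions such as hclust() or cmdscale().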

Hypothesis testing

  • ICSNP provides Hotelling's T² test as well as a range of non-parametric tests, including location tests based on marginal ranks, computation of the spatial median and spatial signs, and estimates of shape. Non-parametric two-sample tests are also available from cramer, and spatial sign and rank tests to investigate location, sphericity and independence are available in SpatialNP.
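
To make the classical test concrete, here is a sketch of the two-sample Hotelling's T² statistic computed from first principles in base R. The function name hotelling_t2() is hypothetical and written for this illustration; it is not the ICSNP interface, and the iris subsets are an arbitrary example.

```r
# Two-sample Hotelling's T^2 from first principles (base R only).
# hotelling_t2() is a hypothetical helper, not a function from any package.
hotelling_t2 <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n1 <- nrow(x); n2 <- nrow(y); p <- ncol(x)
  d <- colMeans(x) - colMeans(y)
  # Pooled covariance matrix (assumes equal covariances in both groups)
  S <- ((n1 - 1) * cov(x) + (n2 - 1) * cov(y)) / (n1 + n2 - 2)
  t2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(S, d))
  # Under normality, the scaled statistic follows F(p, n1 + n2 - p - 1)
  df <- c(p, n1 + n2 - p - 1)
  f <- t2 * df[2] / (p * (n1 + n2 - 2))
  list(T2 = t2, F = f, df = df,
       p.value = pf(f, df[1], df[2], lower.tail = FALSE))
}

setosa     <- iris[iris$Species == "setosa", 1:4]
versicolor <- iris[iris$Species == "versicolor", 1:4]
res <- hotelling_t2(setosa, versicolor)
```

With a single variable the statistic reduces to the square of the pooled two-sample t statistic, which gives a simple sanity check.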

Multivariate distributions

  • Descriptive measures: cov() and cor() in stats provide estimates of the covariance and correlation matrices respectively. ICSNP offers several descriptive measures such as spatial.median(), which provides an estimate of the spatial median, and further functions which provide estimates of scatter. Further robust methods are provided, such as cov.rob() in MASS, which provides robust estimates of the variance-covariance matrix by minimum volume ellipsoid, minimum covariance determinant or classical product-moment. covRobust provides robust covariance estimation via nearest neighbor variance estimation. robustbase provides robust covariance estimation via fast minimum covariance determinant with covMcd() and the orthogonalized pairwise estimate of Gnanadesikan-Kettenring via covOGK(). Scalable robust methods are provided within rrcov, also using fast minimum covariance determinant with CovMcd() as well as M-estimators with CovMest(). corpcor provides shrinkage estimation of large scale covariance and (partial) correlation matrices.
  • Densities (estimation and simulation): mvrnorm() in MASS simulates from the multivariate normal distribution. mvtnorm also provides simulation as well as probability and quantile functions for both the multivariate t and multivariate normal distributions, as well as density functions for the multivariate normal distribution. mnormt provides multivariate normal and multivariate t density and distribution functions as well as random number simulation. sn provides density, distribution and random number generation for the multivariate skew normal and skew t distributions. delt provides a range of functions for estimating multivariate densities by CART and greedy methods. Comprehensive information on mixtures is given in the Cluster view; some density estimates and random numbers are provided by rmvnorm.mixt() and dmvnorm.mixt() in ks, and mixture fitting is also provided within bayesm. Functions to simulate from the Wishart distribution are provided in a number of places, such as rwishart() in bayesm and rwish() in MCMCpack (the latter also has a density function dwish()). bkde2D() from KernSmooth and kde2d() from MASS provide binned and non-binned 2-dimensional kernel density estimation; ks also provides multivariate kernel smoothing, as do ash and GenKern. prim provides patient rule induction methods to attempt to find regions of high density in high dimensional multivariate data, and feature provides methods for determining feature significance in multivariate data (such as in relation to local modes).
  • Assessing normality: mvnormtest provides a multivariate extension to the Shapiro-Wilk test, and mvoutlier provides multivariate outlier detection based on robust methods. ICS provides tests for multi-normality. mvnorm.etest() in energy provides an assessment of normality based on E statistics (energy); in the same package k.sample() assesses a number of samples for equal distributions. Mauchly's test of sphericity, which tests whether a Wishart-distributed covariance matrix is proportional to a given matrix, is provided by mauchly.test() in stats.
  • Copulas: copula provides routines for a range of (elliptical and Archimedean) copulas including normal, t, Clayton, Frank and Gumbel. fgac provides generalised Archimedean copulas.
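
The following sketch ties together a few of the functions listed above: simulation with mvrnorm(), classical and robust covariance estimation with cov() and cov.rob(), and squared Mahalanobis distances from stats. The mean vector, covariance matrix, sample size and seed are arbitrary values chosen for illustration.

```r
library(MASS)  # recommended package, for mvrnorm() and cov.rob()

set.seed(1)  # arbitrary seed, for reproducibility only
mu    <- c(0, 0)
Sigma <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)

# Simulate 500 draws from a bivariate normal distribution
x <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)

# Classical estimate of the covariance matrix
S_classical <- cov(x)

# Robust estimate by minimum covariance determinant
S_robust <- cov.rob(x, method = "mcd")$cov

# Squared Mahalanobis distances, often used to flag potential outliers
d2 <- mahalanobis(x, center = colMeans(x), cov = S_classical)
```

For clean Gaussian data the two covariance estimates should be similar; they diverge when the sample contains outliers, which is the situation the robust estimators are designed for.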

Linear models

  • From stats, lm() (with a matrix specified as the dependent variable) offers multivariate linear models, anova.mlm() provides comparison of multivariate linear models, and manova() offers MANOVA. sn provides msn.mle() and mst.mle(), which fit multivariate skew normal and multivariate skew t models. pls provides partial least squares regression (PLSR) and principal component regression, ppls provides penalized partial least squares, and dr provides dimension reduction regression options such as "sir" (sliced inverse regression) and "save" (sliced average variance estimation). plsgenomics provides partial least squares analyses for genomics. relaimpo provides functions to investigate the relative importance of regression parameters.
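
A short sketch of the stats functions mentioned above: a matrix response in lm() yields a multivariate linear model, anova() compares nested fits, and manova() gives the equivalent MANOVA. The choice of iris variables is an assumption made only for this example.

```r
# Multivariate linear model: two responses regressed on Species
fit <- lm(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
fit_class <- class(fit)   # the fit carries class "mlm" in addition to "lm"
B <- coef(fit)            # one column of coefficients per response

# Intercept-only model for comparison; anova() dispatches to anova.mlm()
fit0 <- lm(cbind(Sepal.Length, Petal.Length) ~ 1, data = iris)
cmp <- anova(fit0, fit)

# The equivalent MANOVA, reporting Wilks' lambda
man <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
man_summary <- summary(man, test = "Wilks")
```

Other test statistics ("Pillai", "Hotelling-Lawley", "Roy") can be requested through the same test argument.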

Projection methods

  • Principal components: these can be fitted with prcomp() (based on svd(), preferred) as well as princomp() (based on eigen() for compatibility with S-PLUS) from stats. pc1() in Hmisc provides the first principal component and gives coefficients for unscaled data. Additional support for an assessment of the scree plot can be found in nFactors, whereas paran provides routines for Horn's evaluation of the number of dimensions to retain. For wide matrices, gmodels provides fast.prcomp() and fast.svd(). kernlab uses kernel methods to provide a form of non-linear principal components with kpca(). pcaPP provides robust principal components by means of projection pursuit. amap provides further robust and parallelised methods, such as a form of generalised and robust principal component analysis via acpgen() and acprob() respectively. Further options for principal components in an ecological setting are available within ade4 and in a sensory setting in SensoMineR. psy provides a variety of routines useful in psychometry; in this context these include sphpca(), which maps onto a sphere, fpca(), where some variables may be considered as dependent, and scree.plot(), which has the option of adding simulation results to help assess the observed data. PTAk provides principal tensor analysis analogous to both PCA and correspondence analysis. smatr provides standardised major axis estimation with specific application to allometry.
  • Canonical Correlation: cancor() in stats provides canonical correlation. kernlab uses kernel methods to provide robust canonical correlation with kcca(). concor provides a number of concordance methods.
  • Redundancy Analysis: calibrate provides rda() for redundancy analysis as well as further options for canonical correlation. fso provides fuzzy set ordination, which extends ordination beyond methods available from linear algebra.
  • Independent Components: fastICA provides fastICA algorithms to perform independent component analysis (ICA) and Projection Pursuit, and PearsonICA uses score functions. ICS provides either an invariant co-ordinate system or independent components. JADE adds an interface to the JADE algorithm, as well as providing some diagnostics for ICA.
  • Procrustes analysis: procrustes() in vegan provides Procrustes analysis. This package also provides functions for ordination; further information on that area is given in the Environmetrics task view. Generalised Procrustes analysis via GPA() is available from FactoMineR.
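
As a minimal sketch of the stats functions in this section, the following fits principal components with prcomp() and canonical correlations with cancor(). The iris data and the split into sepal and petal variables are assumptions chosen for illustration.

```r
x <- iris[, 1:4]

# Principal components on centred, standardised variables (SVD-based)
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Scores on the first two components, e.g. for plotting
scores <- pca$x[, 1:2]
plot(scores, col = as.integer(iris$Species))

# Canonical correlations between sepal and petal measurements
cc <- cancor(iris[, 1:2], iris[, 3:4])
canonical_cor <- cc$cor
```

Setting scale. = TRUE analyses the correlation matrix rather than the covariance matrix, which is usually preferable when variables are measured on different scales.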

Principal coordinates / scaling methods

  • cmdscale() in stats provides classical multidimensional scaling (principal coordinates analysis); sammon() and isoMDS() in MASS offer Sammon's non-linear mapping and Kruskal's non-metric multidimensional scaling respectively. vegan provides wrappers and post-processing for non-metric MDS. indscal() is provided by SensoMineR.
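
A brief sketch of classical multidimensional scaling with cmdscale(), using the built-in eurodist road distances as an illustrative input (the dataset is an assumption of this example, not part of the view).

```r
# eurodist: built-in road distances between 21 European cities
mds <- cmdscale(eurodist, k = 2, eig = TRUE)

# Two-dimensional configuration, one row per city
coords <- mds$points

# Plot the configuration with city labels
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords), cex = 0.7)

# Goodness-of-fit measures for the 2-dimensional solution
gof <- mds$GOF
```

The configuration is determined only up to rotation and reflection, so the map may appear flipped relative to a conventional orientation.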

Unsupervised classification

  • Cluster analysis: A comprehensive overview of clustering methods available within R is provided by the Cluster task view. Standard techniques include hierarchical clustering by hclust() and k-means clustering by kmeans() in stats. A range of established clustering and visualisation techniques is also available in cluster, some cluster validation routines are available in clv, and the Rand index can be computed from classAgreement() in e1071. Trimmed cluster analysis is available from trimcluster, cluster ensembles are available from clue, methods to assist with the choice of routines are available in clusterSim, and hybrid methodology is provided by hybridHclust. Distance measures (edist()) and hierarchical clustering (hclust.energy()) based on E-statistics are available in energy. Mahalanobis distance based clustering (for fixed points as well as clusterwise regression) is available from fpc. clustvarsel provides variable selection within model-based clustering. Fuzzy clustering is available within cluster as well as via the hopach (Hierarchical Ordered Partitioning and Collapsing Hybrid) algorithm. kohonen provides supervised and unsupervised SOMs for high dimensional spectra or patterns. clusterGeneration helps simulate clusters. The Environmetrics task view also gives a topic-related overview of some clustering techniques. Model-based clustering is available in mclust.
  • Tree methods: Full details on tree methods are given in the MachineLearning task view. Suffice it to say here that classification trees are sometimes considered within multivariate methods; rpart is the package most used for this purpose. party provides recursive partitioning. Classification and regression training is provided by caret. kknn provides k-nearest neighbour methods which can be used for regression as well as classification.
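
The two standard techniques from stats mentioned above can be sketched as follows. The number of clusters, the linkage method, the seed and the use of iris are all assumptions made for this illustration.

```r
# Standardise the variables so that each contributes equally to distances
x <- scale(iris[, 1:4])

# Hierarchical clustering with average linkage, cut into 3 groups
hc <- hclust(dist(x), method = "average")
groups_hc <- cutree(hc, k = 3)
plot(hc, labels = FALSE)

# k-means with several random starts to reduce sensitivity to initialisation
set.seed(42)  # arbitrary seed
km <- kmeans(x, centers = 3, nstart = 25)

# Cross-tabulate the two partitions
agreement <- table(hierarchical = groups_hc, kmeans = km$cluster)
```

Cluster labels are arbitrary, so agreement between the two partitions shows up as a table with one dominant cell per row rather than a strong diagonal.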

Supervised classification and discriminant analysis

  • lda() and qda() within MASS provide linear and quadratic discrimination respectively. mda provides mixture and flexible discriminant analysis with mda() and fda() as well as multivariate adaptive regression splines with mars() and adaptive spline backfitting with the bruto() function. Multivariate adaptive regression splines can also be found in earth. Package class provides k-nearest neighbours by knn(), and knncat provides k-nearest neighbours for categorical variables. SensoMineR provides FDA() for factorial discriminant analysis. A number of packages combine dimension reduction with classification. klaR includes variable selection and robustness against multicollinearity as well as a number of visualisation routines. superpc provides principal components for supervised classification, whereas gpls provides classification using generalised partial least squares. hddplot provides cross-validated linear discriminant calculations to determine the optimum number of features. ROCR provides a range of methods for assessing classifier performance. Further information on supervised classification can be found in the MachineLearning task view.
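
A minimal sketch of linear discriminant analysis with lda() from MASS, again using iris as an assumed example dataset.

```r
library(MASS)  # recommended package, for lda()

# Fit a linear discriminant model using all four measurements
fit <- lda(Species ~ ., data = iris)

# Predict on the training data (resubstitution)
pred <- predict(fit, iris)

# Confusion matrix and resubstitution accuracy
confusion <- table(observed = iris$Species, predicted = pred$class)
accuracy <- mean(pred$class == iris$Species)
```

Resubstitution accuracy is optimistic. Calling lda() with CV = TRUE returns leave-one-out cross-validated class assignments instead, which give a fairer estimate.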

Correspondence analysis

  • corresp() and mca() in MASS provide simple and multiple correspondence analysis respectively. ca also provides simple, multiple and joint correspondence analysis. ca() and mca() in ade4 provide correspondence and multiple correspondence analysis respectively, as well as adding homogeneous table analysis with hta(). Further functionality is also available within vegan, and co-correspondence analysis is available from cocorresp. FactoMineR provides CA() and MCA(), which also enable simple and multiple correspondence analysis as well as associated graphical routines. homals provides homogeneity analysis.
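
Simple correspondence analysis with corresp() can be sketched as below. The caith table (eye colour by hair colour), which ships with MASS, is used here as an assumed example.

```r
library(MASS)  # recommended package, for corresp() and the caith data

# caith: 4 x 5 cross-classification of eye colour by hair colour
ca_fit <- corresp(caith, nf = 2)

# Canonical correlations for the two retained dimensions
ca_cor <- ca_fit$cor

# Row and column scores
row_scores <- ca_fit$rscore
col_scores <- ca_fit$cscore

# Joint display of row and column categories
biplot(ca_fit)
```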

Missing data

  • mitools provides tools for multiple imputation, mice provides multivariate imputation by chained equations, mvnmle provides ML estimation for multivariate normal data with missing values, and mix provides multiple imputation for mixed categorical and continuous data. pan provides multiple imputation for missing panel data. VIM provides methods for the visualisation as well as imputation of missing data. aregImpute() and transcan() from Hmisc provide further imputation methods. monomvn deals with estimation models where the missing data pattern is monotone.

Latent variable approaches

  • factanal() in stats provides factor analysis by maximum likelihood; Bayesian factor analysis is provided for Gaussian, ordinal and mixed variables in MCMCpack. GPArotation offers GPA (gradient projection algorithm) factor rotation. sem fits linear structural equation models, ltm provides latent trait models under item response theory, and a range of extensions to Rasch models can be found in eRm. FactoMineR provides a wide range of factor analysis methods, including MFA() and HMFA() for multiple and hierarchical multiple factor analysis as well as FAMD() for factor analysis of mixed quantitative and qualitative data. tsfa provides factor analysis for time series. poLCA provides latent class and latent class regression models for a variety of outcome variables.
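
A small sketch of maximum likelihood factor analysis with factanal(), fitted from a covariance matrix. The built-in ability.cov data (six ability tests) and the choice of two factors are assumptions made for illustration.

```r
# ability.cov: built-in covariance matrix of six ability tests, with n.obs
fa <- factanal(factors = 2, covmat = ability.cov)

# Loadings after the default varimax rotation
L <- loadings(fa)

# Uniquenesses: the variance of each test not explained by the factors
u <- fa$uniquenesses

print(fa, cutoff = 0.3)  # hide small loadings for readability
```

The rotation argument accepts the name of another rotation function, for example "promax" from stats.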

Modelling non-Gaussian data

  • MNP provides Bayesian multinomial probit models, and polycor provides polychoric and tetrachoric correlation matrices. bayesm provides a range of models such as seemingly unrelated regression, multinomial logit/probit, multivariate probit and instrumental variables. VGAM provides Vector Generalised Linear and Additive Models and reduced rank regression.

Matrix manipulations

  • As a vector- and matrix-based language, base R ships with many powerful tools for doing matrix manipulations, which are complemented by the packages Matrix and SparseM. matrixcalc adds functions for matrix differential calculus. Some further sparse matrix functionality is also available from spam.
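
A few of the base R matrix tools alluded to above, sketched on a small simulated matrix. The dimensions and seed are arbitrary.

```r
set.seed(1)  # arbitrary seed
X <- matrix(rnorm(20), nrow = 5)

# Symmetric positive definite matrix built from X
A <- crossprod(X) + diag(4)   # crossprod(X) is t(X) %*% X

# Spectral decomposition and reconstruction
e <- eigen(A, symmetric = TRUE)
A_rebuilt <- e$vectors %*% diag(e$values) %*% t(e$vectors)

# Cholesky factor (upper triangular R with t(R) %*% R equal to A)
R <- chol(A)

# Inverse computed from the Cholesky factor
A_inv <- chol2inv(R)

# Singular value decomposition of the original matrix
s <- svd(X)
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
```

For solving linear systems, solve(A, b) is generally preferable to forming the inverse explicitly.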

Miscellaneous utilities

  • abind generalises cbind() and rbind() for arrays; mApply() in Hmisc generalises apply() for matrices and passes multiple functions. In addition to the functions listed earlier, sn provides operations such as marginalisation, affine transformations and graphics for the multivariate skew normal and skew t distributions. mAr provides for vector auto-regression. rm.boot() from Hmisc bootstraps repeated measures models. psy also provides a range of statistics based on Cohen's kappa, including weighted measures and agreement among more than 2 raters. cwhmisc contains a number of useful support functions, such as ellipse(), normalise() and various rotation functions. desirability provides functions for multivariate optimisation. geozoo provides plotting of geometric objects in GGobi.

Packages

abind — 1.4-5

Combine Multidimensional Arrays

ade4 — 1.7-13

Analysis of Ecological Data: Exploratory and Euclidean Methods in Environmental Sciences

amap — 0.8-17

Another Multidimensional Analysis Package

aplpack — 1.3.2

Another Plot Package: 'Bagplots', 'Iconplots', 'Summaryplots', Slider Functions and Others

ash — 1.0-15

David Scott's ASH Routines

bayesm — 3.1-2

Bayesian Inference for Marketing/Micro-Econometrics

calibrate — 1.7.2

Calibration of Scatterplot and Biplot Axes

ca — 0.71

Simple, Multiple and Joint Correspondence Analysis

car — 3.0-3

Companion to Applied Regression

caret — 6.0-84

Classification and Regression Training

class — 7.3-15

Functions for Classification

clue — 0.3-57

Cluster Ensembles

cluster — 2.1.0

"Finding Groups in Data": Cluster Analysis Extended Rousseeuw et al.

clusterGeneration — 1.3.4

Random Cluster Generation (with Specified Degree of Separation)

clusterSim — 0.47-4

Searching for Optimal Clustering Procedure for a Data Set

clustvarsel — 2.3.3

Variable Selection for Gaussian Model-Based Clustering

clv — 0.3-2.1

Cluster Validation Techniques

cocorresp — 0.4-0

Co-Correspondence Analysis Methods

concor — 1.0-0.1

Concordance

copula — 0.999-19.1

Multivariate Dependence with Copulas

corpcor — 1.6.9

Efficient Estimation of Covariance and (Partial) Correlation

covRobust — 1.1-3

Robust Covariance Estimation via Nearest Neighbor Cleaning

cramer — 0.9-3

Multivariate Nonparametric Cramer-Test for the Two-Sample-Problem

cwhmisc — 6.6

Miscellaneous Functions for Math, Plotting, Printing, Statistics, Strings, and Tools

delt — 0.8.2

Estimation of Multivariate Densities Using Adaptive Partitions

denpro — 0.9.2

Visualization of Multivariate Functions, Sets, and Data

desirability — 2.1

Function Optimization and Ranking via Desirability Functions

dr — 3.0.10

Methods for Dimension Reduction for Regression

e1071 — 1.7-2

Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien

earth — 5.1.1

Multivariate Adaptive Regression Splines

ellipse — 0.4.1

Functions for Drawing Ellipses and Ellipse-Like Confidence Regions

energy — 1.7-6

E-Statistics: Multivariate Inference via the Energy of Data

eRm — 1.0-0

Extended Rasch Modeling

FactoMineR — 1.42

Multivariate Exploratory Data Analysis and Data Mining

fastICA — 1.2-2

FastICA Algorithms to Perform ICA and Projection Pursuit

feature — 1.2.13

Local Inferential Feature Significance for Multivariate Kernel Density Estimation

fgac — 0.6-1

Generalized Archimedean Copula

fpc — 2.2-3

Flexible Procedures for Clustering

fso — 2.1-1

Fuzzy Set Ordination

gclus — 1.3.2

Clustering Graphics

GenKern — 1.2-60

Functions for generating and manipulating binned kernel density estimates

geometry — 0.4.2

Mesh Generation and Surface Tessellation

geozoo — 0.5.1

Zoo of Geometric Objects

gmodels — 2.18.1

Various R Programming Tools for Model Fitting

GPArotation — 2014.11-1

GPA Factor Rotation

hddplot — 0.59

Use Known Groups in High-Dimensional Data to Derive Scores for Plots

Hmisc — 4.2-0

Harrell Miscellaneous

homals — 1.0-8

Gifi Methods for Optimal Scaling

hybridHclust — 1.0-5

Hybrid Hierarchical Clustering

ICS — 1.3-1

Tools for Exploring Multivariate Data via ICS/ICA

ICSNP — 1.1-1

Tools for Multivariate Nonparametrics

iplots — 1.1-7.1

iPlots - interactive graphics for R

JADE — 2.0-1

Blind Source Separation Methods Based on Joint Diagonalization and Some BSS Performance Criteria

kernlab — 0.9-27

Kernel-Based Machine Learning Lab

KernSmooth — 2.23-15

Functions for Kernel Smoothing Supporting Wand & Jones (1995)

kknn — 1.3.1

Weighted k-Nearest Neighbors

klaR — 0.6-14

Classification and Visualization

knncat — 1.2.2

Nearest-neighbor Classification with Categorical Variables

kohonen — 3.0.8

Supervised and Unsupervised Self-Organising Maps

ks — 1.11.5

Kernel Smoothing

lattice — 0.20-38

Trellis Graphics for R

ltm — 1.1-1

Latent Trait Models under IRT

mAr — 1.1-2

Multivariate AutoRegressive analysis

MASS — 7.3-51.4

Support Functions and Datasets for Venables and Ripley's MASS

matrixcalc — 1.0-3

Collection of functions for matrix calculations

Matrix — 1.2-17

Sparse and Dense Matrix Classes and Methods

MCMCpack — 1.4-4

Markov Chain Monte Carlo (MCMC) Package

mclust — 5.4.5

Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation

mda — 0.4-10

Mixture and Flexible Discriminant Analysis

mice — 3.6.0

Multivariate Imputation by Chained Equations

misc3d — 0.8-4

Miscellaneous 3D Plots

mitools — 2.4

Tools for Multiple Imputation of Missing Data

mix — 1.0-10

Estimation/Multiple Imputation for Mixed Categorical and Continuous Data

monomvn — 1.9-10

Estimation for Multivariate Normal and Student-t Data with Monotone Missingness

mnormt — 1.5-5

The Multivariate Normal and t Distributions

MNP — 3.1-0

R Package for Fitting the Multinomial Probit Model

mvnmle — 0.1-11.1

ML Estimation for Multivariate Normal Data with Missing Values

mvnormtest — 0.1-9

Normality test for multivariate variables

mvoutlier — 2.0.9

Multivariate Outlier Detection Based on Robust Methods

mvtnorm — 1.0-11

Multivariate Normal and t Distributions

nFactors — 2.3.3

Parallel Analysis and Non Graphical Solutions to the Cattell Scree Test

pan — 1.6

Multiple Imputation for Multivariate Panel or Clustered Data

paran — 1.5.2

Horn's Test of Principal Components/Factors

party — 1.3-3

A Laboratory for Recursive Partytioning

pcaPP — 1.9-73

Robust PCA by Projection Pursuit

PearsonICA — 1.2-4

Independent component analysis using score functions from the Pearson system

poLCA — 1.4.1

Polytomous variable Latent Class Analysis

polycor — 0.7-9

Polychoric and Polyserial Correlations

plsgenomics — 1.5-2

PLS Analyses for Genomics

pls — 2.7-1

Partial Least Squares and Principal Component Regression

ppls — 1.6-1.1

Penalized Partial Least Squares

prim — 1.0.16

Patient Rule Induction Method (PRIM)

proxy — 0.4-23

Distance and Similarity Measures

psy — 1.1

Various procedures used in psychometry

PTAk — 1.3-34

Principal Tensor Analysis on k Modes

relaimpo — 2.2-3

Relative Importance of Regressors in Linear Models

rgl — 0.100.26

3D Visualization Using OpenGL

rggobi — 2.1.22

Interface Between R and 'GGobi'

robustbase — 0.93-5

Basic Robust Statistics

ROCR — 1.0-7

Visualizing the Performance of Scoring Classifiers

rpart — 4.1-15

Recursive Partitioning and Regression Trees

rrcov — 1.4-7

Scalable Robust Estimators with High Breakdown Point

scatterplot3d — 0.3-41

3D Scatter Plot

sem — 3.1-9

Structural Equation Models

SensoMineR — 1.23

Sensory Data Analysis

seriation — 1.2-7

Infrastructure for Ordering Objects Using Seriation

simba — 0.3-5

A Collection of functions for similarity analysis of vegetation data

smatr — 3.4-8

(Standardised) Major Axis Estimation and Testing Routines

sn — 1.5-4

The Skew-Normal and Related Distributions Such as the Skew-t

spam — 2.2-2

SPArse Matrix

SparseM — 1.77

Sparse Linear Algebra

SpatialNP — 1.1-3

Multivariate Nonparametric Methods Based on Spatial Signs and Ranks

superpc — 1.09

Supervised principal components

trimcluster — 0.1-2.1

Cluster Analysis with Trimming

tsfa — 2014.10-1

Time Series Factor Analysis

vegan — 2.5-5

Community Ecology Package

vcd — 1.4-4

Visualizing Categorical Data

VGAM — 1.1-1

Vector Generalized Linear and Additive Models

VIM — 4.8.0

Visualization and Imputation of Missing Values

xgobi — 1.2-15

Interface to the XGobi and XGvis programs for graphical data analysis

YaleToolkit — 4.2.2

Data exploration tools from Yale University.

