Last updated on 2020-05-20 by Paul Hewson
Base R contains most of the functionality for classical multivariate analysis, somewhere. A large number of packages on CRAN extend this methodology; a brief overview is given below. Application-specific uses of multivariate statistics are described in the relevant task views: for example, whilst principal components are listed here, ordination is covered in the Environmetrics task view. Further information on supervised classification can be found in the MachineLearning task view, and on unsupervised classification in the Cluster task view.
The packages in this view can be roughly structured into the following topics. If you think that some package is missing from the list, please let me know.
Visualising multivariate data
Base graphics functions (e.g. coplot()) and lattice functions (e.g. splom()) are useful for visualising pairwise arrays of 2-dimensional scatterplots, clouds and 3-dimensional densities. scatterplot.matrix() in car provides usefully enhanced pairwise scatterplots. Beyond this, scatterplot3d provides 3-dimensional scatterplots, and aplpack provides bagplots and spin3R(), a function for rotating 3d clouds. misc3d, dependent upon rgl, provides animated functions within R useful for visualising densities. YaleToolkit provides a range of useful visualisation techniques for multivariate data. More specialised multivariate plots include the following:
faces() in aplpack provides Chernoff's faces;
parcoord() from MASS provides parallel coordinate plots;
stars() in graphics provides a choice of star, radar and cobweb plots.
mstree() in ade4 and spantree() in vegan provide minimum spanning tree functionality. calibrate supports biplot and scatterplot axis labelling. geometry, which provides an interface to the qhull library, gives the indices of the points on the convex hull via convhulln(). ellipse draws ellipses for two parameters, and provides plotcorr(), a visual display of a correlation matrix. denpro provides level set trees for multivariate visualisation. Mosaic plots are available via mosaicplot() in graphics and mosaic() in vcd, which also contains other visualization techniques for multivariate categorical data. gclus provides a number of cluster-specific graphical enhancements for scatterplots and parallel coordinate plots. See the links for a reference to GGobi. rggobi interfaces with GGobi. xgobi interfaces to the XGobi and XGvis programs, which allow linked, dynamic multivariate plots as well as projection pursuit. Finally, iplots allows particularly powerful dynamic interactive graphics, of which interactive parallel co-ordinate plots and mosaic plots may be of great interest. Seriation methods are provided by seriation, which can reorder matrices and dendrograms.
summary.formula() in Hmisc assists with descriptive functions; from the same package varclus() offers variable clustering while find.matches() assists in exploring a given dataset in terms of representativeness and finding matches. Whilst dist() in stats and daisy() in cluster provide a wide range of distance measures, proxy provides a framework for more distance measures, including measures between matrices. simba provides functions for dealing with presence / absence data including similarity matrices and reshaping.
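As a minimal sketch of the distance measures mentioned above, the following uses only dist() from stats. The four observations and their labels are made-up values chosen so that the distances are easy to check by hand; they are not taken from any package.

```r
# Four observations on two variables (hypothetical values for illustration).
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8), d = c(0, 1))

d_euc <- dist(x)                        # Euclidean distance is the default
d_man <- dist(x, method = "manhattan")  # city-block distance

# A "dist" object stores only the lower triangle; as.matrix() expands it
# into a full symmetric matrix with a zero diagonal.
m_euc <- as.matrix(d_euc)
m_man <- as.matrix(d_man)

print(m_euc)
print(m_man)
```

Other metrics are selected through the method argument of dist(); daisy() in cluster may be a better fit when the variables are of mixed type.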
cov() and cor() in stats provide estimates of the covariance and correlation matrices respectively. ICSNP offers several descriptive measures such as spatial.median(), which provides an estimate of the spatial median, and further functions which provide estimates of scatter. Further robust methods are provided, such as cov.rob() in MASS, which provides robust estimates of the variance-covariance matrix by minimum volume ellipsoid, minimum covariance determinant or classical product-moment. covRobust provides robust covariance estimation via nearest neighbor variance estimation. robustbase provides robust covariance estimation via fast minimum covariance determinant with covMcd() and the orthogonalized pairwise estimate of Gnanadesikan-Kettenring via covOGK(). Scalable robust methods are provided within rrcov, also using fast minimum covariance determinant with covMcd() as well as M-estimators with covMest(). corpcor provides shrinkage estimation of large scale covariance and (partial) correlation matrices.
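The classical and robust estimates described above can be compared with a short sketch. It assumes the built-in iris measurements as example data and uses cov.rob() from MASS with the minimum covariance determinant method; the seed is fixed only because cov.rob() relies on random subsampling.

```r
library(MASS)  # for cov.rob()

X <- as.matrix(iris[, 1:4])  # example data, chosen for illustration only

S <- cov(X)  # classical covariance matrix
R <- cor(X)  # classical correlation matrix

# cov2cor() rescales a covariance matrix to the corresponding correlations.
R_from_S <- cov2cor(S)

# Robust location and scatter via minimum covariance determinant.
set.seed(1)
rob <- cov.rob(X, method = "mcd")

print(round(S, 3))
print(round(rob$cov, 3))
```

Large differences between S and rob$cov would suggest that a few observations dominate the classical estimate.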
mvrnorm() in MASS simulates from the multivariate normal distribution. mvtnorm also provides simulation as well as probability and quantile functions for both the multivariate t and multivariate normal distributions, as well as density functions for the multivariate normal distribution. mnormt provides multivariate normal and multivariate t density and distribution functions as well as random number simulation. sn provides density, distribution and random number generation for the multivariate skew normal and skew t distribution. delt provides a range of functions for estimating multivariate densities by CART and greedy methods. Comprehensive information on mixtures is given in the Cluster view; some density estimates and random numbers are provided by dmvnorm.mixt() in ks, and mixture fitting is also provided within bayesm. Functions to simulate from the Wishart distribution are provided in a number of places, such as rwishart() in bayesm and rwish() in MCMCpack (the latter also has a density function, dwish()).
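A minimal sketch of simulating from a multivariate normal distribution with mvrnorm() from MASS follows. The mean vector and covariance matrix are arbitrary illustrative values, not defaults of any package.

```r
library(MASS)  # for mvrnorm()

mu <- c(1, -1)                                 # hypothetical mean vector
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)   # hypothetical covariance matrix

set.seed(42)
sim <- mvrnorm(n = 500, mu = mu, Sigma = Sigma)

# With empirical = TRUE the sample is rescaled so that its mean and
# covariance match mu and Sigma exactly rather than only in expectation.
sim_exact <- mvrnorm(n = 500, mu = mu, Sigma = Sigma, empirical = TRUE)

print(colMeans(sim))
print(cov(sim))
```

The empirical = TRUE option can be convenient when a test dataset with known moments is needed.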
bkde2D() from KernSmooth and kde2d() from MASS provide binned and non-binned 2-dimensional kernel density estimation; ks also provides multivariate kernel smoothing, as do ash and GenKern. prim provides patient rule induction methods to attempt to find regions of high density in high dimensional multivariate data; feature also provides methods for determining feature significance in multivariate data (such as in relation to local modes).
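As an illustration of 2-dimensional kernel density estimation, the sketch below calls kde2d() from MASS on simulated data. The simulated variables and the grid size of 30 are assumptions made for the example.

```r
library(MASS)  # for kde2d()

set.seed(7)
x <- rnorm(200)                       # simulated data, illustration only
y <- 0.5 * x + rnorm(200, sd = 0.5)

# Evaluate the density estimate on a 30 x 30 grid spanning the data range.
dens <- kde2d(x, y, n = 30)

# Locate the grid cell with the highest estimated density.
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_x <- dens$x[peak[1]]
peak_y <- dens$y[peak[2]]

print(c(peak_x, peak_y))
```

The returned list (components x, y and z) can be passed directly to contour(), image() or persp() for display.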
mvnorm.etest() in energy provides an assessment of normality based on E statistics (energy); in the same package k.sample() assesses a number of samples for equal distributions. Tests for Wishart-distributed covariance matrices are given by mauchly.test() in stats.
lm() (with a matrix specified as the dependent variable) offers multivariate linear models, and anova.mlm() provides comparison of multivariate linear models.
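A short sketch of a multivariate linear model follows, using a two-column response matrix built from the iris data. The choice of response and predictor variables is an assumption made purely for illustration; manova() from stats is included for comparison.

```r
# Two-column response matrix (variables chosen for illustration only).
Y <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])

# A matrix on the left-hand side makes lm() return an "mlm" object.
fit_full <- lm(Y ~ Species + Petal.Length, data = iris)
fit_null <- lm(Y ~ Petal.Length, data = iris)

# anova() dispatches to anova.mlm() and compares the nested models.
cmp <- anova(fit_null, fit_full)

# One-way MANOVA on the same responses.
mv <- manova(Y ~ Species, data = iris)
mv_tab <- summary(mv, test = "Pillai")

print(cmp)
print(mv_tab)
```

The coefficient matrix of an "mlm" fit has one column per response, so coef(fit_full) here has two columns.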
manova() offers MANOVA. sn provides msn.mle() and mst.mle(), which fit multivariate skew normal and multivariate skew t models. pls provides partial least squares regression (PLSR) and principal component regression; dr provides dimension reduction regression options such as "sir" (sliced inverse regression) and "save" (sliced average variance estimation). plsgenomics provides partial least squares analyses for genomics. relaimpo provides functions to investigate the relative importance of regression parameters.
Principal components are provided by prcomp() (based on svd(), preferred) as well as princomp() (based on eigen() for compatibility with S-PLUS) from stats.
pc1() in Hmisc provides the first principal component and gives coefficients for unscaled data. Additional support for an assessment of the scree plot can be found in nFactors, whereas paran provides routines for Horn's evaluation of the number of dimensions to retain. For wide matrices, gmodels provides fast.svd(). kernlab uses kernel methods to provide a form of non-linear principal components with kpca(). pcaPP provides robust principal components by means of projection pursuit. amap provides further robust and parallelised methods such as a form of generalised and robust principal component analysis via acpgen() and acprob() respectively. Further options for principal components in an ecological setting are available within ade4 and in a sensory setting in SensoMineR. psy provides a variety of routines useful in psychometry; in this context these include sphpca(), which maps onto a sphere, and fpca(), where some variables may be considered as dependent, as well as scree.plot(), which has the option of adding simulation results to help assess the observed data. PTAk provides principal tensor analysis analogous to both PCA and correspondence analysis. smatr provides standardised major axis estimation with specific application to allometry.
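The relationship between the singular value and eigenvalue routes to principal components can be checked with a small sketch. It assumes the built-in USArrests data and works on standardised variables, so the eigen decomposition is taken of the correlation matrix.

```r
X <- as.matrix(USArrests)  # example data, chosen for illustration only

# Principal components via singular value decomposition of the scaled data.
pc_svd <- prcomp(X, scale. = TRUE)

# The same components from the eigen decomposition of the correlation matrix.
ev <- eigen(cor(X), symmetric = TRUE)

var_svd <- pc_svd$sdev^2   # component variances from prcomp()
var_eig <- ev$values       # eigenvalues of the correlation matrix

# Proportion of total variance explained by each component.
prop_var <- var_svd / sum(var_svd)

print(round(rbind(var_svd, var_eig), 4))
print(round(prop_var, 3))
```

The loadings agree only up to sign, since the direction of each component is arbitrary.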
cancor() in stats provides canonical correlation. kernlab uses kernel methods to provide robust canonical correlation with kcca(). concor provides a number of concordance methods.
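A minimal sketch of canonical correlation with cancor() from stats is given here. Splitting the iris measurements into sepal and petal blocks is an assumption made for the example.

```r
X <- as.matrix(iris[, 1:2])  # sepal measurements (illustrative split)
Y <- as.matrix(iris[, 3:4])  # petal measurements

cc <- cancor(X, Y)

# First pair of canonical variates: centre each block with the centres
# used by cancor(), then apply the first column of coefficients.
u <- scale(X, center = cc$xcenter, scale = FALSE) %*% cc$xcoef[, 1]
v <- scale(Y, center = cc$ycenter, scale = FALSE) %*% cc$ycoef[, 1]

print(cc$cor)                 # canonical correlations, largest first
print(as.numeric(cor(u, v)))  # correlation of the first pair of variates
```

The number of canonical correlations equals the smaller of the two block dimensions, here two.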
rda() for redundancy analysis as well as further options for canonical correlation. fso provides fuzzy set ordination, which extends ordination beyond methods available from linear algebra.
procrustes() in vegan provides Procrustes analysis; this package also provides functions for ordination, and further information on that area is given in the Environmetrics task view. Generalised Procrustes analysis via GPA() is available from FactoMineR.
Principal coordinates / scaling methods
cmdscale() in stats provides classical multidimensional scaling (principal coordinates analysis); sammon() and isoMDS() in MASS offer Sammon and Kruskal's non-metric multidimensional scaling. vegan provides wrappers and post-processing for non-metric MDS. indscal() is provided by SensoMineR.
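Classical multidimensional scaling can be sketched with cmdscale() from stats. The example assumes Euclidean distances between the standardised USArrests observations; with Euclidean input, a configuration using all four dimensions reproduces the original distances.

```r
X <- scale(as.matrix(USArrests))  # standardised example data
d <- dist(X)                      # Euclidean distances between states

conf2 <- cmdscale(d, k = 2)  # two-dimensional configuration for plotting
conf4 <- cmdscale(d, k = 4)  # full-dimensional configuration

# Largest absolute discrepancy between original and reproduced distances.
err2 <- max(abs(as.vector(dist(conf2)) - as.vector(d)))
err4 <- max(abs(as.vector(dist(conf4)) - as.vector(d)))

print(c(err2 = err2, err4 = err4))
```

The two-dimensional solution is the one usually plotted; its discrepancy indicates how much structure is lost in the reduction.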
Hierarchical clustering is provided by hclust() and k-means clustering by kmeans() in stats. A range of established clustering and visualisation techniques are also available in cluster, some cluster validation routines are available in clv, and the Rand index can be computed from classAgreement() in e1071. Cluster ensembles are available from clue; methods to assist with choice of routines are available in clusterSim. Distance measures (edist()) and hierarchical clustering (hclust.energy()) based on E-statistics are available in energy. Mahalanobis distance based clustering (for fixed points as well as clusterwise regression) is available from fpc. clustvarsel provides variable selection within model-based clustering. Fuzzy clustering is available within cluster as well as via the hopach (Hierarchical Ordered Partitioning and Collapsing Hybrid) algorithm. kohonen provides supervised and unsupervised SOMs for high dimensional spectra or patterns. clusterGeneration helps simulate clusters. The Environmetrics task view also gives a topic-related overview of some clustering techniques. Model based clustering is available in mclust.
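The two stats routines named above can be sketched side by side. The example assumes the standardised iris measurements, three clusters and average linkage; all three choices are illustrative rather than recommendations.

```r
X <- scale(iris[, 1:4])  # standardised example data

# Hierarchical clustering on Euclidean distances, cut into three groups.
hc <- hclust(dist(X), method = "average")
groups_hc <- cutree(hc, k = 3)

# k-means with several random starts; the seed makes the run repeatable.
set.seed(123)
km <- kmeans(X, centers = 3, nstart = 20)

# Cross-tabulate the two partitions to see how far they agree.
tab <- table(hclust = groups_hc, kmeans = km$cluster)
print(tab)
```

Cluster labels are arbitrary, so agreement shows up as one dominant cell per row rather than as a large diagonal.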
Supervised classification and discriminant analysis
lda() and qda() within MASS provide linear and quadratic discrimination respectively. mda provides mixture and flexible discriminant analysis with fda() as well as multivariate adaptive regression splines with mars() and adaptive spline backfitting with the bruto() function. Multivariate adaptive regression splines can also be found in earth. Package class provides k-nearest neighbours by knn(). SensoMineR provides FDA() for factorial discriminant analysis. A number of packages provide for dimension reduction alongside the classification. klaR includes variable selection and robustness against multicollinearity as well as a number of visualisation routines. superpc provides principal components for supervised classification, whereas gpls provides classification using generalised partial least squares. hddplot provides cross-validated linear discriminant calculations to determine the optimum number of features. ROCR provides a range of methods for assessing classifier performance. Further information on supervised classification can be found in the MachineLearning task view.
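A minimal sketch of linear discriminant analysis with lda() from MASS follows, assuming the iris data as the example. The accuracy reported is a resubstitution figure computed on the training data, so it will tend to be optimistic.

```r
library(MASS)  # for lda()

fit <- lda(Species ~ ., data = iris)

# Predict on the training data (resubstitution, illustration only).
pred <- predict(fit, iris)

conf_mat <- table(observed = iris$Species, predicted = pred$class)
acc <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(acc)
```

predict() also returns posterior class probabilities (pred$posterior) and the discriminant scores (pred$x), which are often plotted to inspect group separation.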
corresp() and mca() in MASS provide simple and multiple correspondence analysis respectively. ca also provides single, multiple and joint correspondence analysis. mca() in ade4 provide correspondence and multiple correspondence analysis respectively, as well as adding homogeneous table analysis with hta(). Further functionality is also available within vegan; co-correspondence analysis is available from cocorresp. FactoMineR provides CA() and MCA(), which also enable simple and multiple correspondence analysis as well as associated graphical routines. homals provides homogeneity analysis.
transcan() from Hmisc provides further imputation methods.
Latent variable approaches
factanal() in stats provides factor analysis by maximum likelihood; Bayesian factor analysis is provided for Gaussian, ordinal and mixed variables in MCMCpack. GPArotation offers GPA (gradient projection algorithm) factor rotation. sem fits linear structural equation models, ltm provides latent trait models under item response theory, and a range of extensions to Rasch models can be found in eRm. FactoMineR provides a wide range of factor analysis methods, including MFA() and HMFA() for multiple and hierarchical multiple factor analysis as well as ADFM() for multiple factor analysis of quantitative and qualitative data. tsfa provides factor analysis for time series. poLCA provides latent class and latent class regression models for a variety of outcome variables.
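Maximum likelihood factor analysis with factanal() can be sketched using the ability.cov covariance matrix shipped with R in the datasets package. The choice of two factors is an assumption made for the example.

```r
# ability.cov is a list holding a covariance matrix, centres and n.obs,
# which factanal() accepts directly through its covmat argument.
fa <- factanal(factors = 2, covmat = ability.cov)

L <- loadings(fa)      # 6 x 2 matrix of (varimax-rotated) loadings
u <- fa$uniquenesses   # variance in each test not shared with the factors

print(fa)
```

Variables with uniqueness close to 1 are poorly explained by the common factors.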
Modelling non-Gaussian data
mApply() in Hmisc generalises apply() for matrices and passes multiple functions. In addition to functions listed earlier, sn provides operations such as marginalisation, affine transformations and graphics for the multivariate skew normal and skew t distribution. mAr provides for vector auto-regression.
rm.boot() from Hmisc bootstraps repeated measures models. psy also provides a range of statistics based on Cohen's kappa, including weighted measures and agreement among more than 2 raters. cwhmisc contains a number of useful support functions, such as normalise() and various rotation functions. desirability provides functions for multivariate optimisation. geozoo provides plotting of geometric objects in GGobi.