Provides some easy-to-use functions to extract and visualize the output of multivariate data analyses, including 'PCA' (Principal Component Analysis), 'CA' (Correspondence Analysis), 'MCA' (Multiple Correspondence Analysis), 'FAMD' (Factor Analysis of Mixed Data), 'MFA' (Multiple Factor Analysis) and 'HMFA' (Hierarchical Multiple Factor Analysis) functions from different R packages. It also contains functions for simplifying some clustering analysis steps and provides elegant 'ggplot2'-based data visualization.
factoextra is an R package that makes it easy to extract and visualize the output of exploratory multivariate data analyses, including:
Principal Component Analysis (PCA), which is used to summarize the information contained in continuous (i.e., quantitative) multivariate data by reducing the dimensionality of the data without losing important information.
Correspondence Analysis (CA), which is an extension of the principal component analysis suited to analyse a large contingency table formed by two qualitative variables (or categorical data).
Multiple Correspondence Analysis (MCA), which is an adaptation of CA to a data table containing more than two categorical variables.
Multiple Factor Analysis (MFA), which is dedicated to data sets where variables are organized into groups (qualitative and/or quantitative variables).
Hierarchical Multiple Factor Analysis (HMFA): An extension of MFA in a situation where the data are organized into a hierarchical structure.
Factor Analysis of Mixed Data (FAMD), a particular case of MFA, dedicated to analyzing a data set containing both quantitative and qualitative variables.
There are a number of R packages implementing principal component methods. These packages include: FactoMineR, ade4, stats, ca, MASS and ExPosition.
However, the results are presented differently depending on the package used. To help in the interpretation and visualization of multivariate analyses, such as cluster analysis and dimensionality reduction analysis, we developed an easy-to-use R package named factoextra.
The figure below shows the methods whose outputs can be visualized using the factoextra package. The official online documentation is available at: http://www.sthda.com/english/rpkgs/factoextra.
The factoextra R package can handle the results of PCA, CA, MCA, MFA, FAMD and HMFA from several packages, for extracting and visualizing the most important information contained in your data.
After PCA, CA, MCA, MFA, FAMD and HMFA, the most important row/column elements can be highlighted using:
If you want to do this, the factoextra package provides a convenient solution.
If you want to make predictions with PCA/MCA and visualize the position of the supplementary variables/individuals on the factor map using ggplot2, factoextra can help you. It is quick: you write less and do more.
Whichever package you decide to use, factoextra can give you human-readable output.
The FactoMineR package can be installed and loaded as follows:
install.packages("FactoMineR")
# Load
library("FactoMineR")
install.packages("factoextra")
if(!require(devtools)) install.packages("devtools")
devtools::install_github("kassambara/factoextra")
library("factoextra")
#> Loading required package: ggplot2
#> Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
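Before installing, it can be handy to check which of the two packages are already available. The following is only a minimal sketch: `missing_packages()` is a hypothetical helper written for this illustration (it is not part of factoextra), and the install call is left commented out.

```r
# Hypothetical helper (not part of factoextra): return the packages
# from `pkgs` that cannot be loaded in the current R library.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("FactoMineR", "factoextra"))
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```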
See the online documentation (http://www.sthda.com/english/rpkgs/factoextra) for a complete list.
Functions | Description |
---|---|
fviz_eig (or fviz_eigenvalue) | Extract and visualize the eigenvalues/variances of dimensions. |
fviz_pca | Graph of individuals/variables from the output of Principal Component Analysis (PCA). |
fviz_ca | Graph of column/row variables from the output of Correspondence Analysis (CA). |
fviz_mca | Graph of individuals/variables from the output of Multiple Correspondence Analysis (MCA). |
fviz_mfa | Graph of individuals/variables from the output of Multiple Factor Analysis (MFA). |
fviz_famd | Graph of individuals/variables from the output of Factor Analysis of Mixed Data (FAMD). |
fviz_hmfa | Graph of individuals/variables from the output of Hierarchical Multiple Factor Analysis (HMFA). |
fviz_ellipses | Draw confidence ellipses around the categories. |
fviz_cos2 | Visualize the quality of representation of the row/column variable from the results of PCA, CA, MCA functions. |
fviz_contrib | Visualize the contributions of row/column elements from the results of PCA, CA, MCA functions. |
Functions | Description |
---|---|
get_eigenvalue | Extract the eigenvalues/variances of dimensions. |
get_pca | Extract all the results (coordinates, squared cosine, contributions) for the active individuals/variables from Principal Component Analysis (PCA) outputs. |
get_ca | Extract all the results (coordinates, squared cosine, contributions) for the active column/row variables from Correspondence Analysis outputs. |
get_mca | Extract results from Multiple Correspondence Analysis outputs. |
get_mfa | Extract results from Multiple Factor Analysis outputs. |
get_famd | Extract results from Factor Analysis of Mixed Data outputs. |
get_hmfa | Extract results from Hierarchical Multiple Factor Analysis outputs. |
facto_summarize | Subset and summarize the output of factor analyses. |
Functions | Description |
---|---|
dist(fviz_dist, get_dist) | Enhanced Distance Matrix Computation and Visualization. |
get_clust_tendency | Assessing Clustering Tendency. |
fviz_nbclust(fviz_gap_stat) | Determining and Visualizing the Optimal Number of Clusters. |
fviz_dend | Enhanced Visualization of Dendrogram |
fviz_cluster | Visualize Clustering Results |
fviz_mclust | Visualize Model-based Clustering Results |
fviz_silhouette | Visualize Silhouette Information from Clustering. |
hcut | Computes Hierarchical Clustering and Cut the Tree |
hkmeans (hkmeans_tree, print.hkmeans) | Hierarchical k-means clustering. |
eclust | Visual enhancement of clustering analysis |
As depicted in the figure below, the type of analysis to be performed depends on the data set formats and structures.
In this section we start by illustrating classical methods - such as PCA, CA and MCA - for analyzing a data set containing continuous variables, contingency table and qualitative variables, respectively.
We continue by discussing advanced methods - such as FAMD, MFA and HMFA - for analyzing a data set containing a mix of qualitative and quantitative variables, organized or not into groups.
Finally, we show how to perform hierarchical clustering on principal components (HCPC), which is useful for clustering a data set containing only qualitative variables or a mix of qualitative and quantitative variables.
Read more about computing and interpreting principal component analysis at: Principal Component Analysis (PCA).
library("factoextra")
data("decathlon2")
df <- decathlon2[1:23, 1:10]
library("FactoMineR")
res.pca <- PCA(df, graph = FALSE)
# Extract eigenvalues/variances
get_eig(res.pca)
#>        eigenvalue variance.percent cumulative.variance.percent
#> Dim.1   4.1242133        41.242133                    41.24213
#> Dim.2   1.8385309        18.385309                    59.62744
#> Dim.3   1.2391403        12.391403                    72.01885
#> Dim.4   0.8194402         8.194402                    80.21325
#> Dim.5   0.7015528         7.015528                    87.22878
#> Dim.6   0.4228828         4.228828                    91.45760
#> Dim.7   0.3025817         3.025817                    94.48342
#> Dim.8   0.2744700         2.744700                    97.22812
#> Dim.9   0.1552169         1.552169                    98.78029
#> Dim.10  0.1219710         1.219710                   100.00000
# Visualize eigenvalues/variances
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 50))
4. Extract and visualize results for variables:
# Extract the results for variables
var <- get_pca_var(res.pca)
var
#> Principal Component Analysis Results for variables
#>  ===================================================
#>   Name       Description
#> 1 "$coord"   "Coordinates for the variables"
#> 2 "$cor"     "Correlations between variables and dimensions"
#> 3 "$cos2"    "Cos2 for the variables"
#> 4 "$contrib" "contributions of the variables"
# Coordinates of variables
head(var$coord)
#>                   Dim.1       Dim.2      Dim.3       Dim.4      Dim.5
#> X100m        -0.8506257 -0.17939806  0.3015564  0.03357320 -0.1944440
#> Long.jump     0.7941806  0.28085695 -0.1905465 -0.11538956  0.2331567
#> Shot.put      0.7339127  0.08540412  0.5175978  0.12846837 -0.2488129
#> High.jump     0.6100840 -0.46521415  0.3300852  0.14455012  0.4027002
#> X400m        -0.7016034  0.29017826  0.2835329  0.43082552  0.1039085
#> X110m.hurdle -0.7641252 -0.02474081  0.4488873 -0.01689589  0.2242200
# Contribution of variables
head(var$contrib)
#>                  Dim.1      Dim.2     Dim.3       Dim.4     Dim.5
#> X100m        17.544293  1.7505098  7.338659  0.13755240  5.389252
#> Long.jump    15.293168  4.2904162  2.930094  1.62485936  7.748815
#> Shot.put     13.060137  0.3967224 21.620432  2.01407269  8.824401
#> High.jump     9.024811 11.7715838  8.792888  2.54987951 23.115504
#> X400m        11.935544  4.5799296  6.487636 22.65090599  1.539012
#> X110m.hurdle 14.157544  0.0332933 16.261261  0.03483735  7.166193
# Graph of variables: default plot
fviz_pca_var(res.pca, col.var = "black")
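To make the three columns of the eigenvalue table concrete, here is a minimal sketch that rebuilds them by hand with base R. It assumes a standardized PCA computed with `prcomp()` on the built-in `iris` data (a stand-in for `decathlon2`); the comparison with `get_eigenvalue()` only runs if factoextra is installed.

```r
# PCA on standardized variables, base R only
pca <- prcomp(iris[, -5], scale. = TRUE)

# Eigenvalues are the variances of the principal components
eigenvalue <- pca$sdev^2
variance.percent <- 100 * eigenvalue / sum(eigenvalue)

eig <- data.frame(
  eigenvalue = eigenvalue,
  variance.percent = variance.percent,
  cumulative.variance.percent = cumsum(variance.percent),
  row.names = paste0("Dim.", seq_along(eigenvalue))
)
print(eig)

# Optional cross-check against factoextra, if it is available
if (requireNamespace("factoextra", quietly = TRUE)) {
  print(all.equal(eig$eigenvalue, factoextra::get_eigenvalue(pca)$eigenvalue))
}
```

With standardized variables the eigenvalues sum to the number of variables, which is why each percentage is simply the eigenvalue divided by that count.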
It's possible to control variable colors using their contributions ("contrib") to the principal axes:
# Control variable colors using their contributions
fviz_pca_var(res.pca, col.var = "contrib",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE # Avoid text overlapping
)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
# Extract the results for individuals
ind <- get_pca_ind(res.pca)
ind
#> Principal Component Analysis Results for individuals
#>  ===================================================
#>   Name       Description
#> 1 "$coord"   "Coordinates for the individuals"
#> 2 "$cos2"    "Cos2 for the individuals"
#> 3 "$contrib" "contributions of the individuals"
# Coordinates of individuals
head(ind$coord)
#>                Dim.1      Dim.2      Dim.3       Dim.4       Dim.5
#> SEBRLE     0.1955047  1.5890567  0.6424912  0.08389652  1.16829387
#> CLAY       0.8078795  2.4748137 -1.3873827  1.29838232 -0.82498206
#> BERNARD   -1.3591340  1.6480950  0.2005584 -1.96409420  0.08419345
#> YURKOV    -0.8889532 -0.4426067  2.5295843  0.71290837  0.40782264
#> ZSIVOCZKY -0.1081216 -2.0688377 -1.3342591 -0.10152796 -0.20145217
#> McMULLEN   0.1212195 -1.0139102 -0.8625170  1.34164291  1.62151286
# Graph of individuals
# 1. Use repel = TRUE to avoid overplotting
# 2. Control automatically the color of individuals using the cos2
#    cos2 = the quality of the individuals on the factor map
#    Use points only
# 3. Use gradient color
fviz_pca_ind(res.pca, col.ind = "cos2",
  gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
  repel = TRUE # Avoid text overlapping (slow if many points)
)
# Biplot of individuals and variables
fviz_pca_biplot(res.pca, repel = TRUE)
# Compute PCA on the iris data set
# The variable Species (index = 5) is removed
# before PCA analysis
iris.pca <- PCA(iris[, -5], graph = FALSE)
# Visualize
# Use habillage to specify groups for coloring
fviz_pca_ind(iris.pca,
  label = "none", # hide individual labels
  habillage = iris$Species, # color by groups
  palette = c("#00AFBB", "#E7B800", "#FC4E07"),
  addEllipses = TRUE # Concentration ellipses
)
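The coordinates, cos2 and contributions reported for variables are tightly linked. As a rough sketch (assuming a standardized PCA from `prcomp()` on `iris`, not the FactoMineR object used above), they can be derived by hand as follows; the object names are chosen for this illustration only.

```r
pca <- prcomp(iris[, -5], scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations.
# For standardized data these are also the variable-dimension correlations.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# Quality of representation: squared coordinates
cos2 <- coord^2

# Contribution (in percent) of each variable to each dimension
contrib <- 100 * sweep(cos2, 2, colSums(cos2), `/`)

print(round(coord, 3))
print(round(contrib, 2))
```

Each column of `contrib` sums to 100, so a variable whose contribution exceeds `100 / ncol(iris[, -5])` contributes more than it would under a uniform split.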
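The cos2 values used above to color individuals measure how well each individual is represented on a given dimension. The sketch below computes them by hand, assuming a `prcomp()` fit on `iris`; `quality_12` is a name introduced here for illustration.

```r
pca <- prcomp(iris[, -5], scale. = TRUE)
coord <- pca$x

# Squared distance of each individual to the origin (over all dimensions)
d2 <- rowSums(coord^2)

# cos2: share of that squared distance carried by each dimension
cos2 <- coord^2 / d2

# Contribution (in percent) of each individual to each dimension
contrib <- 100 * sweep(coord^2, 2, colSums(coord^2), `/`)

# Quality of representation on the first factor map (Dim 1 and Dim 2)
quality_12 <- rowSums(cos2[, 1:2])
print(head(round(quality_12, 3)))
```

An individual with `quality_12` close to 1 lies almost entirely in the plane of the first two dimensions; a low value means its position on the factor map should be interpreted with caution.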
Read more about computing and interpreting correspondence analysis at: Correspondence Analysis (CA).
# Loading data
data("housetasks")
# Computing CA
library("FactoMineR")
res.ca <- CA(housetasks, graph = FALSE)
# Result for row variables
get_ca_row(res.ca)
# Result for column variables
get_ca_col(res.ca)
fviz_ca_biplot(res.ca, repel = TRUE)
To visualize only row points or column points, type this:
# Graph of row points
fviz_ca_row(res.ca, repel = TRUE)
# Graph of column points
fviz_ca_col(res.ca)
# Visualize row contributions on axes 1
fviz_contrib(res.ca, choice = "row", axes = 1)
# Visualize column contributions on axes 1
fviz_contrib(res.ca, choice = "col", axes = 1)
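For readers who want to see what a correspondence analysis decomposes, the following sketch computes the CA eigenvalues by hand from a contingency table and checks that they add up to the chi-square statistic divided by the table total (the total inertia). It uses the built-in `HairEyeColor` data as a stand-in for `housetasks`, and all object names are illustrative.

```r
# Hair x Eye contingency table from a built-in data set
tab <- unclass(margin.table(HairEyeColor, c(1, 2)))
n <- sum(tab)

P <- tab / n            # correspondence matrix
rmass <- rowSums(P)     # row masses
cmass <- colSums(P)     # column masses

# Standardized residuals from the independence model
S <- diag(1 / sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1 / sqrt(cmass))

# CA eigenvalues are the squared singular values (dropping numerical zeros)
eig <- svd(S)$d^2
eig <- eig[eig > 1e-12]

total_inertia <- unname(chisq.test(tab)$statistic) / n
print(eig)
print(c(sum_of_eigenvalues = sum(eig), chi2_over_n = total_inertia))
```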
Read more about computing and interpreting multiple correspondence analysis at: Multiple Correspondence Analysis (MCA).
library(FactoMineR)
data(poison)
res.mca <- MCA(poison, quanti.sup = 1:2,
  quali.sup = 3:4, graph = FALSE)
# Extract the results for variable categories
get_mca_var(res.mca)
# Extract the results for individuals
get_mca_ind(res.mca)
# Visualize variable category contributions on axes 1
fviz_contrib(res.mca, choice = "var", axes = 1)
# Visualize individual contributions on axes 1
# select the top 20
fviz_contrib(res.mca, choice = "ind", axes = 1, top = 20)
# Color by groups
# Add concentration ellipses
# Use repel = TRUE to avoid overplotting
grp <- as.factor(poison[, "Vomiting"])
fviz_mca_ind(res.mca, habillage = grp,
  addEllipses = TRUE, repel = TRUE)
fviz_mca_var(res.mca, repel = TRUE)
fviz_mca_biplot(res.mca, repel = TRUE)
The factoextra R package also has functions that support the visualization of advanced methods such as:
To learn more about cluster analysis, you can refer to the book available at: Practical Guide to Cluster Analysis in R
The main parts of the book include:
The book presents the basic principles of these tasks and provides many examples in R. It offers solid guidance in data mining for students and researchers.
# 1. Loading and preparing data
data("USArrests")
df <- scale(USArrests)
# 2. Compute k-means
set.seed(123)
km.res <- kmeans(scale(USArrests), 4, nstart = 25)
# 3. Visualize
library("factoextra")
fviz_cluster(km.res, data = df,
  palette = c("#00AFBB", "#2E9FDF", "#E7B800", "#FC4E07"),
  ggtheme = theme_minimal(),
  main = "Partitioning Clustering Plot")
library("factoextra")
# Compute hierarchical clustering and cut into 4 clusters
res <- hcut(USArrests, k = 4, stand = TRUE)
# Visualize
fviz_dend(res, rect = TRUE, cex = 0.5,
  k_colors = c("#00AFBB", "#2E9FDF", "#E7B800", "#FC4E07"))
# Optimal number of clusters for k-means
library("factoextra")
my_data <- scale(USArrests)
fviz_nbclust(my_data, kmeans, method = "gap_stat")
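When the data have more than two variables, a cluster plot of this kind is drawn on the first two principal components. The base R sketch below reproduces that idea without factoextra; it is an approximation for illustration (plain `plot()` instead of ggplot2, no ellipses), not the package's implementation.

```r
df <- scale(USArrests)

set.seed(123)
km.res <- kmeans(df, 4, nstart = 25)

# Project the observations onto the first two principal components
pcs <- prcomp(df)$x[, 1:2]

# Scatter plot of the projection, colored by k-means cluster
plot(pcs, col = km.res$cluster, pch = 19,
     xlab = "Dim1", ylab = "Dim2",
     main = "k-means clusters on the first two PCs")
text(pcs, labels = rownames(df), pos = 3, cex = 0.6)
```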
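The gap statistic is one of several criteria for choosing the number of clusters. As a simpler point of comparison, here is a minimal sketch of the "elbow" approach computed by hand with base R: the total within-cluster sum of squares is recorded for k = 1 to 10 and plotted. In factoextra the corresponding call would be `fviz_nbclust(my_data, kmeans, method = "wss")`.

```r
my_data <- scale(USArrests)

set.seed(123)
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(my_data, centers = k, nstart = 25)$tot.withinss
})

# Look for the "elbow": the k after which the decrease flattens out
plot(k_values, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```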
I would like to thank Fabian Mundt for his active contributions to factoextra.
We sincerely thank all developers for their efforts behind the packages that factoextra depends on, namely ggplot2 (Hadley Wickham, Springer-Verlag New York, 2009), FactoMineR (Sebastien Le et al., Journal of Statistical Software, 2008), dendextend (Tal Galili, Bioinformatics, 2015), cluster (Martin Maechler et al., 2016) and others.
invisible works properly in the function fviz_pca_biplot() (@ginolhac, #26).
fviz_dend() now works for an object of class diana (@qfazille, #30).
fviz_cluster() supports HCPC results (@famuvie, #34).
New argument mean.point in the function fviz(): a logical value. If TRUE, group mean points are added to the plot.
Now, PCA correlation circles have fixed coordinates so they don't appear as ellipses (@scoavoux, #38).
New arguments fill.ind and fill.var added in fviz_pca() (@ginolhac, #27 and @Confurious, #42).
New arguments geom.ind and geom.var in fviz_pca_xxx() and fviz_mca_xxx() functions to have more control over the individuals/variables geometry in the functions fviz_pca_biplot() and fviz_mca_biplot() (@Confurious, #42).
New arguments geom.row and geom.col in fviz_ca_xxx() functions to have more control over the row/column geometry in the function fviz_ca_biplot() (@Confurious, #42).
New argument gradient.cols in fviz_pca_biplot().
New argument axes in fviz_cluster() to specify the dimension to plot.
New argument circlesize in the function fviz() to change the size of the variable correlation circle.
It's now possible to color individuals using a custom continuous variable (#29). This is done using the argument col.ind.
library(factoextra)
data(iris)
res.pca <- prcomp(iris[, -5], scale = TRUE)
# Visualize and color by a custom continuous variable
fviz_pca_ind(res.pca, col.ind = iris$Sepal.Length,
  legend.title = "Sepal.Length")
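library(FactoMineR)
library(factoextra)
# Contingency table with Japanese labels
# (rows, region: Oslo, central region, northern region;
#  columns, crime: robbery, fraud, vandalism)
.tbl2.1 <- matrix(c(395, 2456, 1758, 147, 153, 916, 694, 327, 1347),
  byrow = TRUE, 3, 3)
dimnames(.tbl2.1) <- list(地域 = c("オスロ", "中部地域", "北部地域"),
  犯罪 = c("強盗", "詐欺", "破壊"))
res.CA <- CA(.tbl2.1, graph = FALSE)
fviz_ca_biplot(res.CA, map = "simbiplot", title = "simbiplot",
  font.family = "HiraKakuProN-W3")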
New function fviz_mclust() for plotting model-based clustering using ggplot2.
New function fviz(): Generic function to create a scatter plot of multivariate analysis outputs, including PCA, CA, MCA, MFA, ...
New functions fviz_mfa_var() and fviz_hmfa_var() for plotting MFA and HMFA variables, respectively.
New function get_mfa_var(): Extract the results for variables (quantitative, qualitative and groups). Deprecated functions: get_mfa_var_quanti(), get_mfa_var_quali() and get_mfa_group().
New functions added for extracting and visualizing the results of FAMD (factor analysis of mixed data): get_famd_ind(), get_famd_var(), fviz_famd_ind() and fviz_famd_var().
Now fviz_dend() returns a ggplot. It can be used to plot circular dendrograms and phylogenic-like trees. Additionally, it supports an object of class HCPC (from FactoMineR).
New arguments in fviz_cluster() to change the plot main title and axis labels.
New argument pointshape in fviz_pca(). When you use habillage, point shapes change automatically by groups. To avoid this behaviour use for example pointshape = 19 in combination with habillage (@raynamharris, #15).
New argument repel in fviz_add().
New argument gradient.cols in fviz_*() functions.
Support for the ExPosition package added (epCA, epPCA, epMCA) (#23)
Check point added in the function fviz_nbclust() to make sure that x is an object of class data.frame or matrix (Jakub Nowosad, #15).
The following arguments are deprecated in fviz_cluster(): title, frame, frame.type, frame.level, frame.alpha. Now, use main, ellipse, ellipse.type, ellipse.level and ellipse.alpha instead.
Now, by default, the function fviz_cluster() doesn't show cluster mean points for an object of class PAM and CLARA when the argument show.clust.cent is missing. This is because cluster centers are medoids in the case of PAM and CLARA, not means. However, the user can force the function to display the mean points by using the argument show.clust.cent = TRUE.
The argument jitter is deprecated; use repel = TRUE instead, to avoid overlapping of labels.
New argument "sub" in fviz_dend() for adding a subtitle to the dendrogram. If NULL, the method used for hierarchical clustering is shown. To remove the subtitle use sub = "".
fviz_cluster() can handle an HCPC object obtained from MCA (Alejandro Juarez-Escario, #13).
fviz_ca_biplot() reacts when repel = TRUE is used.
facto_summarize(): now the contribution values computed for >=2 axes are in percentage (#22).
fviz_ca() and fviz_mca() now work with the latest version of ade4 v1.7-5 (#24).
New fviz_mfa function to plot MFA individuals, partial individuals, quantitative variables, categorical variables, groups relationship square and partial axes (@inventionate, #4).
New fviz_hmfa function to plot HMFA individuals, quantitative variables, categorical variables and groups relationship square (@inventionate, #4).
New get_mfa and get_hmfa functions (@inventionate, #4).
fviz_ca, fviz_pca, fviz_mca, fviz_mfa and fviz_hmfa ggrepel support (@inventionate, #4).
Updated fviz_summarize, eigenvalue, fviz_contrib and fviz_cos2 functions, to compute FactoMineR MFA and HMFA results (@inventionate, #4).
fviz_cluster() added. This function can be used to visualize the outputs of clustering methods including: kmeans() [stats package]; pam(), clara(), fanny() [cluster package]; dbscan() [fpc package]; Mclust() [mclust package]; HCPC() [FactoMineR package]; hkmeans() [factoextra].
fviz_silhouette() added. Draws the result of cluster silhouette analyses computed using the function silhouette()[cluster package]
fviz_nbclust(): Determines and visualizes the optimal number of clusters
fviz_gap_stat(): Visualize the gap statistic generated by the function clusGap() [in cluster package]
hcut(): Computes hierarchical clustering and cut the tree into k clusters.
hkmeans(): Hierarchical k-means clustering. Hybrid approach to avoid the initial random selection of cluster centers.
get_clust_tendency(): Assessing clustering tendency
fviz_dend(): Enhanced visualization of dendrogram
eclust(): Visual enhancement of clustering analysis
get_dist() and fviz_dist(): Enhanced Distance Matrix Computation and Visualization
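The hybrid idea behind hierarchical k-means, listed above, can be sketched with base R: hierarchical clustering supplies the initial centers, so k-means no longer depends on a random start. This is only an illustration of the principle; the Ward linkage, the Euclidean distance and the object names are assumptions made here, and the actual hkmeans() implementation may differ.

```r
df <- scale(USArrests)
k <- 4

# Step 1: hierarchical clustering, cut into k groups
# (Ward linkage on Euclidean distances is an assumption of this sketch)
hc <- hclust(dist(df), method = "ward.D2")
grp <- cutree(hc, k = k)

# Step 2: use the group means as initial centers for k-means
agg <- aggregate(as.data.frame(df), by = list(cluster = grp), FUN = mean)
centers <- as.matrix(agg[, -1])

# Step 3: k-means started from those centers (no random initialization)
km <- kmeans(df, centers = centers)

# Cross-tabulate the initial partition against the final one
print(table(hierarchical = grp, kmeans = km$cluster))
```

Because the starting centers are fixed by the tree, repeated runs give the same partition without calling set.seed().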