Diverse Cluster Ensemble in R

Performs cluster analysis using an ensemble clustering framework. Results from a diverse set of algorithms are pooled together using methods such as majority voting, K-Modes, LinkCluE, and CSPA. There are options to compare cluster assignments across algorithms using internal and external indices, visualizations such as heatmaps, and significance testing for the existence of clusters.



Overview

The goal of diceR is to provide a systematic framework for generating diverse cluster ensembles in R. Cluster analysis involves many nuances. We provide a process and a suite of functions and tools that implement a systematic framework for cluster discovery, guiding the user through the generation of a diverse set of clustering solutions from data, ensemble formation, algorithm selection, and the arrival at a final consensus solution. We have additionally developed visual and analytical validation tools to help assess the final result. The wrapper function dice() lets the user easily obtain results and assess them. Thus, the package is accessible to end users with limited statistical knowledge, while informaticians and statisticians have full access to the underlying functions, which are easily extended.
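
To make the idea of a consensus solution concrete, here is a minimal base R sketch of majority voting over several clustering runs. It is an illustration only, not diceR's implementation: the helper majority_vote() and the toy labels matrix are hypothetical, and the sketch assumes that cluster labels have already been aligned across runs and that every sample was clustered in at least one run. NA marks a sample that was left out of a given subsample.

```r
# Hypothetical helper (not part of diceR): majority vote across runs.
# Assumes labels are already aligned across columns and that each row
# has at least one non-missing label.
majority_vote <- function(labels) {
  apply(labels, 1, function(row) {
    counts <- table(row)  # table() ignores NA by default
    # which.max() returns the first maximum, so ties go to the smallest label
    as.integer(names(counts)[which.max(counts)])
  })
}

# Toy example: 5 samples, 3 runs; NA = sample not in that subsample
labels <- cbind(
  run1 = c(1, 1, 2, 2, NA),
  run2 = c(1, 2, 2, 2, 3),
  run3 = c(1, 1, NA, 2, 3)
)

consensus <- majority_vote(labels)
print(consensus)
```

In practice, labels from independent runs are arbitrary and need to be matched to a common reference before voting; that step is omitted here.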

Installation

You can install diceR from CRAN with:

install.packages("diceR")

Or get the latest development version from GitHub:

devtools::install_github("AlineTalhouk/diceR")

Example

The following example shows how to use the main function of the package, dice(). The data matrix hgsc contains a subset of gene expression measurements of High Grade Serous Carcinoma ovarian cancer patients from The Cancer Genome Atlas publicly available datasets, with samples as rows and features as columns. The call below runs the full pipeline through dice(). We specify the number of clusters (or a range) in nk and the number of subsamples in reps, each subsample containing 80% of the full set of samples. We also specify the clustering algorithms to use in algorithms and the ensemble functions used to aggregate them in cons.funs.

library(diceR)
data(hgsc)
obj <- dice(hgsc, nk = 4, reps = 5, algorithms = c("hc", "diana"),
            cons.funs = c("kmodes", "majority"))

The first few cluster assignments are shown below:

knitr::kable(head(obj$clusters))
                     kmodes  majority
TCGA.04.1331_PRO.C5  3       3
TCGA.04.1332_MES.C1  3       3
TCGA.04.1336_DIF.C4  1       3
TCGA.04.1337_MES.C1  3       3
TCGA.04.1338_MES.C1  3       3
TCGA.04.1341_PRO.C5  3       3
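
As a quick way to see where the two consensus functions agree, the six assignments shown above can be cross-tabulated in base R. The sketch below re-enters those six rows by hand so that it runs without diceR. With a real result you would presumably pass obj$clusters instead, assuming it is a matrix or data frame with one column per consensus function, as the printed output suggests.

```r
# The six rows printed above, re-entered by hand for illustration
clusters <- matrix(
  c(3, 3,
    3, 3,
    1, 3,
    3, 3,
    3, 3,
    3, 3),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("TCGA.04.1331_PRO.C5", "TCGA.04.1332_MES.C1", "TCGA.04.1336_DIF.C4",
      "TCGA.04.1337_MES.C1", "TCGA.04.1338_MES.C1", "TCGA.04.1341_PRO.C5"),
    c("kmodes", "majority")
  )
)

# Contingency table of kmodes labels against majority-voting labels
agreement <- table(kmodes = clusters[, "kmodes"],
                   majority = clusters[, "majority"])
print(agreement)

# Samples whose labels differ between the two consensus functions
disagree <- rownames(clusters)[clusters[, "kmodes"] != clusters[, "majority"]]
print(disagree)
```

Cluster numbering is arbitrary in general, so the contingency table is usually the safer view; the direct label comparison is only meaningful if both columns use the same numbering.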

You can also compare the base algorithms with the cons.funs using internal evaluation indices:

knitr::kable(obj$indices$ii$`4`)
Algorithms       calinski_harabasz  dunn       pbm       tau        gamma      c_index    davies_bouldin  mcclain_rao  sd_dis     ray_turi   g_plus     silhouette  s_dbw      Compactness  Connectivity
HC_Euclidean     4.945499           0.3025234  38.34704  0.1992999  0.5598731  0.3122823  3.100302        0.8237540    0.1795670  3.0886000  0.0278858  0.0300838   NaN        24.81662     49.69405
DIANA_Euclidean  51.332198          0.3348103  32.92726  0.4271483  0.6216897  0.1639431  3.037874        0.8077658    0.2034291  3.1687896  0.0892952  0.0700862   NaN        22.05147     227.34841
kmodes           39.127460          0.3352598  49.27019  0.3907289  0.5528538  0.2020221  1.563373        0.8254116    0.1046540  1.1356906  0.1116735  NaN         0.7207352  22.66419     148.61865
majority         5.645220           0.4315581  96.93674  0.2221915  0.7330421  0.2458043  1.379460        0.7781939    0.0948754  0.8261741  0.0122634  NaN         0.7224928  24.70600     24.35079
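
One simple way to read a table like this is to rank the algorithms on each index and average the ranks. The base R sketch below does this for three of the columns, copied by hand from the output above. The mean-rank aggregation is an illustrative assumption, not necessarily how diceR selects algorithms. It treats a larger Dunn index as better and, following the note in the News section, treats compactness and connectivity as quantities to minimize.

```r
# Three indices copied from the table above
ii <- data.frame(
  Algorithms   = c("HC_Euclidean", "DIANA_Euclidean", "kmodes", "majority"),
  dunn         = c(0.3025234, 0.3348103, 0.3352598, 0.4315581),
  Compactness  = c(24.81662, 22.05147, 22.66419, 24.70600),
  Connectivity = c(49.69405, 227.34841, 148.61865, 24.35079)
)

# Rank 1 = best. Negate dunn because larger is better;
# compactness and connectivity are minimized, so rank them as-is.
ranks <- data.frame(
  Algorithms   = ii$Algorithms,
  dunn         = rank(-ii$dunn),
  Compactness  = rank(ii$Compactness),
  Connectivity = rank(ii$Connectivity)
)

# Illustrative aggregation: average rank across the three indices
ranks$mean_rank <- rowMeans(ranks[, -1])
print(ranks[order(ranks$mean_rank), ])
```

Different indices can disagree, so an aggregate like this is best treated as a rough guide rather than a definitive ordering.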

Pipeline

This figure is a visual schematic of the pipeline that dice() implements.

[Figure: Ensemble clustering pipeline]

Please visit the overview page for more detail.

News

diceR 0.3.2

  • Fix bug in consensus_cluster() when custom algorithms were excluded from output (thanks @phiala)

  • Use markdown language for documentation

  • Various performance improvements and code simplifications

diceR 0.3.1

  • Suppress success/fail message printout and fix input data to be matrix for block clustering

  • Fix bug in algii_heatmap() when k.method = "all" in dice()

  • Fix bug in calculating internal indices when data has categorical variables (thanks Kurt Salmela)

diceR 0.3.0

  • Updated object output names in consensus_evaluate()

  • Fix unit test in test-dice.R for R-devel

  • Add internal function: heatmap of ranked algorithms vs internal validity indices

  • Fix bugs in graph_cdf(), graph_tracking() when only one k selected

  • Add progress messages in dice()

  • Fix bug in consensus_evaluate() when algorithm has NA for all PAC values

diceR 0.2.0

  • New dimension reduction methods: t-SNE, largeVis (@dustin21)

  • Better annotated progress bar using progress package

  • Speed up the operation that transforms a matrix to become "NMF-ready"

  • Simplify saving mechanism in consensus_cluster() such that only file.name needs to be specified, and the save parameter has been removed

  • New algorithms: SOM, Fuzzy C-Means, DBSCAN (@dustin21, #118)

  • Added significance testing section to vignette

  • Fixed direction of optimization: compactness and connectivity should be minimized

diceR 0.1.0

  • New submission to CRAN accepted on June 21, 2017
