Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering with the option to plot, validate, predict (new data) and estimate the optimal number of clusters. The package takes advantage of 'RcppArmadillo' to
speed up the computationally intensive parts of the functions. For more information, see (i) "Clustering in an Object-Oriented Environment" by Anja Struyf, Mia Hubert, Peter Rousseeuw (1997), Journal of Statistical Software,

The ClusterR package provides Gaussian mixture models, k-means, mini-batch-kmeans and k-medoids clustering algorithms, with options to plot, validate, predict (on new data) and find the optimal number of clusters. The package uses 'RcppArmadillo' to speed up the computationally intensive parts of the functions. More details on the functionality of ClusterR can be found in the package Vignette. ClusterR can currently be installed on Linux, Mac and Windows.

To install the package from CRAN use,

` install.packages("ClusterR") `

and to download the latest version from GitHub use the *install_github* function of the devtools package,

` devtools::install_github('mlampros/ClusterR') `
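Once the package is installed, a first k-means run might look like the minimal sketch below. The synthetic data, the seed and the parameter values are assumptions chosen for illustration only; see the package documentation for the full argument list of *KMeans_rcpp* and *predict_KMeans*.

```r
library(ClusterR)

set.seed(1)

# two well-separated synthetic groups in two dimensions (illustrative data)
dat <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# fit k-means with 2 clusters and 5 random restarts
km <- KMeans_rcpp(dat, clusters = 2, num_init = 5, max_iters = 100,
                  initializer = 'kmeans++')

# assign observations (here the training data itself) to the fitted centroids
pr <- as.vector(predict_KMeans(dat, km$centroids))

table(pr)
```

The same fitted centroids can be reused to assign previously unseen rows, provided the new data has the same columns as the training data.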

Use the following link to report bugs/issues,

- I added the *DARMA_64BIT_WORD* flag in the Makevars file to allow the package to process big datasets.
- I modified the *kmeans_miniBatchKmeans_GMM_Medoids.cpp* file, and especially all *Rcpp::List::create()* objects, to address the clang-ASAN errors.

- I modified the *Optimal_Clusters_KMeans* function to return a vector with the *distortion_fK* values if the criterion is *distortion_fK* (instead of the *WCSSE* values).
- I added the 'Moore-Penrose pseudo-inverse' for the case of the 'mahalanobis' distance calculation.

- I modified the *OpenMP* clauses of the .cpp files to address the ASAN errors.
- I removed the *threads* parameter from the *KMeans_rcpp* function to address the ASAN errors (the performance difference between the threaded and non-threaded versions is negligible, especially if the *num_init* parameter is less than 10). The *threads* parameter was also removed from the *Optimal_Clusters_KMeans* function, as it uses the *KMeans_rcpp* function to find the optimal clusters for the various methods.

I modified the *kmeans_miniBatchKmeans_GMM_Medoids.cpp* file in the following lines in order to fix the clang-ASAN errors (without loss in performance):

- lines 1156-1160: I commented out the second OpenMP parallel loop and replaced the *k* variable with the *i* variable in the second for-loop [in the *dissim_mat()* function]
- lines 1739-1741: I commented out the second OpenMP parallel loop [in the *silhouette_matrix()* function]
- I replaced all *silhouette_matrix* (arma::mat) variable names with *Silhouette_matrix*, because the name overlapped with the name of the Rcpp function [in the *silhouette_matrix* function]
- I replaced all *sorted_medoids.n_elem* with the variable *unsigned int sorted_medoids_elem* [in the *silhouette_matrix* function]

I modified the following *functions* in the *clustering_functions.R* file:

- *KMeans_rcpp()*: I added an *experimental* note in the details for the *optimal_init* and *quantile_init* initializers.
- *Optimal_Clusters_KMeans()*: I added an *experimental* note in the details for the *optimal_init* and *quantile_init* initializers.
- *MiniBatchKmeans()*: I added an *experimental* note in the details for the *optimal_init* and *quantile_init* initializers.

The *normalized variation of information* was added in the *external_validation* function (https://github.com/mlampros/ClusterR/pull/1)
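As a rough illustration of the new option, the sketch below compares a clustering against known labels using the normalized variation of information. The two label vectors are made-up toy data, and the method name "nvi" is assumed to be the identifier that selects this measure.

```r
library(ClusterR)

# toy labels: the second vector misassigns one observation (illustrative data)
true_labels <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
clusters    <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)

# normalized variation of information between the two partitions
nvi <- external_validation(true_labels, clusters,
                           method = "nvi", summary_stats = FALSE)

print(nvi)
```

Lower values indicate closer agreement between the two partitions, with 0 corresponding to identical partitions.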

I fixed the valgrind memory errors

I removed the warnings that occurred during compilation.
I corrected the UBSAN memory errors that occurred due to a mistake in the *check_medoids()* function of the *utils_rcpp.cpp* file.
I also modified the *quantile_init_rcpp()* function of the *utils_rcpp.cpp* file to print a warning if duplicates are present in the initial centroid matrix.

- I updated the dissimilarity functions to accept data with missing values.
- I added an error exception in the predict_GMM() function in case that the determinant is equal to zero. The latter is possible if the data includes highly correlated variables or variables with low variance.
- I replaced all unsigned int data types in the rcpp files with int data types.
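The zero-determinant case mentioned above can be reproduced with base R alone. The sketch below is not the package's internal code: it uses made-up data in which one column is an exact multiple of another, so the covariance matrix is singular, and it shows a Moore-Penrose pseudo-inverse (via *MASS::ginv*, which ships with standard R distributions) as one general way to work with such a matrix.

```r
set.seed(1)

# x2 is an exact multiple of x1, so the columns are perfectly correlated
x1 <- rnorm(100)
x2 <- 2 * x1
x3 <- rnorm(100)
X  <- cbind(x1, x2, x3)

S <- cov(X)

# the determinant is zero up to floating-point error
print(det(S))

# the ordinary inverse is expected to fail for a singular matrix;
# tryCatch() keeps the script running and returns NULL on error
inv <- tryCatch(solve(S), error = function(e) NULL)
print(is.null(inv))

# the Moore-Penrose pseudo-inverse is defined even when S is singular
pinv <- MASS::ginv(S)
print(dim(pinv))
```

Dropping one of the correlated variables before fitting is usually a simpler remedy than relying on a pseudo-inverse.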

I modified the RcppArmadillo functions so that ClusterR passes the package checks on Windows and OSX.
