Provides a framework to perform Non-negative Matrix Factorization (NMF). The package implements a set of already published algorithms and seeding methods, and provides a framework to test, develop and plug new/custom algorithms. Most of the built-in algorithms have been optimized in C++, and the main interface function provides an easy way of performing parallel computations on multicore machines.
Nonnegative Matrix Factorization (NMF) is an unsupervised learning technique that has been applied successfully in several fields, including signal processing, face recognition and text mining. Recent applications of NMF in bioinformatics have demonstrated its ability to extract meaningful information from high-dimensional data such as gene expression microarrays. Developments in NMF theory and applications have resulted in a variety of algorithms and methods. However, most NMF implementations have been on commercial platforms, while those that are freely available typically require programming skills. This limits their use by the wider research community.
Our objective is to provide the bioinformatics community with an open-source, easy-to-use and unified interface to standard NMF algorithms, as well as with a simple framework to help implement and test new NMF methods. For that purpose, we have developed a package for the R/BioConductor platform. The package ports public code to R, and is structured to enable users to easily modify and/or add algorithms. It includes a number of published NMF algorithms and initialization methods and facilitates the combination of these to produce new NMF strategies. Commonly used benchmark data and visualization methods are provided to help in the comparison and interpretation of the results.
The NMF package helps realize the potential of Nonnegative Matrix Factorization, especially in bioinformatics, providing easy access to methods that have already yielded new insights in many applications.
Documentation, source code and sample data are available from:
Latest stable release from CRAN: http://cran.r-project.org/package=NMF
Changes in version 0.20.6
FIXES o fixed new NOTEs in R CMD check (about requireNamespace) o fixed error in heatmaps due to new version of stringr (>= 1.0.0)
Changes in version 0.20
NEW FEATURES o aheatmap gains an argument txt that enables displaying text in each cell of the heatmap.
FIXES o now all examples, vignettes and unit tests comply with CRAN policies on the maximum number of cores to be used when checking packages (2).
Changes in version 0.18
NEW FEATURES o aheatmap gains distfun methods 'spearman', 'kendall', and, for completeness, 'pearson' (for which 'correlation' is now an alias), which specifies the correlation method to use when computing the distance matrix.
CHANGES o In order to fully comply with CRAN policies, internals of the aheatmap function slightly changed. In particular, this brings an issue when directly plotting to PDF graphic devices, where a first blank page may appear. See the dedicated section in the man page ?aheatmap for a work around.
Changes in version 0.17.3
NEW FEATURES o add silhouette computation for NMF results, which can be performed on samples, features or consensus matrix. The average silhouette width is also computed by the summary methods. (Thanks to Gordon Robertson for this suggestion)
o New runtime option 'shared.memory' (or 'm') for toggling usage of shared memory (requires package synchronicity).
CHANGES o some plots are -- finally -- generated using ggplot2 This adds an Imports dependency to ggplot2 and reshape2.
BUG FIXES o fix a bug when running parallel NMF computation with algorithms defined in other packages (this most probably only affected my own packages, e.g., CellMix)
Changes in version 0.17.1
CHANGES o Computations seeded with an NMF object now set slot @seed to 'NMF' o Added unit tests on seeding with an NMF object o Removed some obsolete comments
BUG FIXES o an error was thrown when running multiple sequential NMF computations with nrun>=50: object '.MODE_SEQ' not found. (reported by Kenneth Lopiano)
Changes in version 0.16.5
CHANGES o Now depends on pkgmaker 0.16 o Disabled shared memory on Mac machines by default, due to un-resolved bug that occurs with big-matrix descriptors. It can be restored using nmf.options(shared.memory=TRUE) o Package version is now shown on startup message o Re-enabled unit test checks on CRAN, using the new function pkgmaker::isCHECK, that definitely identifies if tests are run under R CMD check.
Changes in version 0.15.2 *
o New function nmfObject() to update old versions of NMF objects, eg., saved on disk. o NMF strategies can now define default values for their parameters, eg., `seed` to associate them with a seeding method that is not the default method, `maxIter` to make default runs with more iterations (for algorithms defined as NMFStrategyIterative), or any other algorithm-specific parameter. o new general utility function hasArg2 that is identical to hasArg but takes the argument name as a character string, so that no check NOTE is thrown.
o All vignettes are now generated using knitr. o New dependency to pkgmaker which incorporated many of the general utility functions, initially defined for the NMF package. o example check is faster (as requested by CRAN) o the function selectMethodNMF was renamed into selectNMFMethod to be consistent with the other related *NMFMethod functions (get, set, etc..)
o Fix computation of R-squared in profplot/corplot: now compute from a linear fit that includes an intercept.
Changes in version 0.8.6 *
o Formula based NMF models that can incorporate fixed terms, which may be used to correct for covariates or define group specific offsets.
o Subsetting an NMF object with a single index now returns an NMF object, except if argument drop is not missing (i.e. either TRUE or FALSE).
Changes in version 0.6.03 *
WARNING Due to the major changes made in the internal structure of the standard NMF models, previous NMF fits are not compatible with this version.
o The factory function nmfModel has been enhanced and provides new methods that makes more easier the creation of NMF objects. See ?nmfModel. o A new heatmap drawing function 'aheatmap' (for annotated heatmap) is now used to generate the different heatmaps (basismap, coefmap and consensusmap). It is a enhancement/fork of the function pheatmap from package pheatmap and draw -- I think -- very nice heatmaps, providing a convenient and flexible interface to add annotation tracks to both the columns and rows, with sensible automatic legends.
o Method nmfModel when called with no arguments does not return anymore the list of available NMF models, but an empty NMF model. To list the available models, directly call `nmfModels()`. o The function `rmatrix` is now a S4 generic function. It gains methods for generating random matrices based on a template matrix or an NMF model. See ?rmatrix. o The function `rnmf` gains a method to generate a random NMF model given numerical dimensions. o The function nmfEstimateRank now returns the fits for each value of the rank in element 'fit' of the result list. See ?nmfEstimateRank.
Changes in version 0.5.3 *
o The state of the random number generator is systematically stored in the 'NMFfit' object returned by function 'nmf'. It is stored in a new slot 'rng.seed' and can be access via the new method 'rngSeed' of class 'NMFfit'. See ?rngSeed for more details. o The number of cores to use in multicore computations can now also be specified by the 'p - parallel' runtime option (e.g. 'p4' to use 4 cores). Note that the value specified in option 'p' takes precedence on the one passed via argument '.pbackend'. See section 'Runtime options' in ?nmf for more details. o Function 'nmfApply' now allows values 3 and 4 for argument 'MARGIN' to apply a given function to the basis vectors (i.e. columns of the basis matrix W) and basis profiles (i.e. rows of the mixture coefficients H) respectively. See ?nmfApply for more details. o New S4 generic 'basiscor' and 'profcor' to compute the correlation matrices of the basis vectors and basis profiles respectively from two NMF models or from an NMF model and a given compatible matrix. See ?basiscor or ?profcor for more details. o New S4 generic 'fitcmp' to compare the NMF models fitted by two different runs of function 'nmf' (i.e. two 'NMFfit' objects). See ?fitcmp for more details. o New S4 generic 'canFit' that tells if an NMF method is able to exactly or partly fit a given NMF model. See ?canFit for more details. o New function 'selectMethodNMF' that selects an appropriate NMF method to fit a given NMF model. See ?selectMethodNMF for more details. o The verbosity level can be controlled more finely by the 'v - verbose' runtime option (e.g. using .options='v1' or 'v2'). The greater is the level the more information is printed. The verbose outputs have been cleaned-up and should be more consistent across the run mode (sequential or multicore). See section 'Runtime options' in ?nmf for more details.
o The standard update equations have been optimised further, by making them modify the factor matrices in place. This speeds up the computation and greatly improves the algorithms' memory footprint. o The package NMF now depends on the package digest that is used to display the state of random number generator in a concise way. o The methods 'metaHeatmap' are split into 3 new S4 generic 'basismap' to plot a heatmap of the basis matrix [formerly plotted by the call : 'metaHeatmap(object.of.class.NMF, 'features', ...) ], 'coefmap' to plot a heatmap of the mixture coefficient matrix [formerly plotted by the call : 'metaHeatmap(object.of.class.NMF, 'samples', ...) ], and 'consensusmap' to plot a heatmap of the consensus matrix associated with multiple runs of NMF [formerly plotted by the call : 'metaHeatmap(object.of.class.NMFfitX, ...) ].
o In factory method 'nmfModel': the colnames (resp. rownames) of matrix W (resp. H) are now correctly set when the rownames of H (resp. the colnames of W) are not null.
NEWLY DEPRECATED CLASSES, METHODS, FUNCTIONS, DATA SETS
o Deprecated Generics/Methods 1) 'metaHeatmap,NMF' and 'metaHeatmap,NMFfitX' - S4 methods remain with .Deprecated message. They are replaced by the definition of 3 new S4 generic 'basismap', 'coefmap' and 'consensusmap'. See the related point in section CHANGES above. They will be completely removed from the package in the next version.
Changes in version 0.5.1 *
o fix a small bug in method 'rss' to allow argument 'target' to be a 'data.frame'. This was generating errors when computing the summary measures and especially when the function 'nmfEstimateRank' was called on a 'data.frame'. Thanks to Pavel Goldstein for reporting this.
Changes in version 0.5 *
o Method 'fcnnls' provides a user-friendly interface for the internal function '.fcnnls' to solve non-nengative linear least square problems, using the fast combinatorial approach from Benthem and Keenan (2004). See '?fcnnls' for more details. o New argument 'callback' in method 'nmf' allows to provide a callback function when running multiple runs in default mode, that only keeps the best result. The callback function is applied to the result of each run before it possibly gets discarding. The results are stored in the miscellaneous slot '.callback' accessible via the '$' operator (e.g. res$.callback). See '?nmf' for more details. o New method 'niter' to retrieve the number of iterations performed to fit a NMF model. It is defined on objects of class 'NMFfit'. o New function 'isNMFfit' to check if an object is a result from NMF fitting. o New function 'rmatrix' to easily generate random matrices, and allow to specify the distribution from which the entries are drawn. For example: * 'rmatrix(100, 10)' generates a 100x10 matrix whose entries are drawn from the uniform distribution * 'rmatrix(100, 10, rnorm)' generates a 100x10 matrix whose entries are drawn from the standard Normal distribution. o New methods 'basisnames' and 'basisnames<-' to retrieve and set the basis vector names. See '?basisnames'.
o Add a CITATION file that provides the bibtex entries for citing the BMC Bioinformatics paper for the package (http://www.biomedcentral.com/1471-2105/11/367), the vignette and manual. See 'citation('NMF')' for complete references. o New argument 'ncol' in method 'nmfModel' to specify the target dimensions more easily by calls like 'nmfModel(r, n, p)' to create a r-rank NMF model that fits a n x p target matrix. o The subset method '' of objects of class 'NMF' has been changed to be more convenient. See
o Method 'dimnames' for objects of class 'NMF' now correctly sets the names of each dimension.
REMOVED CLASSES, METHODS, FUNCTIONS, DATA SETS
o Method 'extra' has completely been removed.
Changes in version 0.4.8 *
o When computing NMF with the SNMF/R(L) algorithms, an error could occur if a restart (recomputation of the initial value) for the H matrix (resp. W matrix) occured at the first iteration. Thanks to Joe Maisog for reporting this.
Changes in version 0.4.7 *
o When computing the cophenetic correlation coefficient of a diagonal matrix, the 'cophcor' method was returning 'NA' with a warning from the 'cor' function. It now correctly returns 1. Thanks to Joe Maisog for reporting this.
Changes in version 0.4.4 *
o The major change is the explicit addition of the synchronicity package into the suggested package dependencies. Since the publication of versions 4.x of the bigmemory package, it is used by the NMF package to provide the mutex functionality required by multicore computations. Note This is relevant only for Linux/Mac-like platforms as the multicore package is not yet supported on MS Windows. Users using a recent version of the bigmemory package (i.e. >=4.x) DO NEED to install the synchronicity package to be able to run multicore NMF computations. Versions of bigmemory prior to 4.x include the mutex functionality. o Minor enhancement in error messages o Method 'nmfModel' can now be called with arguments 'rank' and 'target' swapped. This is for convenience and ease of use.
o Argument 'RowSideColors' of the 'metaHeatmap' function is now correctly subset according to the value of argument 'filter'. However the argument must be named with its complete name 'RowSideColors', not assuming partial match. See KNOWN ISSUES.
o In the 'metaHeatmap' function, when argument 'filter' is not 'FALSE': arguments 'RowSideColors', 'labRow', 'Rowv' that are intended to the 'heatmap.plus'-like function (and are used to label the rows) should be named using their complete name (i.e. not assuming partial match), otherwise the filtering is not applied to these argument and an error is generated. This issue will be fixed in a future release.
Changes in version 0.4.3 *
o function 'nmfEstimateRank' has been enhanced: -run options can be passed to each internal call to the 'nmf' function. See ?nmfEstiamteRank and ?nmf for details on the supported options. - a new argument 'stop' allows to run the estimation with fault tolerance, skipping runs that throw an error. Summary measures for these runs are set to NAs and a warning is thrown with details about the errors. o in function 'plot.NMF.rank': a new argument 'na.rm' allows to remove from the plots the ranks for which the measures are NAs (due to errors during the estimation process with 'nmfEstimateRank').
o Method 'consensus' is now exported in the NAMESPACE file. Thanks to Gang Su for reporting this. o Warnings and messages about loading packages are now suppressed. This was particularly confusing for users that do not have the packages and/or platform required for parallel computation: warnings were printed whereas the computation was -- sequentially -- performed without problem. Thanks to Joe Maisog for reporting this.
Changes in version 0.4.1 *
o The 'metaHeatmap' function was not correctly handling row labels when argument filter was not FALSE. All usual row formating in heatmaps (label and ordering) are now working as expected. Thanks to Andreas Schlicker, from The Netherlands Cancer Institute for reporting this. o An error was thrown on some environments/platforms (e.g. Solaris) where not all the packages required for parallel computation were not available -- even when using option 'p' ('p' in lower case), which should have switched the computation to sequential. This is solved and the error is only thrown when running NMF with option 'P' (capital 'P'). o Not all the options were passed (e.g. 't' for tracking) in sequential mode (option '-p'). o verbose/debug global nmf.options were not restored if a numerical random seed were used.
o The 'metaHeatmap' function nows support passing the name of a filtering method in argument 'filter', which is passed to method 'extractFeatures'. See ?metaHeatmap. o Verbose and debug messages are better handled. When running a parallel computation of multiple runs, verbose messages from each run are shown only in debug mode.
Changes in version 0.4 *
o Part of the code has been optimised in C++ for speed and memory efficiency: - the multiplicative updates for reducing the KL divergence and the euclidean distance have been optimised in C++. This significantly reduces the computation time of the methods that make use of them: 'lee', 'brunet', 'offset', 'nsNMF' and 'lnmf'. Old R version of the algorithm are still accessible with the suffix '.R#'. - the computation of euclidean distance and KL divergence are implemented in C++, and do not require the duplication of the original matrices as done in R. o Generic 'dimnames' is now defined for objects of class 'NMF' and returns a list with 3 elements: the row names of the basis matrix, the column names of the mixture coefficient matrix , and the column names of the basis matrix. This implies that methods 'rownames' and 'columnames' are also available for 'NMF' objects. o A new class structure has been developed to handle the results of multiple NMF runs in a cleaner and more structured way: - Class 'NMFfitX' defines a common interface for multiple NMF runs of a single algorithm. - Class 'NMFfitX1' handles the sub-case where only the best fit is returned. In particular, this class allows to handle such results as if coming from a single run. - Class 'NMFfitXn' handles the sub-case where the list of all the fits is returned. - Class 'NMFList' handles the case of heterogeneous NMF runs (different algorithms, different factorization rank, different dimension, etc...) o The vignette contains more examples and details about the use of package. o The package is compatible with both versions 3.x and 4.x of the bigmemory package. This package is used when running multicore parallel computations. With version 4.x of bigmemory, the synchronicity package is also required as it provides the mutex functionality that used to be provided by bigmemory 3.x.
o Running in multicore mode from the GUI on MacOS X is not allowed anymore as it is not safe and were throwing an error ['The process has forked and ...']. Thanks to Stephen Henderson from the UCL Cancer Institute (UK) for reporting this. o Function 'nmf' now restores the random seed to its original value as before its call with a numeric seed. This behaviour can be disabled with option 'restore.seed=FALSE' or '-r'
NEWLY DEPRECATED CLASSES, METHODS, FUNCTIONS, DATA SETS
o Deprecated Generics/Methods 1) 'errorPlot' - S4 generic/methods remains with .Deprecated message. It is replaced by a definition of the 'plot' method for signatures 'NMFfit,missing' and 'NMFList,missing' It will be completely removed from the package in the next version. o Deprecated Class 1) 'NMFSet' - S4 class remains for backward compatibility, but is not used anymore. It is replaced by the classes 'NMFfitX1', 'NMFfitXn', 'NMFList'.
Changes in version 0.3 *
o Now requires R 2.10 o New list slot 'misc' in class 'NMF' to be able to define new NMF models, without having to extend the S4 class. Access is done by methods '$' and '$<-'. o More robust and convenient interface 'nmf' o New built-in algorithm : PE-NMF [Zhang (2008)] o The vignette and documentation have been enriched with more examples and details on the functionalities. o When possible the computation is run in parallel on all the available cores. See option 'p' or 'parallel=TRUE' in argument '.options' of function 'nmf'. o Algorithms have been optimized and run faster o Plot for rank estimation: quality measure curves can be plotted together with a set of reference measures. The reference measures could for example come from the rank estimation of randomized data, to investigate overfitting. o New methods '$' and '$<-' to access slot 'extra' in class 'NMFfit' These methods replace method 'extra' that is now defunct. o Function 'randomize' allows to randomise a target matrix or 'ExpressionSet' by permuting the entries within each columns using a different permutation for each column. It is useful when checking for over-fitting.
o The 'random' method of class 'NMF' is renamed 'rnmf', but is still accessible through name 'random' via the 'seed' argument in the interface method 'nmf'.
NEWLY DEFUNCT CLASSES, METHODS, FUNCTIONS, DATA SETS
o Defunct Generics/Methods 1) 'extra' - S4 generic/methods remains with .Defunct message. It will be completely removed from the package in the next version.
Changes in version 0.2.4 *
o Class 'NMFStrategy' has a new slot 'mixed', that specify if the algorithm can handle mixed signed input matrices. o Method 'nmf-matrix,numeric,function' accepts a new parameter 'mixed' to specify if the custom algorithm can handle mixed signed input matrices.
Changes in version 0.2.3 *
o Package 'Biobase' is not required anymore, and only suggested. The definition and export of the NMF-BioConductor layer is done at loading. o 'nmfApply' S4 generic/method: a 'apply'-like method for objects of class 'NMF'. o 'predict' S4 method for objects of class 'NMF': the method replace the now deprecated 'clusters' method. o 'featureNames' and 'sampleNames' S4 method for objects of class 'NMFSet'. o sub-setting S4 method '[' for objects of class 'NMF': row subsets are applied to the rows of the basis matrix, column subsets are applied to the columns of the mixture coefficient matrix.
o method 'featureScore' has a new argument 'method' to allow choosing between different scoring schema. o method 'extractFeatures' has two new arguments: 'method' allows choosing between different scoring and selection methods; 'format' allows to specify the output format. NOTE: the default output format has changed. The default result is now a list whose elements are integer vectors (one per basis component) that contain the indices of the basis-specific features. It is in practice what one is usually interested in.
o Methods 'basis<-' and 'coef<-' were not exported by file NAMESPACE. o Method 'featureNames' was returning the column names of the mixture coefficient matrix, instead of the row names of the basis matrix. o 'metaHeatmap' with argument 'class' set would throw an error.
NEWLY DEPRECATED CLASSES, METHODS, FUNCTIONS, DATA SETS
o Deprecated Generics/Methods 1) clusters - S4 generic/methods remain with .Deprecated message