High-Quality Visualizations of Large, High-Dimensional Datasets
Implements the largeVis algorithm (see Tang, et al. (2016) < https://arxiv.org/abs/1602.00370>) for visualizing very large high-dimensional datasets. Also very fast search for approximate nearest neighbors; outlier detection; and optimized implementations of the HDBSCAN*, DBSCAN and OPTICS clustering algorithms; plotting functions for visualizing the above.
- Hotfix for issue the caused largeVis to fail if compiled without 64-bit ARMA
- Fix for issue reported by @Dalar where sparse neighbor search could fail with division by zero error
- This is a significant update. Major changes are:
- 20% performance improvements in
- The nearest neighbor search now runs substantially faster because of more efficient use of threads.
projectKNNs now supports training with momentum, which offers an additional 20-50% performance improvement. See the vignette for details.
- OPTICS and DBSCAN are back, rewritten, and substantially improved.
- Other changes:
useDegree parameter allows the user to select the negative weighting method.
largeVis now has an additional parameter,
save_edges, to control whether the edge matrix is preserved. This is to simplify using the
lv_optics now all accept a
largeVis object as the first parameter.
- Both edges and neighbors (or a largeVis object) must be specified for
hdbscanToDendrogram function to make hdbscan objects compatible with other R hierarchical clustering implementations.
- New utility function
sgdBatches helps estimate training time for datasets.
- Fixed bug in estimation of
sgd_batches where 10x to many batches would be used for dataset < 10000 nodes.
distance now store the
distance_method in attribute
method of the returned object.
- The benchmarks vignette has been replaced with a benchmarks readme - the reason is the overhead and expense of repeatedly recalculating the benchmarks on AWS.
- Third vignette covers momentum, the
useDegree parameter, and clustering.
Fix to hdbscan selecting wrong K, and gplot failing.
Hotfix for a bug in the neighbor search when max iterations was 0.
The OPTICS implementation has been temporarily removed. This reason is that the code was based on the code in the
dbscan package, and the CRAN
administrators objected to the inclusion of a separate copyright notice. OPTICS will be restored once the code is sufficiently re-written.
+ Fixed insidious bug that would arise if the edge matrix contained distances > 27.
+ Fixed bug in hdbscan where it would mistakenly conclude that it lacked sufficient neighbors.
+ Fixed bug in buildWijMatrix that caused a segfault on certain 32-bit systems in certain cases.
+ Change to address apparent upstream issue causing compilation problem on 32-bit systems.
+ Fixed bug in hdbscan caused by the knn matrix not matching output from randomProjectionTreeSearch.
- Algorithm Improvements
+ Hdbscan now uses a PairingHeap insead of a Binary Heap, which should improve performance on large datasets.
+ NOTE: buildEdgeMatrix no2 incorporates a regularization constant. Duplicates are now recorded as being a Euclidean distances of 1e-5 apart. The reason for this is that the Matrix package and armadillo sparse matrices do not preserve information about zeros. As a result, with datasets with large numbers of duplicates, the edge matrices appeared to contain no (or too little) neighbor data for some rows. This created issues with other functions dependent on the output of buildEdgeMatrix, such as hdbscan. Adding a regularization constant fixes the issue.
- Interface Improvements
+ On load, now inform the user if the package was compiled for 32-bit or without OpenMP.
- hdbscan algorithm added
- Added thread number parameter to facilitate CRAN limitation on number of cores
- Removed facevector data to facilitate CRAN size limit
- Miscellaneous small changes for CRAN submission
- Note that as of August 22, compilation difficulties on Windows began to appear. This were likely caused by an update to either RcppArmadillo or some Win32-specific software. While I believe the issue has been worked-around, please contact me if you experience any issues dealing with very large datasets on Win32 systems.
- Bug fixes
+ Largely reduced the "fuzzies"
- API Improvements
+ Allow the seed to be set for projectKNNs and randomProjectionTreeSearch
+ If a seed is given, multi-threading is disabled during sgd and the annoy phase of the neighbor search. These
phases of the algorithm would otherwise be non-deterministic. Note that the performance impact is substantial.
+ Verbosity now defaults to the R system option
+ The neighbor matrix returned by randomProjectionTreeSearch is now sorted by distance
+ Improved testing for cosine similarity
+ Many tests are improved by ability to set seed
+ LOF search now tested and exported.
- Refactorings & Improvements
+ Refactored neighbor search to unify code for sparse and dense neighbors, substantially improving sparse performance
+ Now using managed pointers in many places
- Revisions for CRAN release, including verifying correctness by reproducing paper examples, and timing tests/benchmarks
- Tested against the paper authors' wiki-doc and wiki-word datasets
- Tested with up to 2.5m rows, 100m edges (processed in 12 hours).
- Neighbor search:
- Dense search is much, much faster and more efficient
- Tree search for cosine distances uses normalized vectors
- Should be 10x faster for small datasets
- Replaced binary search ( O(n log n) ) with the alias algorithm for weighted sampling ( O(1) )
- Clips and smooths gradients, per discussion with paper authors
- Optimized implementation for alpha == 1
- Removed option for mixing weights into loss function - doesn't make sense if gradients are being clipped.
- Fixed OpenMP-related bug which caused visualizations to be "fuzzy"
- Switched to the STL random number generator, allowing the user to set a seed for reproducible results.
+ Reuse initialization matrices and neighbors, to make it easier to see the effect of hyperparameters
+ Benchmarks now a separate vignette, more detailed
+ Examples removed from vignettes and moved to readme
+ Added examples of manifold map with color faces using OpenFace vectors
- Sigms, P_ij matrix, w_ij matrix
+ Replaced C++ code entirely with new code based on reference implementation
+ Refactored R code into
buildWijMatrix(), which are simpler.
+ Color manifold maps work
+ Ported Karpathy's function for non-overlapping embeddings (experimental)
+ Removed transparency parameter
+ Added ggManifoldMap function for adding a manifold map to a ggplot2 plot
+ vis function renamed largeVis
+ Whether to return neighbors now an adjustable parameter, for memory reasons
+ No longer return sigmas under any circumstance
+ Runs gc() periodically
- Removed most data and extdata that had been included before; this is to reduce size for CRAN submission
- Dependencies & Build
+ Many misc changes to simplify dependencies for CRAN
+ Re-added ARMA_64BIT_WORD; otherwise, could exceed the limitation on size of an arma sparse matrix with moderately sized datasets (~ 1 M rows, K = 100)
+ Now depends on R >= 3.0.2, so RcppProgress and RcppArmadillo could be moved from the Depends section of the DESCRIPTION file
+ Will now compile on systems that lack OpenMP (e.g., OS X systems with old versions of xcode).
- Correctness and Testing
+ Tests are separated by subject
+ Additional, more extensive tests with greater code coverage
+ Added travis testing against OSX
- Very preliminary support for dbscan and optics added, however these functions have not been exported.
- Handles substantially larger datasets
- Support for sparse matrices (for much larger datasets)
- Added better error reporting for tree search
- Handle situation in tree search where nodes are equidistant from the hyperplane
- Broke-out several components as separate functions, which makes a more-memory-efficient mode of operation possible
- Removed some unnecessary checking when processing neighbor graph
- RcppArmadillo 0.7.100.3.0 is now required (this was necessary for support for larger datasets)
- Added appveyor to check Windows compatibility
- Added option of Euclidean or Cosine distance.
- Now using the median in random projection trees, to make splits more even. This should eliminate the need for the
- Vastly improved multi-threading performance
- Added visualization function.
Initial development releases. Focused on correctness, performance, testing against larger datasets.
NEWS.md file to track changes to the package.