Pure set data visualization approaches are often limited in scalability due to the combinatorial explosion of distinct set families as the number of sets under investigation increases. hierarchicalSets applies a set centric hierarchical clustering of the sets under investigation and uses this hierarchy as a basis for a range of scalable visual representations. hierarchicalSets is especially well suited for collections of sets that describe comparable comparable entities as it relies on the sets to have a meaningful relational structure.

This is a package that facilitates hierarchical set analysis on large collections of sets.

*Hierarchical Set Analysis* is a way to investigate large numbers of sets. It
consists of two things: A novel hierarchical clustering algorithm for sets and
a range of visualizations that builds on top of the resulting clustering.

**The clustering**, in contrast to more traditional approaches to hierarchical
clustering, does not rely on a derived distance measure between the sets, but
instead works directly with the set data itself using set algebra. This has two
results: First, it is much easier to reason about the result in a set context,
and second, you are not forced to provide a distance if none exists (sets are
completely independent). The clustering is based on a generalization of Jaccard
similarity, called *Set Family Homogeneity*, that, in its simplest form, is
defined as the size of the intersection of the sets in the family divided by the
size of the union of the sets in the family. The clustering progresses by
iteratively merging the set families that shows the highest set family
homogeneity and is terminated once all remaining set family pairs have a
homogeneity of 0. Note that this means that the clustering does not necessarily
end with the all sets in one overall cluster, but possibly split into several
hierarchies - this is intentional.

**The visualizations** uses the derived clustering as a scaffold to show e.g.
intersection and union sizes of set combinations. By narrowing the number of
set families to visualize to the branch points of the hierarchy the number of
data points to show is linearly related to the number of sets under
investigation, instead of the exponential if we chose to show everything. This
means that hierachical set visualizations are much more scalable than e.g. UpSet
and Radial Sets, at the cost of only showing combinations of the progressively
most similar sets. Apart from intersection and union sizes there is a secondary
analysis build into hierarchical sets that detects the elements not conforming
to the imposed hierarchy. These *outlying elements* defines a subset of the
universe that deviates from the rest and can be quite interesting -
visualizations to investigate these are of course also provided.

Hierachical set analysis is obviuously sensible for collections of sets where a hierarchical structure makes sense, but even for set collection that does not obviously support a hierarchy, it can be interesting to look at how the different sets do, or do not, relate to each other.

The stable version is available on CRAN with
`install.packages('hierarchicalSets')`

. Alternatively the development version
can be obtained from GitHub using devtools:

if (!require(devtools)) { install.packages('devtools')}devtools::install_github('thomasp85/hierarchicalSets')

hierarchicalSets comes with a toy dataset containing the followers of 100
prolific anonymous twitter users. To create the hierarchical clustering you use
the `create_hierarchy()`

function.

library(hierarchicalSets)data('twitter') twitSet <- create_hierarchy(twitter)twitSet## A HierarchicalSet object## ## Universe size: 28459## Number of sets: 100## Number of independent clusters: 6

To simply have a look at the hierarchy you plot it:

plot(twitSet)

Here the x axis is encoded with the *Set Family Heterogeneity* which is the
inverse of the homogeneity. It can thus be interpreted as the ratio of union to
intersection size.

Usually we are interested in the direct numbers which can be provided with another plottype - the intersection stact

plot(twitSet, type = 'intersectStack', showHierarchy = TRUE)

We see that especially four sets are very similar, incidentally four of the largest sets. The rightmost cluster is interesting as well as we see it is held together by a very small overlap and that each individual set contains a lot of unique followers.

While it is expected that followers share some of the same patterns in terms of who they follow, it cannot be expected that they fully adhere to a single hierarchy. We can look at how sets are connected across the hierarchy by counting the number of elements they have in common that is not part of their closest shared set family. This can be shown using hierarchical edge bundles:

plot(twitSet, type = 'outlyingElements', quantiles = 0.8, alpha = 0.2)

It seems our four sets again draws attention to themselves by having strong connections to a range of other sets distant in terms of the clustering. Also it is obvious that there is two seperate groups of sets in terms of their deviation profile. There might be a secondary structure hidden within the outlying elements. Lets create a hierarchical set clustering based only on the outlying elements:

twitSet2 <- outlier_hierarchy(twitSet)twitSet2## A HierarchicalSet object## ## Universe size: 14657## Number of sets: 100## Number of independent clusters: 7

In this way it is possible to gradually shave off hierarchical structures, revealing the uncaptured relations of the prior analysis... Happy investigation!

Hierarchical Sets is static on purpose. I firmly belive that effective static plots are the foundation for any good visualization. You can always augment a static visualization with interactivity, but not all interactive visualizations can be used in a static way. That being said, it could be fun to experiment with said augmentation within a shiny app. Also, the implementation begins to struggle (just slow down actually) when being used with thousands of sets and millions of elements - some C++ wizardry might be warranted for such huge datasets.

Another nice idea I have in mind is to be able to keep set and element metadata within the HierarchicalSet object and seemlessly use it for plotting.

- Bug Fix: Spelling errors in class name - ugh...
- Bug Fix: Typecast hidden size_t to int to make it build on windows x64
- Feature: More utility functions. Quickly get size and members of clusters as well as membership for each set.

- Release on CRAN