In the framework of Symbolic Data Analysis, a relatively new approach to the statistical analysis of multi-valued data, we consider histogram-valued data, i.e., data described by univariate histograms. The methods and the basic statistics for histogram-valued data are mainly based on the L2 Wasserstein metric between distributions, i.e., a Euclidean metric between quantile functions. The package contains unsupervised classification techniques, least squares regression, and tools for histogram-valued data and for histogram time series.
In this document we describe the main features of the HistDAWass package. The name is an acronym for Histogram-valued Data Analysis using Wasserstein metric. The implemented classes and functions are related to the analysis of data tables that contain a histogram in each cell instead of a classical numeric value.
What is the L2 Wasserstein metric?
Given two probability density functions $f$ and $g$, with cumulative distribution functions $F$ and $G$ and respective quantile functions $Q_f$ and $Q_g$ (the inverses of the cumulative distribution functions), the L2 Wasserstein distance is
$$d_W(f,g)=\sqrt{\int\limits_0^1{(Q_f(p) - Q_g(p))^2 dp}}$$
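
As an illustration of the formula, the following is a minimal base R sketch, not the package's own implementation. It assumes each histogram is given by its bin boundaries and the cumulative probabilities at those boundaries, that values are uniformly spread within each bin (so the quantile function is piecewise linear), and that the cumulative probabilities are strictly increasing. The function name `wass_l2` is hypothetical.

```r
# L2 Wasserstein distance between two histograms (illustrative sketch).
# x1, x2: bin boundaries; p1, p2: cumulative probabilities at the boundaries.
wass_l2 <- function(x1, p1, x2, p2) {
  # Merge the probability levels of both histograms: between two
  # consecutive levels both quantile functions are linear.
  lev <- sort(unique(c(p1, p2)))
  # Difference of the two quantile functions at the merged levels.
  d <- approx(p1, x1, xout = lev)$y - approx(p2, x2, xout = lev)$y
  w <- diff(lev)
  d0 <- d[-length(d)]
  d1 <- d[-1]
  # Exact integral of the squared (linear) difference on each interval.
  sqrt(sum(w * (d0^2 + d0 * d1 + d1^2) / 3))
}

# A histogram and a copy of it shifted by one unit.
dist_shift <- wass_l2(c(0, 1, 2), c(0, 0.3, 1), c(1, 2, 3), c(0, 0.3, 1))
# Uniform on [0, 1] versus uniform on [0, 2].
dist_unif <- wass_l2(c(0, 1), c(0, 1), c(0, 2), c(0, 1))
```

In the first call the two quantile functions differ by a constant 1, so the distance is 1; in the second, the difference is $p$, so the distance is $\sqrt{1/3}$.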
The implemented classes are those described in the following table
Class | wrapper function for initializing | Description |
---|---|---|
distributionH | distributionH(x,p) | A class describing a histogram distribution |
MatH | MatH(x, nrows, ncols,rownames,varnames, by.row ) | A class describing a matrix of distributions |
TdistributionH | TdistributionH() | A class derived from distributionH equipped with a timestamp or a time window |
HTS | HTS() | A class describing a histogram-valued time series |

    library(HistDAWass)
    mydist <- distributionH(x = c(0, 1, 2), p = c(0, 0.3, 1))

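
To make the meaning of the two arguments concrete, here is a small base R sketch that does not use the package. It assumes that `x` holds the bin boundaries and `p` the cumulative probabilities at those boundaries, with values uniformly spread within each bin. The variable and function names are illustrative only.

```r
x <- c(0, 1, 2)      # bin boundaries: bins [0, 1] and [1, 2]
p <- c(0, 0.3, 1)    # cumulative probabilities at the boundaries

# Probability mass and midpoint of each bin.
bin_mass <- diff(p)
bin_mid <- (x[-length(x)] + x[-1]) / 2

# Mean of the histogram under within-bin uniformity.
hist_mean <- sum(bin_mass * bin_mid)

# Quantile function: linear interpolation of x against p
# (assumes strictly increasing cumulative probabilities).
hist_quantile <- function(probs) approx(x = p, y = x, xout = probs)$y
hist_median <- hist_quantile(0.5)
```

Under these assumptions the first bin carries a mass of 0.3 and the second a mass of 0.7, which gives a mean of 1.2.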
From raw data to histograms
===========================

Basic statistics for a distributionH (a histogram)
==================================================
The average histogram of a column
The standard deviation of a variable
The covariance matrix of a MatH
The correlation matrix of a MatH
plot of a distributionH
plot of a MatH
plot of a HTS
Clustering
Kmeans
Adaptive distance based Kmeans
Fuzzy cmeans
Fuzzy cmeans based on adaptive Wasserstein distances
Kohonen batch self organizing maps
Kohonen batch self organizing maps with Wasserstein adaptive distances
Hierarchical clustering
Dimension reduction techniques
Principal components analysis of a single histogram variable
Principal components analysis of a set of histogram variables (using Multiple Factor Analysis)
Smoothing
Moving averages
Exponential smoothing
Predicting
A two-component model for linear regression using the least squares method