Dimensionality reduction techniques for binary data including logistic PCA.

`logisticPCA`

is an R package for dimensionality reduction of binary data. Please note that it is still in the very early stages of development and the conventions will possibly change in the future. A manuscript describing logistic PCA can be found here.

To install R, visit r-project.org/.

The package can be installed by downloading from CRAN.

install.packages("logisticPCA")

To install the development version, first install `devtools`

from CRAN. Then run the following commands.

# install.packages("devtools")library("devtools")install_github("andland/logisticPCA")

Three types of dimensionality reduction are given. For all the functions, the user must supply the desired dimension `k`

. The data must be an `n x d`

matrix comprised of binary variables (i.e. all `0`

's and `1`

's).

`logisticPCA()`

estimates the natural parameters of a Bernoulli distribution in a lower dimensional space. This is done by projecting the natural parameters from the saturated model. A rank-`k`

projection matrix, or equivalently a `d x k`

orthogonal matrix `U`

, is solved for to minimize the Bernoulli deviance. Since the natural parameters from the saturated model are either negative or positive infinity, an additional tuning parameter `m`

is needed to approximate them. You can use `cv.lpca()`

to select `m`

by cross validation. Typical values are in the range of `3`

to `10`

.

`mu`

is a main effects vector of length `d`

and `U`

is the `d x k`

loadings matrix.

`logisticSVD()`

estimates the natural parameters by a matrix factorization. `mu`

is a main effects vector of length `d`

, `B`

is the `d x k`

loadings matrix, and `A`

is the `n x k`

principal component score matrix.

`convexLogisticPCA()`

relaxes the problem of solving for a projection matrix to solving for a matrix in the `k`

-dimensional Fantope, which is the convex hull of rank-`k`

projection matrices. This has the advantage that the global minimum can be obtained efficiently. The disadvantage is that the `k`

-dimensional Fantope solution may have a rank much larger than `k`

, which reduces interpretability. It is also necessary to specify `m`

in this function.

`mu`

is a main effects vector of length `d`

, `H`

is the `d x d`

Fantope matrix, and `U`

is the `d x k`

loadings matrix, which are the first `k`

eigenvectors of `H`

.

Each of the classes has associated methods to make data analysis easier.

`print()`

: Prints a summary of the fitted model.`fitted()`

: Fits the low dimensional matrix of either natural parameters or probabilities.`predict()`

: Predicts the PCs on new data. Can also predict the low dimensional matrix of natural parameters or probabilities on new data.`plot()`

: Either plots the deviance trace, the first two PC loadings, or the first two PC scores using the package`ggplot2`

.

In addition, there are functions for performing cross validation.

`cv.lpca()`

,`cv.lsvd()`

,`cv.clpca()`

: Run cross validation over the rows of the matrix to assess the fit of`m`

and/or`k`

.`plot.cv()`

: Plots the results of the`cv()`

method.

- Changed
`M`

to`m`

in the functions, since that is what it is called in the references - Switched to the
`rARPACK`

package from`irlba`

for partial eigen and singular value decomposition - Fixed incompatibility with updated version of
`testthat`