An R package for non-negative and sparse principal component analysis.
Principal component analysis (PCA) provides a lower dimensional approximation of high-dimensional data, where the reconstruction error (measured using Euclidean distance) is minimal. PCA is therefore useful to succinctly describe a data set which has an “intrinsic” dimensionality that is low, even though the number of features might be high. A classical example is the set of face images: even low-resolution images quickly exceed 1e4 pixels, yet about 50 principal components (PCs) are sufficient to achieve a good approximation of a face image (Kirby and Sirovich, 1990).
Although PCA often provides a good approximation with few PCs, each component is usually a linear combination of all features of the data. Enforcing sparsity of the principal axes (PA) facilitates identification of the relevant features and is therefore an unsupervised feature selection method. In applications where a fixed penalty is associated with each included feature (e.g. transaction costs in finance), a small loss in variance of the PC for a large reduction in cardinality of the PA can lead to an overall more desirable solution. Enforcing non-negativity of the PAs renders PCA applicable to domains where only positive influence of features is deemed appropriate, i.e. the total variance is explained additively. Non-negative solutions often show some degree of sparsity already, but a combination of both constraints enables precise control over the cardinality of the PAs.
This package implements two non-negative and/or sparse PCA algorithms which are rooted in expectation-maximization (EM) for a probabilistic generative model of PCA (Sigg and Buhmann, 2008). The nsprcomp
algorithm can also be described as applying a soft-thresholding operator to the well-known power iteration method for computing eigenvalues.
A small example from the domain of portfolio optimization is provided in this blog post, and a comparison to the arrayspc
algorithm from the elasticnet package can be found here.
In each EM (or power) iteration of the nsprcomp
algorithm, a soft thresholding operator is applied to enforce sparsity of the PA, and projection to the non-negative orthant is applied to enforce non-negativity. Once the EM procedure has converged and the support of the PA has been identified, the non-zero coefficients are recomputed to maximize the variance of the PC. Finally, because EM is a local optimizer, random restarts are employed to (hopefully) avoid bad local minima. The same iterative procedure (without the non-negativity constraint) was later also proposed by Journee et al. (2010).
The nsprcomp
algorithm computes one PC after the other. Because constrained PAs no longer correspond to true eigenvectors of the covariance matrix and are usually not pairwise orthogonal, special attention needs to be paid when computing more than a single PC. The algorithm implements the generalized deflation method proposed by Mackey (2009) to maximize the additional variance of each component. The variance of the PC is maximized after projecting the PA to the ortho-complement of the space spanned by the previous PAs. This procedure maximizes the additional variance not explained by previous components, and is identical to standard PCA if no sparsity or non-negativity constraints are enforced on the PAs.
The nsprcomp
algorithm is suitable for large and high-dimensional data sets, because it entirely avoids computing the covariance matrix. It is therefore especially suited to the case where the number of features exceeds the number of observations.
The nscumcomp
algorithm jointly computes all PCs, such that the cumulative variance of the components is maximized. The computation is again based on EM iterations, but here the maximization step has to be carried out numerically. Non-negativity of the PAs is achieved by enforcing a zero lower bound in the L-BFGS-B algorithm used for performing the maximization step. Sparsity of the PAs is achieved by a subsequent soft-thresholding of the pseudo-rotation matrix (the matrix which has the PAs as its columns). In contrast to sequential PCA algorithms such as nsprcomp
, only the total number k
of non-zero loadings is specified by the user, and the non-zero loadings are distributed over all PAs by the algorithm.
The algorithm penalizes the divergence of the pseudo-rotation matrix from ortho-normality using a Lagrange parameter gamma
. A positive gamma
enforces a minimum angle between the PAs, and is sometimes necessary to avoid PAs collapsing onto each other during the EM procedure.
The nscumcomp
algorithm also scales to large and high-dimensional data sets. But due to the numerical maximization of the M-step, it is computationally more involved than the nsprcomp
algorithm.
The nscumcomp
algorithm is currently unpublished.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(1), 103-108.
Sigg, C. D., & Buhmann, J. M. (2008). Expectation-maximization for sparse and non-negative PCA. In Proceedings of the 25th International Conference on Machine Learning (pp. 960-967).
Journee, M., Nesterov, Y., Richtarik, P., & Sepulchre, R. (2010). Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11, 517-553.
Mackey, L. (2009). Deflation methods for sparse pca. Advances in Neural Information Processing Systems, 21, 1017-1024.