Supervised Principal Components for regression and survival analysis.
### Description
Performs prediction for a censored survival outcome or a regression outcome, using the “supervised principal components” approach (Bair et al., 2006). superpc is especially useful for high-dimensional data in which the number of features p greatly exceeds the number of samples n (the p >> n paradigm), as generated, for instance, by high-throughput technologies.
### Details
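Supervised principal components is a generalization of principal components regression. The first (or first few) principal components are the linear combinations of the features that capture the directions of largest variation in a dataset. These directions may or may not be related to an outcome variable of interest. To find linear combinations that are related to the outcome, we compute a univariate score for each feature and retain only those features whose score exceeds a threshold. A principal components analysis is then carried out using only the data from the selected features. Finally, these “supervised principal components” are used in a regression model to predict the outcome. To summarize, the steps are:

1. Compute univariate regression coefficients (scores) for each feature.
2. Form a reduced data matrix consisting of only those features whose score exceeds a threshold in absolute value (the threshold is estimated by cross-validation).
3. Compute the first (or first few) principal components of the reduced data matrix.
4. Use these principal components in a regression model to predict the outcome.
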
This idea can be used in standard regression problems with a quantitative outcome, and also in generalized regression problems such as survival analysis. In the latter problem, the regression coefficients in step (1) are obtained from a proportional hazards model. The superpc R package handles these two cases: standard regression and survival data.
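
The screening, principal components, and regression steps described above can be sketched in base R for a quantitative outcome. This is a minimal illustration on simulated data, not the code used inside superpc: the simulated matrix `x`, the names `score`, `theta`, `keep`, and `pc1`, and the fixed 90th-percentile threshold are all assumptions made for the example. In practice the threshold would be chosen by cross-validation. Note that this sketch stores samples in rows for convenience.

```r
# Sketch of supervised principal components for a quantitative outcome.
# Base R only; an illustration under stated assumptions, not superpc internals.
set.seed(1)
n <- 50                                   # samples
p <- 200                                  # features (p >> n)

# Simulated data (hypothetical): rows are samples, columns are features.
latent <- rnorm(n)
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
x[, 1:10] <- x[, 1:10] + 2 * latent       # only the first 10 features carry signal
y <- latent + rnorm(n, sd = 0.5)

# Step 1: univariate score per feature (standardized regression coefficient).
xc <- scale(x, center = TRUE, scale = FALSE)
yc <- y - mean(y)
score <- as.vector(crossprod(xc, yc)) / sqrt(colSums(xc^2))

# Step 2: keep features whose absolute score exceeds a threshold.
# The 90th percentile is an arbitrary choice for this sketch.
theta <- quantile(abs(score), 0.9)
keep <- which(abs(score) > theta)

# Step 3: first principal component of the reduced data matrix.
sv <- svd(xc[, keep, drop = FALSE])
pc1 <- sv$u[, 1] * sv$d[1]

# Step 4: regress the outcome on the supervised principal component.
fit <- lm(y ~ pc1)
print(summary(fit)$r.squared)
```

For a survival outcome, step 1 would instead use scores from a proportional hazards model, as noted above; that variant is not shown here.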
There is one more important point: the features (e.g., genes) that are important in the prediction are not necessarily the ones that passed the screen in step 2. Other features may have as high a correlation with the supervised PC predictor. We therefore compute an importance score for each feature, equal to its correlation with the supervised PC predictor. A reduced predictor is formed by soft-thresholding the importance scores and using these shrunken scores as weights. Soft-thresholding sets the weights of some features to zero, thereby removing them from the model. The amount of shrinkage is determined by cross-validation. The reduced predictor often performs as well as or better than the supervised PC predictor, and is more interpretable.
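
The importance scores and soft-thresholded weights described above can be illustrated in base R. This is a sketch on simulated data under stated assumptions, not the superpc implementation: `soft_threshold()` is a hypothetical helper, and the screening cutoff `0.5` and the shrinkage amount `lambda <- 0.4` are arbitrary values, whereas the amount of shrinkage would normally be determined by cross-validation.

```r
# Sketch of importance scores and a soft-thresholded reduced predictor.
# Base R only; names and tuning values are assumptions for illustration.
set.seed(2)
n <- 40                                   # samples (rows)
p <- 100                                  # features (columns)
latent <- rnorm(n)
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
x[, 1:8] <- x[, 1:8] + 2 * latent         # first 8 features carry signal
y <- latent + rnorm(n, sd = 0.5)

# Supervised PC predictor built from screened features (arbitrary cutoff).
xc <- scale(x, center = TRUE, scale = FALSE)
score <- abs(as.vector(cor(xc, y)))
keep <- which(score > 0.5)
pc1 <- prcomp(xc[, keep, drop = FALSE])$x[, 1]

# Importance score: correlation of every feature with the supervised PC.
importance <- as.vector(cor(xc, pc1))

# Soft-thresholding shrinks each score toward zero by lambda and
# sets scores smaller than lambda in absolute value exactly to zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

lambda <- 0.4                             # arbitrary shrinkage amount
w <- soft_threshold(importance, lambda)

# Reduced predictor: features weighted by their shrunken importance scores.
reduced <- as.vector(xc %*% w)
print(sum(w != 0))                        # number of features left in the model
```

Features with a zero weight drop out of the reduced predictor entirely, which is what makes it easier to interpret than the full supervised PC predictor.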
### Branches
This branch (master) is the default branch; it hosts the current development release (version 1.12).
### License
Package superpc is open source / free software, licensed under the GNU General Public License version 3 (GPLv3), published by the Free Software Foundation. To view a copy of this license, see the GNU General Public License, version 3.
### Downloads
CRAN downloads, as tracked by the RStudio CRAN mirror, are reported for three periods: since the initial release to CRAN (2004-09-16), the last month, and the last week.
### Requirements
superpc (>= 1.12) requires R 3.5.0 (2018-04-23) or later. It was built and tested under R version 4.0.3 (2020-10-10) and on Travis CI.
Installation has been tested on Windows, Linux, OS X, and Solaris platforms.
### Installation
To install superpc, simply download and install the current version (1.12) from the CRAN repository:

    install.packages("superpc")

Alternatively, install superpc from the GitHub repository:

    install.packages("devtools")
    library("devtools")
    devtools::install_github("jedazard/superpc")

### Usage

    library("superpc")     # load the package into the R session
    superpc.news()         # display the package news
    citation("superpc")    # show how to cite the package

etc…
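
A fuller session might look like the sketch below, which trains, cross-validates, and predicts on simulated data with a quantitative outcome. The function and argument names (`superpc.train`, `superpc.cv`, `superpc.plotcv`, `superpc.predict`, `threshold`, `n.components`, `prediction.type`) are assumed from the package help pages and should be verified with `?superpc.train` and related pages. The simulated data, the helper `make_data()`, and the threshold value `1.0` are hypothetical. The calls are guarded so that the snippet also runs where the package is not installed.

```r
# Hypothetical session on simulated data; verify argument names against the
# package help pages (e.g. ?superpc.train) before relying on this sketch.
set.seed(332)
p <- 500                                   # features
n <- 40                                    # samples

# superpc expects features in rows and samples in columns (p x n).
make_data <- function() {
  x <- matrix(rnorm(p * n), nrow = p, ncol = n)
  y <- 10 + svd(x[1:60, ])$v[, 1] + 0.1 * rnorm(n)
  list(x = x, y = y, featurenames = paste0("feature", seq_len(p)))
}
data <- make_data()                        # training set
data.test <- make_data()                   # test set

if (requireNamespace("superpc", quietly = TRUE)) {
  library("superpc")
  fit <- superpc.train(data, type = "regression")   # univariate feature scores
  cv <- superpc.cv(fit, data)                       # cross-validate the threshold
  superpc.plotcv(cv)                                # plot the cross-validation curves
  pred <- superpc.predict(fit, data, data.test,
                          threshold = 1.0,          # arbitrary value for this sketch
                          n.components = 1,
                          prediction.type = "continuous")
  print(head(pred$v.pred))                          # supervised PC values, test samples
}
```

For survival data, the data list would also carry a `censoring.status` vector and the model would be trained with `type = "survival"`.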
### Website - Wiki
### Acknowledgments
Authors:
+ Eric Bair, Ph.D. ebair@email.unc.edu
+ Trevor Hastie, Ph.D. hastie@stanford.edu
+ Debashis Paul, Ph.D. debpaul@ucdavis.edu
+ Robert Tibshirani, Ph.D. tibs@stanford.edu

Maintainers:
+ Jean-Eudes Dazard, Ph.D. jean-eudes.dazard@case.edu

Funding/Provision/Help:
+ This work made use of the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Case Western Reserve University.
+ Eric Bair was supported by an NSF graduate research fellowship. Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183. Hastie was supported in part by National Science Foundation grant DMS-02-04612 and National Institutes of Health grant R01 CA 72028-07.

### References
+ Bair E. and Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol. (2004), 2(4):e108.
+ Bair E., Hastie T., Paul D., and Tibshirani R. Prediction by supervised principal components. J. Am. Stat. Assoc. (2006), 101(473):119-137.