

Motivation

The T-Rex selector performs fast variable/feature selection in large-scale high-dimensional settings. It provably controls the false discovery rate (FDR), i.e., the expected fraction of false positives among all selected variables, at a user-defined target level. At the same time, it achieves a high true positive rate (TPR), i.e., power, by maximizing the number of selected variables subject to this FDR constraint. To do so, it performs terminated-random experiments (T-Rex) using the T-LARS algorithm (R package ‘tlars’) and fuses the selected active sets of all random experiments into a final set of selected variables. The T-Rex selector can be applied in fields such as genomics and financial engineering, or in any other field that requires a fast and FDR-controlling variable/feature selection method for large-scale high-dimensional settings.

\[ \DeclareMathOperator{\FDP}{FDP} \DeclareMathOperator{\FDR}{FDR} \DeclareMathOperator{\TPP}{TPP} \DeclareMathOperator{\TPR}{TPR} \newcommand{\A}{\mathcal{A}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\XWK}{\boldsymbol{\tilde{X}}} \newcommand{\C}{\mathcal{C}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=} \]

Installation

Before installing the ‘TRexSelector’ package, you need to install the required ‘tlars’ package. You can install the ‘tlars’ package from CRAN or GitHub with:

# Install stable version from CRAN
install.packages("tlars")

# Install development version from GitHub
install.packages("devtools")
devtools::install_github("jasinmachkour/tlars")

Then, you can install the ‘TRexSelector’ package with:

devtools::install_github("jasinmachkour/TRexSelector")

You can open the help pages with:

library(TRexSelector)
help(package = "TRexSelector")
?trex
?random_experiments
?lm_dummy
?add_dummies
?add_dummies_GVS
?FDP
?TPP
# etc.

To cite the package ‘TRexSelector’ in publications use:

citation("TRexSelector")

Quick Start

This section illustrates the basic usage of the ‘TRexSelector’ package to perform FDR-controlled variable selection in large-scale high-dimensional settings based on the T-Rex selector.

  1. First, we generate a high-dimensional Gaussian data set with sparse support:
library(TRexSelector)

# Setup
n <- 75 # number of observations
p <- 150 # number of variables
num_act <- 3 # number of true active variables
beta <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # coefficient vector
true_actives <- which(beta > 0) # indices of true active variables
num_dummies <- p # number of dummy predictors (also referred to as dummies)

# Generate Gaussian data
set.seed(123)
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
y <- X %*% beta + stats::rnorm(n)
  2. Second, we perform FDR-controlled variable selection using the T-Rex selector for a target FDR of 5%:
# Seed
set.seed(1234)

# Numerical zero
eps <- .Machine$double.eps

# Variable selection via T-Rex
res <- trex(X = X, y = y, tFDR = 0.05, verbose = FALSE)
selected_var <- which(res$selected_var > eps)
paste0("True active variables: ", paste(as.character(true_actives), collapse = ", "))
#> [1] "True active variables: 1, 2, 3"
paste0("Selected variables: ", paste(as.character(selected_var), collapse = ", "))
#> [1] "Selected variables: 1, 2, 3"

So, for a preset target FDR of 5%, the T-Rex selector has selected all true active variables and there are no false positives in this example.

Note that users have to choose the target FDR according to the requirements of their specific applications.

FDR and TPR

False discovery rate (FDR) and true positive rate (TPR)

We give a mathematical definition of two important metrics in variable selection, i.e., the false discovery rate (FDR) and the true positive rate (TPR):

Definitions (FDR and TPR) Let \(\widehat{\A} \subseteq \lbrace 1, \ldots, p \rbrace\) be the set of selected variables, \(\A \subseteq \lbrace 1, \ldots, p \rbrace\) the set of true active variables, \(| \widehat{\A} |\) the cardinality of \(\widehat{\A}\), and define \(1 \lor a \coloneqq \max\lbrace 1, a \rbrace\), \(a \in \mathbb{R}\). Then, the false discovery rate (FDR) and the true positive rate (TPR) are defined by \[ \FDR \coloneqq \mathbb{E} \big[ \FDP \big] \coloneqq \mathbb{E} \left[ \dfrac{\big| \widehat{\A} \backslash \A \big|}{1 \lor \big| \widehat{\A} \big|} \right] \] and

\[ \TPR \coloneqq \mathbb{E} \big[ \TPP \big] \coloneqq \mathbb{E} \left[ \dfrac{| \A \cap \widehat{\A} |}{1 \lor | \A |} \right], \] respectively. Ideally, \(\FDR = 0\) and \(\TPR = 1\). In practice, this is not always achievable. Therefore, the FDR is controlled at a sufficiently low level, while the TPR is maximized.
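As a small worked illustration of the two proportions inside the expectations, the following sketch computes the FDP and TPP of a single selection in base R. The helper names `fdp` and `tpp` and the toy index sets are assumptions made for this example only; they are not the package's own `FDP` and `TPP` functions.

```r
# Hypothetical helpers (not part of 'TRexSelector'):
# FDP and TPP of one selected set, following the definitions above.
fdp <- function(selected, true_actives) {
  # false positives divided by max(1, number of selected variables)
  length(setdiff(selected, true_actives)) / max(1, length(selected))
}

tpp <- function(selected, true_actives) {
  # true positives divided by max(1, number of true active variables)
  length(intersect(selected, true_actives)) / max(1, length(true_actives))
}

# Toy example: variables 1, 2, 3 are truly active; 1, 2, 5 were selected
true_actives <- c(1, 2, 3)
selected <- c(1, 2, 5)

fdp_value <- fdp(selected, true_actives) # one false positive among three selections
tpp_value <- tpp(selected, true_actives) # two of three true actives recovered
```

Here the FDP is 1/3 and the TPP is 2/3. Averaging such values over many independent data sets gives estimates of the FDR and the TPR.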

Simulations

Let us have a look at the behavior of the T-Rex selector for different choices of the target FDR. We conduct Monte Carlo simulations and plot the resulting averaged false discovery proportions (FDP) and true positive proportions (TPP) over the target FDR. Note that the averaged FDP and TPP are estimates of the FDR and TPR, respectively:

# Computations might take up to 10 minutes... Please wait... 

# Numerical zero
eps <- .Machine$double.eps

# Seed
set.seed(1234)

# Setup
n <- 100 # number of observations
p <- 150 # number of variables

# Parameters
num_act <- 10 # number of true active variables
beta <- rep(0, times = p) # coefficient vector (all zeros first)
beta[sample(seq(p), size = num_act, replace = FALSE)] <- 1 # coefficient vector (active variables with non-zero coefficients)
true_actives <- which(beta > 0) # indices of true active variables
tFDR_vec <- c(0.1, 0.15, 0.2, 0.25) # target FDR levels
MC <- 100 # number of Monte Carlo runs per target FDR level

# Initialize results matrices (rows: Monte Carlo runs, columns: target FDR levels)
FDP <- matrix(NA, nrow = MC, ncol = length(tFDR_vec))
TPP <- matrix(NA, nrow = MC, ncol = length(tFDR_vec))

# Run simulations
for (t in seq_along(tFDR_vec)) {
  for (mc in seq(MC)) {
    
    # Generate Gaussian data
    X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
    y <- X %*% beta + stats::rnorm(n)
    
    # Run T-Rex selector
    res <- trex(X = X, y = y, tFDR = tFDR_vec[t], verbose = FALSE)
    selected_var <- which(res$selected_var > eps)
    
    # Results
    FDP[mc, t] <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var))
    TPP[mc, t] <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives))
  }
}

# Compute estimates of FDR and TPR by averaging FDP and TPP over MC Monte Carlo runs
FDR <- colMeans(FDP)
TPR <- colMeans(TPP)

We observe that the T-Rex selector controls the FDR at all considered target levels (the green line stays below the red dashed reference line, i.e., the maximum allowed value for the FDR). For more details and discussions, we refer the interested reader to the T-Rex paper [1].
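A plot of the kind described here can be produced with base R graphics. The following is a minimal sketch, not the plotting code used for the original figure; the helper name `plot_fdr_tpr` and the layout are assumptions.

```r
# Hypothetical helper: plot estimated FDR and TPR against the target FDR
plot_fdr_tpr <- function(tFDR_vec, FDR, TPR) {
  # Two panels side by side; restore graphics settings on exit
  old_par <- graphics::par(mfrow = c(1, 2))
  on.exit(graphics::par(old_par))

  # Estimated FDR (green) with the target FDR as reference (red, dashed)
  graphics::plot(tFDR_vec, FDR, type = "b", col = "darkgreen",
                 ylim = c(0, max(tFDR_vec, FDR)),
                 xlab = "Target FDR", ylab = "Estimated FDR")
  graphics::abline(a = 0, b = 1, col = "red", lty = 2)

  # Estimated TPR
  graphics::plot(tFDR_vec, TPR, type = "b", col = "darkgreen",
                 ylim = c(0, 1),
                 xlab = "Target FDR", ylab = "Estimated TPR")

  # Invisibly report for which target levels the estimate meets the target
  invisible(FDR <= tFDR_vec)
}
```

After the simulation has finished, calling `plot_fdr_tpr(tFDR_vec, FDR, TPR)` draws both panels. The invisible return value flags, for each target level, whether the estimated FDR lies on or below the reference line.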

The T-Rex Framework

The general steps that define the framework are illustrated in Figure 1. The key idea is to design randomized controlled experiments where fake variables, so-called dummies, act as a negative control group in the variable selection process.

Figure 1: Simplified overview of the T-Rex framework.


Within the framework, a total of \(K\) random experiments with independently generated dummy matrices are conducted. Figure 2 shows the structure of the enlarged predictor matrix. Without loss of generality, true active variables (green), non-active (null) variables (red), and dummies (yellow) are illustrated as blocks within the predictor matrix. This is only for visualization purposes; in practice, the active and null variables are interspersed.

In the random experiments, the dummy variables (yellow) compete with the given input variables in \(\X\) (green and red) to be included by a forward variable selection method, such as the LARS algorithm [2], the Lasso [3], or the elastic net [4]. In each random experiment, the solution path is terminated early, as soon as a pre-defined number \(T\) of dummies has been included in the model. This results in the \(K\) candidate sets \(\C_{1, L}(T), \ldots, \C_{K, L}(T)\), where \(L\) denotes the number of dummies. The early stopping drastically reduces the computation time for sparse problems, where continuing the forward selection beyond some point only adds more null variables.

Finally, a voting scheme is applied to the candidate sets, which yields the final active set \(\widehat{\A}_{L}(v^{*}, T^{*})\). As detailed in [1], the calibration process determines the optimal voting level \(v^{*}\) and the optimal number of included dummies \(T^{*}\) after which the forward selection process is terminated, such that the FDR is controlled at the user-defined level \(\alpha\) while the TPR is maximized.
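To make the voting step more concrete, the following sketch fuses a few candidate sets by their relative occurrence across experiments. It assumes a fixed, user-supplied voting level `v` and does not perform the calibration of \(v^{*}\) and \(T^{*}\) described in [1]; the helper name `vote_select` and the toy candidate sets are hypothetical.

```r
# Hypothetical helper: fuse K candidate sets by relative occurrence
vote_select <- function(candidate_sets, p, v) {
  K <- length(candidate_sets)
  # Count in how many of the K experiments each of the p variables was selected
  counts <- tabulate(unlist(lapply(candidate_sets, unique)), nbins = p)
  rel_occurrence <- counts / K
  # Keep the variables whose relative occurrence exceeds the voting level v
  which(rel_occurrence > v)
}

# Toy example with K = 4 candidate sets over p = 10 variables
candidate_sets <- list(c(1, 2, 3), c(1, 2, 7), c(1, 3), c(1, 2, 3, 9))
selected <- vote_select(candidate_sets, p = 10, v = 0.5)
```

In this toy example, variables 1, 2, and 3 appear in more than half of the candidate sets and are kept, whereas variables 7 and 9 appear only once and are discarded.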

Figure 2: The enlarged predictor matrix (predictor matrix with dummies).
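The structure of the enlarged predictor matrix can be sketched in a few lines of base R. The sketch below assumes that the dummies are drawn i.i.d. from a standard normal distribution and are appended to the right of the original predictors; the dimensions and variable names are illustrative choices, and the package's own `add_dummies` function is not used here.

```r
set.seed(1)

n <- 5 # number of observations
p <- 3 # number of original variables
L <- 3 # number of dummies (illustrative choice)

# Original predictor matrix
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)

# Dummy matrix, generated independently of X (assumed standard normal)
dummies <- matrix(stats::rnorm(n * L), nrow = n, ncol = L)

# Enlarged predictor matrix: original columns first, dummies appended
X_enlarged <- cbind(X, dummies)

# Column indices of the dummies within the enlarged matrix
dummy_idx <- seq(p + 1, p + L)
```

Keeping track of `dummy_idx` is what allows a forward selection method to count how many dummies have entered the model and to stop once that count reaches \(T\).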


For a more detailed description of Figures 1 and 2 and more details on the T-Rex selector in general, we refer the interested reader to the original paper [1].

References

[1] Machkour, J., Muma, M. and Palomar, D. P. (2021). The Terminating-Knockoff filter: Fast high-dimensional variable selection with false discovery rate control. arXiv preprint arXiv:2110.06048.
[2] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–99.
[3] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B. Stat. Methodol. 58 267–88.
[4] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–20.