Introduction to aorsf

Accelerated ORSF

The purpose of aorsf (‘a’ is short for accelerated) is to provide routines to fit ORSFs that will scale adequately to large data sets. The fastest algorithm available in the package is the accelerated ORSF model, which is the default method used by orsf():


library(aorsf)

set.seed(329)

orsf_fit <- orsf(data = pbc_orsf, 
                 formula = Surv(time, status) ~ . - id)

orsf_fit
#> ---------- Oblique random survival forest
#> 
#>      Linear combinations: Accelerated
#>           N observations: 276
#>                 N events: 111
#>                  N trees: 500
#>       N predictors total: 17
#>    N predictors per node: 5
#>  Average leaves per tree: 24
#> Min observations in leaf: 5
#>       Min events in leaf: 1
#>           OOB stat value: 0.84
#>            OOB stat type: Harrell's C-statistic
#>      Variable importance: anova
#> 
#> -----------------------------------------

You may notice that the first input of orsf() is data. This design choice makes it easier to use orsf() with pipes (i.e., %>% or |>). For instance,

library(dplyr)

orsf_fit <- pbc_orsf |> 
 select(-id) |> 
 orsf(formula = Surv(time, status) ~ .)

Interpretation

aorsf includes several functions dedicated to interpretation of ORSFs, both through estimation of partial dependence and variable importance.

Variable importance

aorsf provides multiple ways to compute variable importance.

To compute negation importance for a variable, ORSF multiplies each coefficient of that variable by -1 and then re-computes the out-of-sample (sometimes referred to as out-of-bag) accuracy of the ORSF model. The larger the drop in accuracy after negation, the more important the variable.


orsf_vi_negate(orsf_fit)
#>          bili           age       protime        copper       ascites 
#>  0.0143779954  0.0096374245  0.0087518233  0.0061992082  0.0056261721 
#>         stage           sex           ast         edema       albumin 
#>  0.0050531361  0.0047926651  0.0034903105  0.0030512309  0.0023963326 
#>       spiders          trig          chol        hepato      platelet 
#>  0.0009376954 -0.0008856012 -0.0014065430 -0.0024484268 -0.0028130861 
#>           trt      alk.phos 
#> -0.0029172744 -0.0033861221
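The negation idea can be sketched in base R with a single linear combination instead of a full forest. This is only an illustration under stated assumptions: the simulated data, the `c_stat()` helper, and the use of `lm()` to get coefficients are all hypothetical stand-ins, and accuracy is computed in-sample rather than out-of-bag. It is not how `orsf_vi_negate()` is implemented.

```r
set.seed(1)

# simulated data (assumption): x1 matters a lot, x2 a little, x3 not at all
n <- 500
x <- matrix(rnorm(n * 3), ncol = 3,
            dimnames = list(NULL, c("x1", "x2", "x3")))
y <- 2 * x[, "x1"] + 0.5 * x[, "x2"] + rnorm(n)

# one linear combination of predictors, standing in for an oblique split
beta <- coef(lm(y ~ x))[-1]
names(beta) <- colnames(x)

# hypothetical helper: concordance between a score and a continuous outcome,
# derived from Kendall's tau (valid here because there are no ties)
c_stat <- function(score, outcome) {
  (cor(score, outcome, method = "kendall") + 1) / 2
}

baseline <- c_stat(drop(x %*% beta), y)

# negate one coefficient at a time and record the drop in concordance
vi_negate <- sapply(colnames(x), function(v) {
  beta_neg <- beta
  beta_neg[v] <- -beta_neg[v]
  baseline - c_stat(drop(x %*% beta_neg), y)
})

sort(vi_negate, decreasing = TRUE)
```

In this sketch, flipping the sign of a coefficient that carries a lot of signal reverses much of the score's ordering, so concordance falls sharply, while flipping a near-zero coefficient changes almost nothing.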

You can also compute variable importance using permutation, a more classical approach.


orsf_vi_permute(orsf_fit)
#>          bili           age         stage       protime       albumin 
#>  0.0106793082  0.0091685768  0.0037507814  0.0035424047  0.0029693686 
#>       ascites         edema       spiders          chol           ast 
#>  0.0028130861  0.0021296599  0.0018753907  0.0016670140  0.0015107314 
#>        copper      platelet          trig           sex        hepato 
#>  0.0003125651  0.0002083767 -0.0007293186 -0.0010418837 -0.0011460721 
#>           trt 
#> -0.0026047093

A faster alternative to permutation and negation importance is ANOVA importance, which computes the proportion of times each variable obtains a low p-value (p < 0.01) while the forest is grown.


orsf_vi_anova(orsf_fit)
#>    ascites       bili      edema     copper        age    albumin    protime 
#> 0.36650652 0.28964613 0.25605884 0.20384514 0.18876042 0.17611863 0.15949300 
#>      stage       chol        ast        sex     hepato    spiders       trig 
#> 0.15136967 0.14789292 0.13086093 0.11822315 0.11744654 0.11611076 0.10882902 
#>   alk.phos   platelet        trt 
#> 0.09948586 0.08489136 0.07035033
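The counting behind this kind of importance can be sketched in base R. The example below is an illustration under stated assumptions, not the aorsf implementation: random subsets of rows stand in for tree nodes, `lm()` stands in for the regression fit in each node, and all object names are hypothetical.

```r
set.seed(4)

# simulated data (assumption): x1 is strong, x2 is weak, x3 and x4 are noise
n <- 400
dat <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))
names(dat) <- c("x1", "x2", "x3", "x4")
dat$y <- 1.5 * dat$x1 + 0.3 * dat$x2 + rnorm(n)

predictors <- c("x1", "x2", "x3", "x4")
n_node <- 200  # number of simulated "nodes"
mtry   <- 2    # predictors considered per node

# counters: how often each variable was used, and how often it had p < 0.01
n_used <- setNames(numeric(length(predictors)), predictors)
n_low  <- setNames(numeric(length(predictors)), predictors)

for (i in seq_len(n_node)) {
  rows <- sample(n, size = 100)            # observations in this "node"
  vars <- sample(predictors, size = mtry)  # predictors tried in this "node"
  node <- dat[rows, c("y", vars)]
  fit  <- lm(y ~ ., data = node)
  pvals <- summary(fit)$coefficients[vars, "Pr(>|t|)"]
  n_used[vars] <- n_used[vars] + 1
  n_low[vars]  <- n_low[vars] + (pvals < 0.01)
}

# proportion of uses in which each variable obtained a low p-value
vi_anova <- n_low / n_used
sort(vi_anova, decreasing = TRUE)
```

Because the p-values are a by-product of fitting each node, this tally needs no extra predictions, which is one way to see why the approach is cheaper than permutation or negation.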

Partial dependence (PD)

Partial dependence (PD) shows the expected prediction from a model as a function of a single predictor or multiple predictors. The expectation is marginalized over the values of all other predictors, giving something like a multivariable adjusted estimate of the model’s prediction.

For more on PD, see the vignette
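To make the marginalization concrete, here is a minimal base R sketch of PD for one predictor. It assumes simulated data and an `lm()` fit as a stand-in for a forest; the object names are hypothetical and this is not the aorsf partial dependence interface.

```r
set.seed(2)

# simulated data (assumption): y depends on x1 quadratically and on x2 linearly
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1^2 + dat$x2 + rnorm(n, sd = 0.5)

# any model with a predict() method would do; lm() keeps the sketch simple
fit <- lm(y ~ x1 + I(x1^2) + x2, data = dat)

# values of x1 at which to evaluate partial dependence
grid <- seq(-2, 2, by = 1)

pd <- sapply(grid, function(value) {
  dat_mod <- dat
  dat_mod$x1 <- value                    # set x1 to the same value in every row
  mean(predict(fit, newdata = dat_mod))  # average over the observed x2 values
})

data.frame(x1 = grid, pd = pd)
```

Each PD value is an average over all rows of the data, with only the predictor of interest overwritten, which is what gives it the flavor of a multivariable adjusted estimate.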

Individual conditional expectations (ICE)

Unlike partial dependence, which shows the expected prediction as a function of one or multiple predictors, individual conditional expectations (ICE) show the prediction for an individual observation as a function of a predictor.

For more on ICE, see the vignette
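A minimal base R sketch of ICE follows, again assuming simulated data and an `lm()` fit in place of a forest; the names are hypothetical and this is not the aorsf ICE interface. Each row of the resulting matrix is one observation's curve, and averaging the curves column by column gives the partial dependence.

```r
set.seed(3)

# simulated data (assumption): the effect of x1 depends on x2 (an interaction)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- dat$x1 * dat$x2 + rnorm(n, sd = 0.5)

fit <- lm(y ~ x1 * x2, data = dat)

# values of x1 at which to evaluate each observation's prediction
grid <- c(-1, 0, 1)

# rows = observations, columns = grid values of x1
ice <- sapply(grid, function(value) {
  dat_mod <- dat
  dat_mod$x1 <- value
  predict(fit, newdata = dat_mod)  # one prediction per observation, not averaged
})
colnames(ice) <- paste0("x1=", grid)

head(ice, 3)

# averaging the individual curves gives partial dependence
pd <- colMeans(ice)
pd
```

With an interaction like this one, individual curves can slope in opposite directions even when their average is nearly flat, which is the kind of heterogeneity ICE is meant to reveal.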

What about the original ORSF?

The original ORSF (i.e., obliqueRSF) used glmnet to find linear combinations of inputs. aorsf allows users to implement this approach using the orsf_control_net() function:


orsf_net <- orsf(data = pbc_orsf, 
                 formula = Surv(time, status) ~ . - id, 
                 control = orsf_control_net(),
                 n_tree = 50)

Forests fit with orsf_control_net() (net forests) fit a lot faster than the original ORSF function in obliqueRSF. However, net forests are still much slower than forests fit with orsf_control_cph() (cph forests):


# tracking how long it takes to fit 50 glmnet trees
print(
 t1 <- system.time(
  orsf(data = pbc_orsf, 
       formula = Surv(time, status) ~ . - id, 
       control = orsf_control_net(),
       n_tree = 50)
 )
)
#>    user  system elapsed 
#>    2.45    0.00    2.46

# and how long it takes to fit 50 cph trees
print(
 t2 <- system.time(
  orsf(data = pbc_orsf, 
       formula = Surv(time, status) ~ . - id, 
       control = orsf_control_cph(),
       n_tree = 50)
 )
)
#>    user  system elapsed 
#>    0.05    0.00    0.05

t1['elapsed'] / t2['elapsed']
#> elapsed 
#>    49.2

aorsf and other machine learning software

The unique feature of aorsf is its fast algorithms to fit ORSF ensembles. RLT and obliqueRSF both fit oblique random survival forests, but aorsf does so faster. ranger and randomForestSRC fit survival forests, but neither package supports oblique splitting. obliqueRF fits oblique random forests for classification and regression, but not survival. PPforest fits oblique random forests for classification but not survival.

Note: The default prediction behavior for aorsf models is to produce predicted risk at a specific prediction horizon, which is not the default for ranger or randomForestSRC. I think this will change in the future, as computing time-independent predictions with aorsf could be helpful.