library(oscar)
OSCAR (Optimal Subset CArdinality Regression) is an R-package delivering L0-quasinorm penalized regression models for Cox, logistic and gaussian model families. It comes with two different optimizers for the computationally demanding optimization task (DBDC and LMBM), and these have been implemented in Fortran. The R package handles all interaction with Fortran and provides many convenience and visualization functions to assist with L0-penalized regression modelling. This includes for example cross-validation, bootstrapping, and other diagnostics used to inspect an OSCAR model and assessing importance of features.
Please see Halkola et al. [1] for citing OSCAR. The primary solver for the DC-decomposition transforming the L0-pseudonorm optimization problem from discrete to continuous non-convex space is done via DBDC, which is described in detail on Joki et al. [2], although an alternate optimization method for larger problems is offered via LMBM.
A simulated real-world hospital cohort data with patient survival with prostate cancer (PCa) is provided with:
data(ex)
1:7,1:7] ex_X[
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE HB
## MEDISIMU1 29.39 175 90 142 77 68 10.9
## MEDISIMU2 19.37 176 60 107 77 58 13.3
## MEDISIMU3 23.88 165 65 142 77 71 11.8
## MEDISIMU4 34.54 176 107 126 90 71 13.1
## MEDISIMU5 20.65 188 73 142 77 71 15.3
## MEDISIMU6 28.41 174 86 188 77 88 12.8
## MEDISIMU7 29.63 180 96 142 77 81 14.1
head(ex_Y)
## Time Event
## MEDISIMU1 89 0
## MEDISIMU2 754 1
## MEDISIMU3 783 1
## MEDISIMU4 159 0
## MEDISIMU5 1322 0
## MEDISIMU6 200 1
For further details regarding this dataset and its generation, see reference Laajala et at. [3]
One key feature in OSCAR in addition to its L0 norm penalization is its capability to handle variables as groups, or ‘kits’. This grouping of variables causes them to be selected (or omitted) together. In many clinical applications this is often reality; for example, curating initial data for variables together can occur in cases such as:
In the example data from Turku University Hospital (TYKS), a series of clinical variables are run together. For example a standard blood panel includes red blood cell count, hematocrite, white blood cell count, etc. Additionally, single variables can be run separately; a very common single marker for PCa is Prostate-Specific Antigen (PSA).
The kit structure for such variables that can be obtained together are modelled using a kit indicator matrix:
1:7,1:7] ex_K[
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE HB
## BaseReport 1 1 1 1 1 1 0
## B-PVKT 0 0 0 0 0 0 1
## TKD 0 0 0 0 0 0 0
## cB-Het-Ion 0 0 0 0 0 0 0
## Pt-GFReEPI 0 0 0 0 0 0 0
## P-LD 0 0 0 0 0 0 0
## P-ASAT 0 0 0 0 0 0 0
apply(ex_K, MARGIN=1, FUN=sum) # Row indicator sums
## BaseReport B-PVKT TKD cB-Het-Ion Pt-GFReEPI P-LD P-ASAT
## 6 5 8 1 1 1 1
## P-Alb P-AFOS P-PSA P-Krea Pt-Krea-Cl B-Lymf P-Pi
## 1 1 1 1 1 1 1
## P-Ca P-ALAT S -Prot B-Eos S-Testo P-Gluk P-Mg
## 1 1 1 1 1 1 1
## P-Bil
## 1
apply(ex_K, MARGIN=2, FUN=sum) # Column indicator sums
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE
## 1 1 1 1 1 1
## HB HEMAT RBC WBC PLT LYMperLEU
## 1 1 1 1 1 1
## MONOperLEU NEU POT MONO BASOperLEU EOSperLEU
## 1 1 1 1 1 1
## NEUperLEU NA. CCRC LDH AST ALB
## 1 1 1 1 1 1
## ALP PSA CREAT CREACL LYM PHOS
## 1 1 1 1 1 1
## CA ALT TPRO EOS TESTO GLU
## 1 1 1 1 1 1
## MG TBILI
## 1 1
If the kit matrix is a diagonal indicator matrix with dimensions equal to the number of variables, each variable is L0-penalized without any grouping structure.
Further, these kits can be assigned a ‘price’. This may be a price literally, or a descriptive value indicating how hard the variable is to obtain. With this information, a clinically feasible pareto-front can be constructed, that simultaneously captures modelling generalization as well as clinical applicability.
A cost vector can be provided for the fitting procedure, and a standardized arbitrary unit cost vector is provided in:
head(ex_c)
## BaseReport B-PVKT TKD cB-Het-Ion Pt-GFReEPI P-LD
## 0.0 2.4 4.8 5.7 1.0 1.1
OSCAR comes with two different optimizers:
These are provided to the fitting procedure with solver=1
and solver=2
or with their abbreviations. They have been implemented in Fortran for computational efficiency.
A quick example run of OSCAR with LMBM together with the default kit structure for Cox regression:
set.seed(1)
oscar(x=ex_X, y=ex_Y, k=ex_K, w=ex_c, family="cox", solver="LMBM")
fit <- fit
A pareto-optimal front provides information on potential optimal saddle points, where model generalization and clinical applicability pair nicely:
oscar.pareto.visu(fit=fit)
Model convergence in target function is important, while an another axis can be plotted simultaneously to provide supporting information:
oscar.visu(fit, y=c("target", "cost"))
Fit model object can be inspected based on visual diagnostics such as cross-validation and bootstrapping:
# Perform 5-fold cross-validation to find out optimal k
oscar.cv(fit, fold=5, seed=123, solver=2)
cv <-# Visualize model generalization performance as a function of k
oscar.cv.visu(cv)
It appears that k-values around 5 or above present a shoulder-point saturate the model, with no further coefficients contributing to the model’s generalization capability.
Further, there is some uncertainty as to what order the coefficients should be non-zero, based on bootstrapping:
# Bootstrap original data 20 times (sampling with replacement and refitting)
oscar.bs(fit, bootstrap=20, seed=234, solver=2)
bs <-# Visualize bootstrapped models
oscar.bs.plot(fit=fit, bs=bs, nbins=20)
Based on the bootstrapping, PSA is a clear winner. Close second candidate for survival prediction is Alkaline Phosphatase (ALP). After these single markers the bootstrapping was quite prone to choosing separate kits between immunomarkers and standard bloodwork, with no clear consistent winner.
Useful commands for understanding the model and the oscar S4-object:
coef(fit, k=3) # All potential coefficients at cardinality k=3
## BMI HEIGHTBL WEIGHTBL SYSTOLICBP DIASTOLICBP PULSE
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## HB HEMAT RBC WBC PLT LYMperLEU
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## MONOperLEU NEU POT MONO BASOperLEU EOSperLEU
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## NEUperLEU NA. CCRC LDH AST ALB
## 0.0000000 0.0000000 -6.7487158 0.0000000 0.0000000 0.0000000
## ALP PSA CREAT CREACL LYM PHOS
## 0.6806596 0.4550213 0.0000000 0.0000000 0.0000000 0.0000000
## CA ALT TPRO EOS TESTO GLU
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## MG TBILI
## 0.0000000 0.0000000
feat(fit, k=3) # All features chosen at cardinality k=3
## CCRC ALP PSA
## -6.7487158 0.6806596 0.4550213
cost(fit, k=3) # Kit sum costs at various k cardinalities, here with cardinality k=3
## [1] 7.7
While oscar
-package was originally designed with biomedical survival data in mind, it naturally supports modelling of non-survival data. Currently, the family
parameter supports logistic regression for binary outcomes, and mse/gaussian/normal for modelling responses that are normally distributed.
# Use example swiss-data for quickness
data(swiss)
oscar(x=swiss[,-1], y=swiss[,1], family="mse", print=0, solver=2)
fit_swiss <-
# Plot model coefficients as a function of cardinality k
plot(fit_swiss)
Similarly bootstrapping and cross-validation:
# Bootstrap original data 50 times (sampling with replacement and refitting)
oscar.bs(fit_swiss, bootstrap=50, seed=1, solver=2)
bs_swiss <-# Visualize trajectories of bootstrapped coefficients
oscar.bs.plot(fit=fit_swiss, bs=bs_swiss, nbins=50)
# Perform 5-fold cross-validation to find out optimal k
oscar.cv(fit_swiss, fold=10, seed=2, solver=2)
cv_swiss <-# Visualize model generalization performance as a function of k
oscar.cv.visu(cv_swiss)
1: Halkola AS, Joki K, Mirtti T, Makela MM, Aittokallio T, Laajala, TD. OSCAR: Optimal subset cardinality regression using the L0-pseudonorm. To be submitted.
2: Joki K, Bagirov AM, Karmitsa N, Makela MM, Taheri S. Double Bundle Method for finding Clarke Stationary Points in Nonsmooth DC Programming. SIAM Journal on Optimization. 2018;28(2):1892–1919.
3: Laajala TD, Murtojarvi M, Virkki A, Aittokallio T. ePCR: an R-package for survival and time-to-event prediction in advanced prostate cancer, applied to real-world patient cohorts. Bioinformatics. 2018;34(22):3957-3959.
sessionInfo()
## R version 4.2.0 beta (2022-04-12 r82161 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=English_Finland.utf8
## [3] LC_MONETARY=English_Finland.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Finland.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] oscar_1.0.4
##
## loaded via a namespace (and not attached):
## [1] lattice_0.20-45 digest_0.6.29 grid_4.2.0 R6_2.5.1
## [5] jsonlite_1.8.0 magrittr_2.0.3 evaluate_0.15 highr_0.9
## [9] stringi_1.7.6 rlang_1.0.2 cli_3.3.0 jquerylib_0.1.4
## [13] Matrix_1.4-1 bslib_0.3.1 rmarkdown_2.13 splines_4.2.0
## [17] tools_4.2.0 stringr_1.4.0 hamlet_0.9.6 survival_3.3-1
## [21] xfun_0.30 yaml_2.3.5 fastmap_1.1.0 compiler_4.2.0
## [25] htmltools_0.5.2 knitr_1.38 sass_0.4.1