The goal of this vignette is to show most of all possibilies with aVT (for aVirtualTwins meaning adaptation of Virtual Twins method) package.
VT method (Jared Foster and al, 2011) has been created to find subgroup of patients with enhanced treatment effect, if it exists. Theorically, this method can be used for binary and continous outcome. This package only deals with binary outcome in a two arms clinical trial.
VT method is based on random forests and regression/classification trees.
I decided to use a simulated dataset called sepsis in order to show how aVT package can be used. Type ?sepsis
to know more about this dataset. Anyway, the true subgroup is PRAPACHE <= 26 & AGE <= 49.80
.
NOTE: This true subgroup is defined with the lower event rate (survival = 1
) in treatement arm. Therefore in following examples we’ll search the subgroup with the highest event rate, and we know it is PRAPACHE > 26 & AGE > 49.80
.
Data used in VT are modelized by \(\left\{Y, T, X_1, \ldots, X_{p-2}\right\}\). \(p\) is the number of variables.
factor
. Second level of this factor will be the desirable event. (\(Y=1\))NOTE: if you run VT with interactions, categorical covariables must be transformed into binary variables.
Type ?formatRCTDataset
for details.
Related functions/classes in aVirtualTwins package : VT.object()
, vt.data()
, formatRCTDataset
.
VT is a two steps method but with many possibilities
let \(\hat{P_{1i}} = P(Y_i = 1|T_i = 1, X_i)\)
let \(\hat{P_{0i}} = P(Y_i = 1|T_i = 0, X_i)\)
let \(X = \left\{X_1, \ldots, X_{p-2}\right\}\)
From one of these methods you can estimate \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\).
Related functions/classes in aVirtualTwins package : VT.difft()
, vt.forest()
.
Define \(Z_i = \hat{P_{1i}} - \hat{P_{0i}}\)
The idea is to identify which covariable from \(X\) described variation of \(Z\).
Related function in aVirtualTwins package : vt.tree()
.
See Introduction.
Sepsis dataset is a simulated clinical trial with two groups treatment about sepsis desease. See details. This dataset is taken from SIDES method
Sepsis contains simulated data on 470 subjects with a binary outcome survival, that stores survival status for patient after 28 days of treatment, value of 1 for subjects who died after 28 days and 0 otherwise. There are 11 covariates, listed below, all of which are numerical variables.
Note that contrary to the original dataset used in SIDES, missing values have been imputed by random forest randomForest::rfImpute()
. See file data-raw/sepsis.R for more details.
True subgroup is PRAPACHE <= 26 & AGE <= 49.80
. NOTE: This subgroup is defined with the lower event rate (survival = 1) in treatement arm.
470 patients and 13 variables:
survival
: binary outcomeTHERAPY
: 1 for active treatment, 0 for control treatmentTIMFIRST
: Time from first sepsis-organ fail to start drugAGE
: Patient age in yearsBLLPLAT
: Baseline local plateletsblSOFA
: Sum of baselin sofa (cardiovascular, hematology, hepaticrenal, and respiration scores)BLLCREAT
: Base creatinineORGANNUM
: Number of baseline organ failuresPRAPACHE
: Pre-infusion apache-ii scoreBLGCS
: Base GLASGOW coma scale scoreBLIL6
: Baseline serum IL-6 concentrationBLADL
: Baseline activity of daily living scoreBLLBILI
: Baseline local bilirubinSource: http://biopharmnet.com/subgroup-analysis-software/
In order to begin the two steps of VT method, aVirtualTwins package needs to be initialized with vt.data()
function. type ?vt.data
for more details.
NOTE: if running VT with interactions between \(T\) and \(X\), set interactions = TRUE
.
Code of vt.data()
:
vt.data <- function(dataset, outcome.field, treatment.field, interactions = TRUE, ...){
data <- formatRCTDataset(dataset, outcome.field, treatment.field, interactions = TRUE)
VT.object(data = data, ...)
}
Example with Sepsis
# load library VT
library(aVirtualTwins)
# load data sepsis
data(sepsis)
# initialize VT.object
vt.o <- vt.data(sepsis, "survival", "THERAPY", TRUE)
## "1" will be the favorable outcome
1 will be the favorable outcome because 1 is the second level of "survival"
column. It means that \(P(Y=1)\) is the probability of interest. Anyway, it’s still possible to compute \(P(Y=0)\).
Quick example
Sepsis does not have any categorical variable, following example show how vt.data
deals with categorical values depending on interactions
parameter
# Creation of categorical variable
cat.x <- rep(1:5, (nrow(sepsis))/5)
cat.x <- as.factor(cat.x)
sepsis.tmp <- cbind(sepsis, cat.x)
vt.o.tmp <- vt.data(sepsis.tmp, "survival", "THERAPY", TRUE)
## "1" will be the favorable outcome
## Creation of dummy variables for cat.x
## Dummy variable cat.x_1 created
## Dummy variable cat.x_2 created
## Dummy variable cat.x_3 created
## Dummy variable cat.x_4 created
## Dummy variable cat.x_5 created
Dummies variables are created for each category of cat.x
variable. And cat.x
is removed from dataset.
As described earlier, step 1 can be done via differents ways
Following example used sepsis data created in previous part.
To perform simple random forest on VT.object
, randomForest
, caret
and party
package can be used.
Class vt.forest("one", ...)
is used. It takes in arguments :
forest.type
: you have to set it to "one"
vt.data
: return of vt.data()
functionmodel
: a random forest modelinteractions
: logical, TRUE
is default value...
: options to randomForest()
functionwith randomForest
# use randomForest::randomForest()
library(randomForest, verbose = F)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
# Reproducibility
set.seed(123)
# Fit rf model
# default params
# set interactions to TRUE if using interaction between T and X
model.rf <- randomForest(x = vt.o$getX(interactions = T),
y = vt.o$getY(),
ntree = 500)
# initialize VT.forest.one
vt.f.rf <- vt.forest("one", vt.data = vt.o, model = model.rf, interactions = T)
### or you can use randomForest inside vt.forest()
vt.f.rf <- vt.forest("one", vt.data = vt.o, interactions = T, ntree = 500)
with party
cforest()
can be usefull however computing time is really long. I think there is an issue when giving cforest object in Reference Class parameter. Need to fix it.
# # use randomForest::randomForest()
# library(party, verbose = F)
# # Reproducibility
# set.seed(123)
# # Fit cforest model
# # default params
# # set interactions to TRUE if using interaction between T and X
# model.cf <- cforest(formula = vt.o$getFormula(), data = vt.o$getData(interactions = T))
# # initialize VT.forest.one
# vt.f.cf <- vt.forest("one", vt.data = vt.o, model = model.cf)
with caret
Using caret
can be usefull to deal with parallel computing for example.
NOTE: For caret
levels of outcome can’t be 0, so i’ll change levels name into “n”/“y”
# Copy new object
vt.o.tr <- vt.o$copy()
# Change levels
tmp <- ifelse(vt.o.tr$data$survival == 1, "y", "n")
vt.o.tr$data$survival <- as.factor(tmp)
rm(tmp)
# Check new data to be sure
formatRCTDataset(vt.o.tr$data, "survival", "THERAPY")
## "y" will be the favorable outcome
# use caret::train()
library(caret, verbose = F)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
# Reproducibility
set.seed(123)
# fit train model
fitControl <- trainControl(classProbs = T, method = "none")
model.tr <- train(x = vt.o.tr$getX(interactions = T),
y = vt.o.tr$getY(),
method = "rf",
tuneGrid = data.frame(mtry = 5),
trControl = fitControl)
# initialize VT.forest.one
vt.f.tr <- vt.forest("one", vt.o.tr, model = model.tr)
To perform double random forest on VT.object
, same packages as simple random forest can be used.
Function vt.forest("double", ...)
is used. It takes in arguments :
forest.type
: You have to set is to "double"
vt.data
: return of vt.data()
functionmodel_trt1
: a random forest model for \(T=1\) (this argument has to be specified)model_trt0
: a random forest model for \(T=0\) (this argument has to be specified)NOTE: use trt
parameter in VT.object::getX()
or VT.object::getY()
methods to obtain part of data depending on treatment. See following example.
with randomForest
# grow RF for T = 1
model.rf.trt1 <- randomForest(x = vt.o$getX(trt = 1),
y = vt.o$getY(trt = 1))
# grow RF for T = 0
model.rf.trt0 <- randomForest(x = vt.o$getX(trt = 0),
y = vt.o$getY(trt = 0))
# initialize VT.forest.double()
vt.doublef.rf <- vt.forest("double",
vt.data = vt.o,
model_trt1 = model.rf.trt1,
model_trt0 = model.rf.trt0)
### Or you can use randomForest() inside
vt.doublef.rf <- vt.forest("double",
vt.data = vt.o,
ntree = 200)
Follow the same structure for caret
or cforest
models.
This idea is taken from method 3 of Jared Foster paper :
A modification of [previous methods] is to obtain \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\) via cross-validation. In this méthod the specific data for subject \(i\) is not used to obtain \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\). Using k-fold cross-validation, we apply random forest regression approach to \(\frac{k-1}{k}\) of the data and use the resulting predictor to obtain estimates of \(P_{1i}\) and \(P_{0i}\) for the remaining \(\frac{1}{k}\) of the observations. This is repeated \(k\) times.
To use this approach, use vt.forest("fold", ...)
. This class takes in argument :
forest.type
: it has to be set to "fold"
vt.data
: return of vt.data()
functionfold
: number of fold (e.g. \(5\))ratio
: Control of sampsize balance. ratio
of \(2\) means that there 2 times le highest level compared to the other. “Highest” means the level with larger observations. It’s in test.interactions
: Logical. If TRUE
, interactions between covariables and treatments are used. FALSE
otherwise....
: randomForest()
function optionsNOTE: This function use only randomForest
package.
# initialize k-fold RF
# you can use randomForest options
model.fold <- vt.forest("fold", vt.data = vt.o, fold = 5, ratio = 1, interactions = T, ntree = 200)
Random Forests are not the only models you can use to compute \(\hat{P_{1i}}\) and \(\hat{P_{0i}}\). Any prediction model can be used, as logitic regression, boosting …
Anyway, aVirtualTwins package can be used. To do so, you can use VT.difft()
class. It is important to note this the parent class of all “forests” classes. It takes in argument :
vt.object
: return of vt.data()
functiontwin1
: estimate of \(P(Y_{i} = 1 | T = T_{i})\) : meaning response probability under the correct treatment.twin2
: estimate of \(P(Y_{i} = 1 | T = 1-T_{i})\) : meaning response probability under the other treatment.method
: absolute (default), relative or logit. See ?VT.difft
for details.# you get twin1 and twin2 by your own method
# here, i'll use random number between 0 and 1 :
twin1_random <- runif(470)
twin2_random <- runif(470)
# then you can initialize VT.difft class :
model.difft <- VT.difft(vt.o, twin1 = twin1_random, twin2 = twin2_random, "absolute")
# compute difference of twins :
model.difft$computeDifft()
# See results
head(model.difft$difft)
## [1] 0.4883771 0.7632997 -0.2555044 -0.1270976 -0.4710701 -0.1426843
# Graph :
# hist(model.difft$difft)
NOTE: Also, you can clone repository, write your own child class of VT.difft()
AND submit it !
As described in the method, we define \(Z_i = \hat{P_{1i}} - \hat{P_{0i}}\). It’s the difference in term of response of the active treatments compared to the control treatment. The idea is to try to explain this difference by few covariables.
We define a new variable \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise. Classification tree’s goal is to explain the value \(Z^*=1\). \(c\) is a threshold given by the user. It’s the threshold for which the difference is “interesting”. One idea is to use quantiles of the difft distribution.
To compute a classifiction tree, vt.tree("class", ...)
is used. Internally, rpart::rpart()
is computed. It takes in argument:
tree.type
: it has to be set to "class"
vt.difft
: VT.difft
object (return of vt.forest()
function)sens
: c(">", "<")
. sens
corresponds to the way \(Z^{*}\) is defined.
">"
(default) : \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i > c\) and \(Z^{*}_i=0\) otherwise."<"
: \(Z^{*}\), \(Z^{*}_i=1\) if \(Z_i < c\) and \(Z^{*}_i=0\) otherwise.threshold
: corresponds to \(c\), it can be a vector. \(seq(.5, .8, .1)\) by default.screening
: NULL
is default value. If TRUE
only covariables in varimp
vt.data
’s field is used.See ?VT.tree
for details.
# initialize classification tree
tr.class <- vt.tree("class",
vt.difft = vt.f.rf,
sens = ">",
threshold = quantile(vt.f.rf$difft, seq(.5, .8, .1)),
maxdepth = 3,
cp = 0,
maxcompete = 2)
# tr.class is a list if threshold is a vectoor
class(tr.class)
## [1] "list"
# acce trees with treeXX
class(tr.class$tree1)
## [1] "VT.tree.class"
## attr(,"package")
## [1] "aVirtualTwins"
Use regression tree to explain \(Z\) by covariables \(X\). Then some leafs have predicted \(Z_i\) greater than the threshold \(c\) (if \(sens\) is “>”), and it defines which covariables explain \(Z\).
The function to use is vt.tree("reg", ...)
. It takes same parameters than classification mehod.
# initialize regression tree
tr.reg <- vt.tree("reg",
vt.difft = vt.f.rf,
sens = ">",
threshold = quantile(vt.f.rf$difft, seq(.5, .8, .1)))
# tr.class is a list if threshold is a vectoor
class(tr.reg)
## [1] "list"
# access trees with treeXX
class(tr.reg$tree1)
## [1] "VT.tree.reg"
## attr(,"package")
## [1] "aVirtualTwins"
Once trees have been computed, you surely want to see what are the subgroups. This package provides a wrapper function of intern methods of VT.tree
class : vt.subgroups()
.
This function takes in argument :
vt.tree
: object or list of class VT.tree
. Return of the vt.tree()
function.only.leaf
: logical. Set TRUE
(default) to visualize only terminal nodes.only.fav
: logical. Set TRUE
(default) to visualize only class 1 nodes. (\(\hat{A}\))tables
: logical. Set FALSE
(default) to prevent tables of incidences from being printed.verbose
: logical. Set FALSE
(default) to prevent detailed stuffs from being printed.compete
: logical. Set TRUE
to print competitors rules thanks to competitors split. FALSE
is default value.If vt.tree
is a list, unique subgroups are printed.
# use tr.class computed previously
vt.sbgrps <- vt.subgroups(tr.class)
# print tables with knitr package
library(knitr)
knitr::kable(vt.sbgrps)
Subgroup | Subgroup size | Treatement event rate | Control event rate | Treatment sample size | Control sample size | RR (resub) | RR (snd) | |
---|---|---|---|---|---|---|---|---|
tree1.11 | PRAPACHE< 26.5 & AGE>=51.74 & BLGCS< 11.5 | 49 | 0.485 | 0.375 | 33 | 16 | 1.293 | 1.043 |
tree1.3 | PRAPACHE>=26.5 | 157 | 0.752 | 0.327 | 105 | 52 | 2.300 | 1.865 |
tree2.11 | PRAPACHE< 26.5 & BLGCS< 10.5 & AGE>=64.68 | 14 | 0.7 | 0.5 | 10 | 4 | 1.400 | 1.200 |
tree2.13 | PRAPACHE>=26.5 & AGE< 51.74 & BLLCREAT< 2.3 | 21 | 0.375 | 0.6 | 16 | 5 | 0.625 | 1.506 |
tree2.7 | PRAPACHE>=26.5 & AGE>=51.74 | 120 | 0.897 | 0.31 | 78 | 42 | 2.894 | 2.007 |
tree3.13 | PRAPACHE>=26.5 & AGE< 51.74 & BLLBILI>=2.95 | 13 | 0.333 | 0.25 | 9 | 4 | 1.332 | 1.564 |
tree4 | PRAPACHE>=26.5 & AGE>=51.74 & PRAPACHE>=28.5 | 89 | 0.915 | 0.333 | 59 | 30 | 2.748 | 2.087 |
You can plot one tree with package rpart.plot
## Loading required package: rpart
If you want to see competitors split :
tr.class$tree2$createCompetitors()
head(tr.class$tree2$competitors)
## count ncat improve index var path string
## 1 470 -1 119.987009 26.500 PRAPACHE 2 PRAPACHE < 26.5
## 2 470 1 37.018261 10.500 BLGCS 2 BLGCS >= 10.5
## 3 470 -1 30.491058 61.671 AGE 2 AGE < 61.671
## 9 313 1 6.349646 10.500 BLGCS 4 BLGCS >= 10.5
## 10 313 -1 5.075061 61.671 AGE 4 AGE < 61.671
## 11 313 -1 2.027870 19.500 PRAPACHE 4 PRAPACHE < 19.5
If you want to print incidence of a subgroup :
vt.o$getIncidences("PRAPACHE >= 26 & AGE >= 52")
## $table.selected
## $table.selected$table
## trt
## resp 0 1 sum
## 0 30.000 18.000 48.000
## 1 15.000 72.000 87.000
## sum 45.000 90.000 135.000
## Incidence 0.333 0.800 0.644
##
## $table.selected$rr
## [1] 2.402402
##
##
## $table.not.selected
## $table.not.selected$table
## trt
## resp 0 1 sum
## 0 71.000 170.000 241.000
## 1 37.000 57.000 94.000
## sum 108.000 227.000 335.000
## Incidence 0.343 0.251 0.281
##
## $table.not.selected$rr
## [1] 0.7317784
# or
# tr.class$tree2$getIncidences("PRAPACHE >= 26 & AGE >= 52")
If you want to get infos about the tree
tr.class$tree2$getInfos()
##
## Threshold = 0.0675
## Delta = 0.0671
## Sens : >
## Size of Ahat : 155
# access Ahat
# tr.class$tree2$Ahat
You can re-run rpart computation:
tr.class$tree2$run(maxdepth = 2)
Type ?VT.tree
for details.
This vignette is a bit messy right now, therefore feel free to ask anything to the repository issue reports :
?aVirtualTwins