Abstract
In this vignette, we go through creating balanced partitions for
train/test sets and balanced folds for cross-validation with the R
package groupdata2
. We write a simple cross-validation
function and use it on some linear regression models.
groupdata2
is a set of methods for easy grouping,
windowing, folding, partitioning and splitting of data.
For a more extensive description of groupdata2
, please see
Description of groupdata2
Contact author at r-pkgs@ludvigolsen.dk
In this vignette, we will train a couple of linear regression models on some data. We will use cross-validation with balanced folds to find the best model, and then test that model on a subset of the original data that we haven’t used for training the model.
Our first task will be to split the dataset into a training set and a
test set using partition()
. Then, we will create folds
using the fold()
function. We will code up a simple
cross-validation function and put some models to the test!
While we will be writing our own code for cross-validation, it is
recommended to use the cvms
package for this functionality, when working with linear or logistic
regression models.
partition() creates (optionally) balanced partitions (e.g. train/test sets) from given group sizes. It can balance partitions on one categorical variable (e.g. diagnosis) and/or one numerical variable (e.g. age). It is also able to keep all datapoints with a shared ID (e.g. participant_id) in the same partition.
fold() creates (optionally) balanced folds for cross-validation. It can balance folds on one categorical variable (e.g. diagnosis) and/or one numerical variable (e.g. age). It is also able to keep all datapoints with a shared ID (e.g. participant_id) in the same fold. If we want to run repeated cross-validation, it can create multiple unique fold columns at once.
The essence of cross-validation is to test a model against data that it hasn’t been trained on, i.e. estimating out-of-sample error. It is done by first dividing the data into groups called folds.
Say we choose to divide the data into 5 folds. Then, in the first iteration, we train a model on the first four folds and test it on the fifth fold. In the second iteration, we then train on folds 2,3,4,5 and test on fold 1. We continue changing which fold is the test fold until all folds have been test folds (i.e. we train and test 5 times in total). In the end we get the average performance of the models and compare these to other cross-validated models.
The model with the lowest average error is believed to be the best at predicting unseen data from the same population(s) and thus chosen for further interpretation / use.
This is a great technique, and fold()
makes it even more
powerful.
Even with cross-validation, we have a risk of overfitting our models to our data. That is, while we get really good predictions for our current data, it won’t generalize to new data that we might gather in the future.
To test if this is the case, we store some of the data and don’t touch it before we feel confident that we have found the best model. Then we use the model to predict the targets / dependent variable in that data. If the results are much worse on the test set, it likely means that our model is suffering from overfitting! Then we might go back and adjust the models, until we’re pretty confident again. We might also change the number of folds or run repeated cross-validation instead.
Generally though, we should use the test set as few times (>1) as possible, as running this cycle too many times could accidentally find a model that is a good predictor of the specific datapoints in the test set, but not in general.
If the test set results are instead somewhat similar to the cross-validation results, these are the results that we report (possibly along with the cross-validation results).
Let’s say, we have scored 10 participants with either of two
diagnoses 'a'
and 'b'
on a very interesting
task, that you are free to call ‘the task’.
# Attach some packages
library(groupdata2)
library(dplyr)
library(ggplot2)
library(knitr) # kable()
library(lmerTest) #lmer()
library(broom) #tidy()
library(hydroGOF) # rmse()
# Create data frame
<- data.frame("participant" = factor(as.integer(
df rep(c('1','2', '3', '4', '5',
'6', '7', '8', '9', '10'), 3))),
"age" = rep(c(20,23,27,21,32,31,43,21,34,32), 3),
"diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b',
'a', 'a', 'a', 'b', 'b'), 3)),
"score" = c(10,24,15,35,24,14,11,16,33,29, # for 1st session
24,40,30,50,54,25,35,32,53,55, # for 2nd session
45,67,40,78,62,30,41,44,66,81)) # for 3rd session
# Order by participant
<- df %>% arrange(participant)
df
# Remove index
rownames(df) <- NULL
# Add session info
$session <- as.integer(rep(c('1','2', '3'), 10))
df
# Show the data frame
kable(df, align = 'c')
participant | age | diagnosis | score | session |
---|---|---|---|---|
1 | 20 | a | 10 | 1 |
1 | 20 | a | 24 | 2 |
1 | 20 | a | 45 | 3 |
2 | 23 | b | 24 | 1 |
2 | 23 | b | 40 | 2 |
2 | 23 | b | 67 | 3 |
3 | 27 | a | 15 | 1 |
3 | 27 | a | 30 | 2 |
3 | 27 | a | 40 | 3 |
4 | 21 | b | 35 | 1 |
4 | 21 | b | 50 | 2 |
4 | 21 | b | 78 | 3 |
5 | 32 | b | 24 | 1 |
5 | 32 | b | 54 | 2 |
5 | 32 | b | 62 | 3 |
6 | 31 | a | 14 | 1 |
6 | 31 | a | 25 | 2 |
6 | 31 | a | 30 | 3 |
7 | 43 | a | 11 | 1 |
7 | 43 | a | 35 | 2 |
7 | 43 | a | 41 | 3 |
8 | 21 | a | 16 | 1 |
8 | 21 | a | 32 | 2 |
8 | 21 | a | 44 | 3 |
9 | 34 | b | 33 | 1 |
9 | 34 | b | 53 | 2 |
9 | 34 | b | 66 | 3 |
10 | 32 | b | 29 | 1 |
10 | 32 | b | 55 | 2 |
10 | 32 | b | 81 | 3 |
As we can see, there are 5 participants for each diagnosis, and they all seem to have gotten better at the task throughout the three sessions they went through.
In this step, we will split our data into two subsets called
train_set
and test_set
. The main point will
be, that we must avoid leakage between the two sets, before it is of any
use to us.
If we were to split the data randomly, with 80 percent going to the training set, and 20 percent going to the test set, we would likely have some of the participants in both the training set AND the test set. But our motivation for having a test set is that we want to know how well our model predicts new, future participants, not how well it knows the ones it saw during training. Furthermore, if our model is overfitted to those participants, our test set might not warn us of this. With this approach, we could get a really low error on both the test set and the training set even though our model is useless outside of the data, we’re working with.
So, how do we deal with this? In this case, it’s as simple as making
sure each participant is only in one of the data sets. We can do this
with partition()
and the id_col
argument.
::set_test_seed(1) # For reproducibility
xpectr
# Split data in 20/80 (percentage)
partition(df, p = 0.2, id_col = "participant") %>%
1] %>% # See only the test set
.[kable() # Pretty tables :)
|
The above shows the test set. It contains 2 participants (10 and 5). They both have all 3 sessions in this subset, meaning they are not in the training set! Perfect!
But ehmm.. if we look at the diagnosis column, they both have the
diagnosis 'b'
. If we would like to know how well our future
model classifies both of the diagnoses, this won’t do. It would be nicer
if we have somewhat the same balance of both diagnoses in both the
training set and the test set. This is luckily what the
cat_col
argument is for.
More specifically (behind the scenes), it first subsets the full
dataset by each class in the categorical column. Then, it creates
partitions in each subset and merges the partitions in the end. When
using both the id_col
and cat_col
arguments,
it first subsets by class, then partitions by the unique values in
id_col
. This means, that the final partition sizes might
not always be exactly as specified, depending on the data. Therefore, it
is good practice to look at the distribution of participants, classes,
etc. in the final partitions, which we will do after trying out that
cat_col
argument to get our final test and train sets!
::set_test_seed(1) # For reproducibility
xpectr
# Split data in 20/80 (percentage)
<- partition(df, p = 0.2, id_col = "participant", cat_col = 'diagnosis')
parts
<- parts[[1]]
test_set <- parts[[2]]
train_set
# Show test_set
%>% kable() test_set
participant | age | diagnosis | score | session |
---|---|---|---|---|
8 | 21 | a | 16 | 1 |
8 | 21 | a | 32 | 2 |
8 | 21 | a | 44 | 3 |
10 | 32 | b | 29 | 1 |
10 | 32 | b | 55 | 2 |
10 | 32 | b | 81 | 3 |
Now, our test set contains 2 participants (8 and 10), and the diagnosis column contains 50 percent a’s and 50 percent b’s. Let’s count the diagnoses in the training set as well.
%>%
train_set count(diagnosis) %>%
kable(align = 'c')
diagnosis | n |
---|---|
a | 12 |
b | 12 |
There are 12 rows for each diagnosis in the training set. In this case, we know that the rest of the participants are all “fully” in this dataset. But let’s count them anyway, so we will remember to do it in the future.
%>%
train_set count(participant) %>%
kable(align = 'c')
participant | n |
---|---|
1 | 3 |
2 | 3 |
3 | 3 |
4 | 3 |
5 | 3 |
6 | 3 |
7 | 3 |
9 | 3 |
In this section, we will create balanced folds for cross-validation.
The thought process resembles that in the previous section.
fold()
basically just creates a number of similarly sized
partitions. Where partition()
returned a list
of data frame
s (this is optional, btw.),
fold()
will return the entire training set with a new
column called ".folds"
. This will be used directly in the
cross-validation function to subset the data on each iteration.
Remember, that the folds take shifts at being the cross-validation test
set (not to be confused with the test set we created in the previous
step).
As we’re basically recreating the training / test set scenario that we discussed previously, we want to avoid leakage between the folds. We would also like somewhat balanced distributions of the diagnoses between the folds, though this depends on the context. Let’s create our balanced folds:
::set_test_seed(1) # For reproducibility
xpectr
<- fold(train_set, k = 4, cat_col = 'diagnosis', id_col = 'participant')
train_set
# Order by .folds
<- train_set %>% arrange(.folds)
train_set
%>% kable() train_set
participant | age | diagnosis | score | session | .folds |
---|---|---|---|---|---|
7 | 43 | a | 11 | 1 | 1 |
7 | 43 | a | 35 | 2 | 1 |
7 | 43 | a | 41 | 3 | 1 |
2 | 23 | b | 24 | 1 | 1 |
2 | 23 | b | 40 | 2 | 1 |
2 | 23 | b | 67 | 3 | 1 |
1 | 20 | a | 10 | 1 | 2 |
1 | 20 | a | 24 | 2 | 2 |
1 | 20 | a | 45 | 3 | 2 |
5 | 32 | b | 24 | 1 | 2 |
5 | 32 | b | 54 | 2 | 2 |
5 | 32 | b | 62 | 3 | 2 |
6 | 31 | a | 14 | 1 | 3 |
6 | 31 | a | 25 | 2 | 3 |
6 | 31 | a | 30 | 3 | 3 |
4 | 21 | b | 35 | 1 | 3 |
4 | 21 | b | 50 | 2 | 3 |
4 | 21 | b | 78 | 3 | 3 |
3 | 27 | a | 15 | 1 | 4 |
3 | 27 | a | 30 | 2 | 4 |
3 | 27 | a | 40 | 3 | 4 |
9 | 34 | b | 33 | 1 | 4 |
9 | 34 | b | 53 | 2 | 4 |
9 | 34 | b | 66 | 3 | 4 |
The training set now contains the ".folds"
column. As
fold()
groups the output data frame
by this
column, we can count the number of rows for each diagnosis and
participant per fold like so:
%>%
train_set count(participant, diagnosis) %>%
kable(align = 'c')
.folds | participant | diagnosis | n |
---|---|---|---|
1 | 2 | b | 3 |
1 | 7 | a | 3 |
2 | 1 | a | 3 |
2 | 5 | b | 3 |
3 | 4 | b | 3 |
3 | 6 | a | 3 |
4 | 3 | a | 3 |
4 | 9 | b | 3 |
Fold 1 contains participants 2 and 7 with 3 rows each. Participant 2
has the diagnosis 'b'
and participant 7 has the diagnosis
'a'
. This pattern is the same for all folds. This dataset
was created for the purpose of showing these functions, so we should
expect real world data to be less perfectly distributed and remember to
run some checks of the created folds and partitions. Also be aware, that
the maximum number of folds is restricted by the number of participants
in the classes, if using the id_col
and the
cat_col
arguments together. Remember, that the function
first subsets the dataset by cat_col
and then creates
groups (folds) from the unique values in id_col
in each
subset, before merging the subsets. So if you only have 3 participants
in one of the classes, this will be the maximum number of folds you can
create.
What’s left, you ask? Well, now we get to have fun, training and
comparing models using cross-validation! If you just came for
partition()
and fold()
, you can skip this
part. Personally, I will move on and walk you (you won’t let me walk
alone, will you? pleaase?) through a simple cross-validation function
for testing a set of models. The models will be predicting the scores of
the participants.
We have 4 folds in our training set, so we want to train our models on 3 of the folds and test on the last fold. All the folds should be used as test fold once. While there are often faster alternatives to a for loop, we will use it to illustrate the process. We will create a simple cross-validation function where we can specify the model to test, and whether it has random effects. The performance of the model will be measured with RMSE (Root Mean Square Error).
<- function(data, k, model, dependent, random = FALSE){
crossvalidate # 'data' is the training set with the ".folds" column
# 'k' is the number of folds we have
# 'model' is a string describing a linear regression model formula
# 'dependent' is a string with the name of the score column we want to predict
# 'random' is a logical (TRUE/FALSE); do we have random effects in the model?
# Initialize empty list for recording performances
<- c()
performances
# One iteration per fold
for (fold in 1:k){
# Create training set for this iteration
# Subset all the datapoints where .folds does not match the current fold
<- data[data$.folds != fold,]
training_set
# Create test set for this iteration
# Subset all the datapoints where .folds matches the current fold
<- data[data$.folds == fold,]
testing_set
## Train model
# If there is a random effect,
# use lmer() to train model
# else use lm()
if (isTRUE(random)){
# Train linear mixed effects model on training set
<- lmer(model, training_set, REML = FALSE)
model
else {
}
# Train linear model on training set
<- lm(model, training_set)
model
}
## Test model
# Predict the dependent variable in the testing_set with the trained model
<- predict(model, testing_set, allow.new.levels = TRUE)
predicted
# Get the Root Mean Square Error between the predicted and the observed
<- rmse(predicted, testing_set[[dependent]])
RMSE
# Add the RMSE to the performance list
<- RMSE
performances[fold]
}
# Return the mean of the recorded RMSEs
return(c('RMSE' = mean(performances)))
}
We could have a hypothesis that people with the diagnosis
'b'
tend to be better at the experiment. This could be
tested by a simple linear model.
lm(score ~ diagnosis, df) %>%
tidy()
#> # A tibble: 2 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 27.5 4.07 6.74 0.000000253
#> 2 diagnosisb 22.6 5.76 3.92 0.000515
The linear model supports the hypothesis, as the average score of
participants with diagnosis 'b'
is larger than that of
participants with diagnosis 'a'
, and as this difference is
statistically significant, we can reasonably reject the null-hypothesis
that there is no difference between the two diagnoses.
To improve our model, we might want to include the information we
have about age
and session
. Perhaps, the older
participants do better than the younger? And perhaps, participants with
the diagnosis 'b'
are better at learning over time
(session
)? By including such information in our model, we
might be better at predicting the score. We could also use
participant
as a random effect, to factor out the personal
differences.
Let’s list a bunch of possible model formulas that we can then compare with cross-validation. The cross-validation function needs the model to be passed in the format below (a string). Instead of looking at summaries for each model, we will find the best model with cross-validation and only look at the summary for that one. Notice that when we want to compare models, we want to keep the same random effects for all the models, so we are only comparing the combination of fixed effects.
<- 'score ~ 1 + (1 | participant)'
m0 <- 'score ~ diagnosis + (1 | participant)'
m1 <- 'score ~ diagnosis + age + (1 | participant)'
m2 <- 'score ~ diagnosis + session + (1 | participant)'
m3 <- 'score ~ diagnosis * session + (1 | participant)'
m4 <- 'score ~ diagnosis * session + age + (1 | participant)' m5
Now, let’s cross-validate the 6 models. If you get a message saying
"boundary (singular) fit"
, it’s because the model is too
complex for our small dataset. To avoid this message, try removing the
random effects from the models.
m0#> [1] "score ~ 1 + (1 | participant)"
crossvalidate(train_set, k = 4, model = m0, dependent = 'score', random = TRUE)
#> RMSE
#> 18.27222
m1#> [1] "score ~ diagnosis + (1 | participant)"
crossvalidate(train_set, k = 4, model = m1, dependent = 'score', random = TRUE)
#> RMSE
#> 14.7661
m2#> [1] "score ~ diagnosis + age + (1 | participant)"
crossvalidate(train_set, k = 4, model = m2, dependent = 'score', random = TRUE)
#> RMSE
#> 15.28533
m3#> [1] "score ~ diagnosis + session + (1 | participant)"
crossvalidate(train_set, k = 4, model = m3, dependent = 'score', random = TRUE)
#> RMSE
#> 6.176477
m4#> [1] "score ~ diagnosis * session + (1 | participant)"
crossvalidate(train_set, k = 4, model = m4, dependent = 'score', random = TRUE)
#> RMSE
#> 5.950014
m5#> [1] "score ~ diagnosis * session + age + (1 | participant)"
crossvalidate(train_set, k = 4, model = m5, dependent = 'score', random = TRUE)
#> RMSE
#> 7.004665
The model m4
has the lowest error on average and so, we
assume that it is the best predictor of out-of-sample data. To make sure
it doesn’t overfit the data, let’s look at the error on the test
set.
# Training the model on the full training set
<- lmer(m4, train_set, REML = FALSE)
model_m4
# Predict the dependent variable in the test set with the trained model
<- predict(model_m4, test_set, allow.new.levels = TRUE)
predicted
# Get the Root Mean Square Error between the predicted and the observed
<- rmse(predicted, test_set[['score']])
RMSE
RMSE#> [1] 6.418155
6.42 is a bit higher than the cross-validated average, but it’s so close that we’re not afraid of overfitting. Be aware that the scale of the RMSE is dependent on the scale of the dependent variable in our dataset, so you might find some models with much higher RMSEs than this one. What matters is the relative difference between models, and that the best model will have the lowest error.
Let’s look at its summary:
%>%
model_m4 summary()
#> Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
#> method [lmerModLmerTest]
#> Formula: m4
#> Data: train_set
#>
#> AIC BIC logLik deviance df.resid
#> 157.0 164.1 -72.5 145.0 18
#>
#> Scaled residuals:
#> Min 1Q Median 3Q Max
#> -1.8723 -0.6998 -0.0137 0.7042 1.6556
#>
#> Random effects:
#> Groups Name Variance Std.Dev.
#> participant (Intercept) 3.714 1.927
#> Residual 21.398 4.626
#> Number of obs: 24, groups: participant, 8
#>
#> Fixed effects:
#> Estimate Std. Error df t value Pr(>|t|)
#> (Intercept) 0.1667 3.6621 22.2764 0.046 0.9641
#> diagnosisb 9.4167 5.1790 22.2764 1.818 0.0825 .
#> session 13.2500 1.6355 16.0000 8.102 4.71e-07 ***
#> diagnosisb:session 6.3750 2.3129 16.0000 2.756 0.0141 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Correlation of Fixed Effects:
#> (Intr) dgnssb sessin
#> diagnosisb -0.707
#> session -0.893 0.632
#> dgnssb:sssn 0.632 -0.893 -0.707
In this model, we have a significant interaction between diagnosis and session. The interpretation of this result would be quite different from that of the first model, we tried.
Notice that the cross-validated m3
model is so close to
the cross-validated m4
model that we can’t be completely
sure which is better. The observed differences might as well stem from
the randomness in the partition()
or fold()
steps. In such a case, we could turn to repeated
cross-validation, where we create multiple fold columns (by setting
num_fold_cols
in fold()
) and average their
cross-validated results. If the models are still very close, we might
consider reporting all three models or at least checking if the
interpretations of the three models differ. These things are highly
dependent on the context.
When working with linear and logistic regression, the cvms
package contains functions for (repeated) cross-validation and is
specifically designed to work well with groupdata2
.
Well done, you made it to the end of this introduction to
groupdata2
! If you want to know more about the various
methods and arguments, you can read the Description of groupdata2.
If you have any questions or comments to this vignette (tutorial) or
groupdata2
, please send them to me at
r-pkgs@ludvigolsen.dk, or open an issue on the github
page https://github.com/LudvigOlsen/groupdata2 so I can make
improvements.