The rsample package provides a number of resampling methods which are broadly applicable to a wide variety of modeling applications. This vignette walks through the most popular methods in the package, with brief descriptions of how they can be applied. For a more in-depth overview of resampling, check out the matching chapters in Tidy Modeling with R and Feature Engineering and Selection.
Let’s go ahead and load rsample now:
library(rsample)
As well as dplyr, for the pipe operator %>%
:
library(dplyr)
We’ll also load in a few data sets from the modeldata package. First, the Ames housing data, containing the sale prices of homes in Ames, Iowa:
data(ames, package = "modeldata")
head(ames, 2)
#> # A tibble: 2 × 74
#> MS_SubC…¹ MS_Zo…² Lot_F…³ Lot_A…⁴ Street Alley Lot_S…⁵ Land_…⁶ Utili…⁷ Lot_C…⁸
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 One_Stor… Reside… 141 31770 Pave No_A… Slight… Lvl AllPub Corner
#> 2 One_Stor… Reside… 80 11622 Pave No_A… Regular Lvl AllPub Inside
#> # … with 64 more variables: Land_Slope <fct>, Neighborhood <fct>,
#> # Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> # Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> # Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#> # Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#> # Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
#> # BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, …
#> # ℹ Use `colnames()` to see all variable names
Secondly, data on Chicago transit ridership numbers:
data(Chicago, package = "modeldata")
head(Chicago, 2)
#> # A tibble: 2 × 50
#> rider…¹ Austin Quinc…² Belmont Arche…³ Oak_P…⁴ Western Clark…⁵ Clinton Merch…⁶
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 15.7 1.46 8.37 4.60 2.01 1.42 3.32 15.6 2.40 6.48
#> 2 15.8 1.50 8.35 4.72 2.09 1.43 3.34 15.7 2.40 6.48
#> # … with 40 more variables: Irving_Park <dbl>, Washington_Wells <dbl>,
#> # Harlem <dbl>, Monroe <dbl>, Polk <dbl>, Ashland <dbl>, Kedzie <dbl>,
#> # Addison <dbl>, Jefferson_Park <dbl>, Montrose <dbl>, California <dbl>,
#> # temp_min <dbl>, temp <dbl>, temp_max <dbl>, temp_change <dbl>, dew <dbl>,
#> # humidity <dbl>, pressure <dbl>, pressure_change <dbl>, wind <dbl>,
#> # wind_max <dbl>, gust <dbl>, gust_max <dbl>, percip <dbl>, percip_max <dbl>,
#> # weather_rain <dbl>, weather_snow <dbl>, weather_cloud <dbl>, …
#> # ℹ Use `colnames()` to see all variable names
In addition to these data sets from the modeldata package, we’ll also make use of the Orange data set in base R, containing repeated measurements of 5 orange trees over time:
head(Orange, 2)
#> Tree age circumference
#> 1 1 118 30
#> 2 1 484 58
And last but not least, we’ll set a seed so our results are reproducible:
set.seed(123)
By far and away, the most common use for rsample is to generate simple random resamples of your data. The rsample package includes a number of functions specifically for this purpose.
To split your data into two sets – often referred to as the
“training” and “testing” sets – rsample provides the
initial_split()
function:
initial_split(ames)
#> <Training/Testing/Total>
#> <2197/733/2930>
The output of this is an rsplit object
with each observation assigned to one of the two sets. You can control
the proportion of data assigned to the “training” set through the
prop
argument:
initial_split(ames, prop = 0.8)
#> <Training/Testing/Total>
#> <2344/586/2930>
To get the actual data assigned to either set, use the
training()
and testing()
functions:
<- initial_split(ames, prop = 0.6)
resample
head(training(resample), 2)
#> # A tibble: 2 × 74
#> MS_SubC…¹ MS_Zo…² Lot_F…³ Lot_A…⁴ Street Alley Lot_S…⁵ Land_…⁶ Utili…⁷ Lot_C…⁸
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 One_Stor… Reside… 110 14333 Pave No_A… Regular Lvl AllPub Corner
#> 2 One_Stor… Reside… 65 8450 Pave No_A… Regular Lvl AllPub Inside
#> # … with 64 more variables: Land_Slope <fct>, Neighborhood <fct>,
#> # Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> # Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> # Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#> # Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#> # Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
#> # BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, …
#> # ℹ Use `colnames()` to see all variable names
head(testing(resample), 2)
#> # A tibble: 2 × 74
#> MS_SubC…¹ MS_Zo…² Lot_F…³ Lot_A…⁴ Street Alley Lot_S…⁵ Land_…⁶ Utili…⁷ Lot_C…⁸
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 One_Stor… Reside… 141 31770 Pave No_A… Slight… Lvl AllPub Corner
#> 2 One_Stor… Reside… 80 11622 Pave No_A… Regular Lvl AllPub Inside
#> # … with 64 more variables: Land_Slope <fct>, Neighborhood <fct>,
#> # Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> # Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#> # Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#> # Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#> # Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>,
#> # BsmtFin_SF_1 <dbl>, BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, …
#> # ℹ Use `colnames()` to see all variable names
You should only evaluate models against your test set once, when you’ve completely finished tuning and training your models. However, it’s possible to have additional sets of data “held out” from the model training process, which can be used to evaluate models multiple times before you’re ready to evaluate against the final test set.
These sets of data are often called “validation sets”, and can be
created in rsample via validation_split()
:
validation_split(ames, prop = 0.8)
#> # Validation Set Split (0.8/0.2)
#> # A tibble: 1 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2344/586]> validation
These validation splits separate your data into “analysis” and “assessment” sets, which you can use to fit models and assess their accuracy while still preserving your initial hold-out test set.
Just like initial_split()
, you can control the amount of
data assigned to each set using the prop
argument. Unlike
the output from initial_split()
, however, the output from
validation_split()
is an rset
object, which can then be used by other packages in the tidymodels
universe (such as tune) to
evaluate model performance.
For hyperparameter tuning and model fitting, it’s often useful to assess your model against more than just a single validation set in order to get a more stable estimate of model performance. As a result, modelers often use a process known as cross-validation, where your data is split into analysis and assessment sets multiple times.
Perhaps the most common cross-validation method is V-fold cross-validation. Also known as “k-fold cross-validation”, this method creates V resamples by splitting your data into V groups (also known as “folds”) of roughly equal size. The analysis set of each resample is made up of V-1 folds, with the remaining fold being used as the assessment set. This way, each observation in your data is used in exactly one assessment set.
To use V-fold cross-validation in rsample, use the
vfold_cv()
function:
vfold_cv(ames, v = 2)
#> # 2-fold cross-validation
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1465/1465]> Fold1
#> 2 <split [1465/1465]> Fold2
One downside to V-fold cross validation is that it tends to produce
“noisy”, or high-variance, estimates when
compared to other resampling methods. To try and reduce that
variance, it’s often helpful to perform what’s known as repeated
cross-validation, effectively running the V-fold resampling
procedure multiple times for your data. To perform repeated V-fold
cross-validation in rsample, you can use the repeats argument inside of
vfold_cv()
:
vfold_cv(ames, v = 2, repeats = 2)
#> # 2-fold cross-validation repeated 2 times
#> # A tibble: 4 × 3
#> splits id id2
#> <list> <chr> <chr>
#> 1 <split [1465/1465]> Repeat1 Fold1
#> 2 <split [1465/1465]> Repeat1 Fold2
#> 3 <split [1465/1465]> Repeat2 Fold1
#> 4 <split [1465/1465]> Repeat2 Fold2
An alternative to V-fold cross-validation is Monte-Carlo cross-validation. Where V-fold assigns each observation in your data to one (and exactly one) assessment set, Monte-Carlo cross-validation takes a random subset of your data for each assessment set, meaning each observation can be used in 0, 1, or many assessment sets. The analysis set is then made up of all the observations that weren’t selected. Because each assessment set is sampled independently, you can repeat this as many times as you want.
To use Monte-Carlo cross-validation in rsample, use the
mc_cv()
function:
mc_cv(ames, prop = 0.8, times = 2)
#> # Monte Carlo cross-validation (0.8/0.2) with 2 resamples
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2344/586]> Resample1
#> 2 <split [2344/586]> Resample2
Just as with validation_set()
, you can control the
proportion of your data assigned to the analysis fold using
prop
. You can also control the number of resamples you
create using the times
argument.
Monte-Carlo cross-validation tends to produce more biased estimates than V-fold. As such, when computationally feasible we typically recommend using five or so repeats of 10-fold cross-validation for model assessment.
The last primary resampling technique in rsample is bootstrap resampling. A “bootstrap sample” is a sample of your data set, the same size as your data set, taken with replacement so that a single observation might be sampled multiple times. The assessment set is then made up of all the observations that weren’t selected for the analysis set. Generally, bootstrap resampling produces pessimistic estimates of model accuracy.
You can create bootstrap resamples in rsample using the
bootstraps()
function. While you can’t control the
proportion of data in each set – the assessment set of a bootstrap
resample is always the same size as the training data – the function
otherwise works exactly like mc_cv()
:
bootstraps(ames, times = 2)
#> # Bootstrap sampling
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2930/1081]> Bootstrap1
#> 2 <split [2930/1084]> Bootstrap2
If your data is heavily imbalanced (that is, if the distribution of an important continuous variable is skewed, or some classes of a categorical variable are much more common than others), simple random resampling may accidentally skew your data even further by allocating more “rare” observations disproportionately into the analysis or assessment fold. In these situations, it can be useful to instead use stratified resampling to ensure the analysis and assessment folds have a similar distribution as your overall data.
All of the functions discussed so far support stratified resampling
through their strata
argument. This argument takes a single
column identifier and uses it to stratify the resampling procedure:
vfold_cv(ames, v = 2, strata = Sale_Price)
#> # 2-fold cross-validation using stratification
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1464/1466]> Fold1
#> 2 <split [1466/1464]> Fold2
By default, rsample will cut continuous variables into four bins, and
ensure that each bin is proportionally represented in each set. If
desired, this behavior can be changed using the breaks
argument:
vfold_cv(ames, v = 2, strata = Sale_Price, breaks = 100)
#> # 2-fold cross-validation using stratification
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1439/1491]> Fold1
#> 2 <split [1491/1439]> Fold2
Often, some observations in your data will be “more related” to each other than would be probable under random chance, for instance because they represent repeated measurements of the same subject or were all collected at a single location. In these situations, you often want to assign all related observations to either the analysis or assessment fold as a group, to avoid having assessment data that’s closely related to the data used to fit a model.
All of the functions discussed so far have a “grouped resampling”
variation to handle these situations. These functions all start with the
group_
prefix, and use the argument group
to
specify which column should be used to group observations. Other than
respecting these groups, these functions all work like their ungrouped
variants:
<- group_initial_split(Orange, group = Tree)
resample
unique(training(resample)$Tree)
#> [1] 1 2 4 5
#> Levels: 3 < 1 < 5 < 2 < 4
unique(testing(resample)$Tree)
#> [1] 3
#> Levels: 3 < 1 < 5 < 2 < 4
It’s important to note that, while functions like
group_mc_cv()
and group_validation_split()
still let you specify what proportion of your data should be in the
analysis set (and group_bootstraps()
still attempts to
create analysis sets the same size as your original data), rsample won’t
“split” groups in order to exactly meet that proportion. These functions
start out by assigning one group at random to each set (or, for
group_vfold_cv()
, to each fold) and then assign each of the
remaining groups, in a random order, to whichever set brings the
relative sizes of each set closest to the target proportion. That means
that resamples are randomized, and you can safely use repeated
cross-validation just as you would with ungrouped resampling, but also
means you can wind up with very differently sized analysis and
assessment sets than anticipated if your groups are unbalanced:
set.seed(1)
group_bootstraps(ames, Neighborhood, times = 2)
#> # Group bootstrap sampling
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [2939/907]> Bootstrap1
#> 2 <split [2958/635]> Bootstrap2
While most of the grouped resampling functions are always focused on
balancing the proportion of data in the analysis set, by default
group_vfold_cv()
will attempt to balance the number of
groups assigned to each fold. If instead you’d like to balance the
number of observations in each fold (meaning your assessment sets will
be of similar sizes, but smaller groups will be more likely to be
assigned to the same folds than would happen under random chance), you
can use the argument balance = "observations"
:
group_vfold_cv(ames, Neighborhood, balance = "observations", v = 2)
#> # Group 2-fold cross-validation
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1475/1455]> Resample1
#> 2 <split [1455/1475]> Resample2
If you’re working with spatial data, your observations will often be more related to their neighbors than to the rest of the data set; as Tobler’s first law of geography puts it, “everything is related to everything else, but near things are more related than distant things.” However, you often won’t have a pre-defined “location” variable that you can use to group related observations. The spatialsample package provides functions for spatial cross-validation using rsample syntax and classes, and is often useful for these situations.
When working with time-based data, it usually doesn’t make sense to randomly resample your data: random resampling will likely result in your analysis set having observations from later than your assessment set, which isn’t a realistic way to assess model performance.
As such, rsample provides a few different functions to make sure that all data in your assessment sets are after that in the analysis set.
First off, two variants on initial_split()
and
validation_split()
, initial_time_split()
and
validation_time_split()
, will assign the first
rows of your data to the analysis set (with the number of rows assigned
determined by prop
):
initial_time_split(Chicago)
#> <Training/Testing/Total>
#> <4273/1425/5698>
validation_time_split(Chicago)
#> # Validation Set Split (0.75/0.25)
#> # A tibble: 1 × 2
#> splits id
#> <list> <chr>
#> 1 <split [4273/1425]> validation
There are also several functions in rsample to help you construct
multiple analysis and assessment sets from time-based data. For
instance, the sliding_window()
will create “windows” of
your data, moving down through the rows of the data frame:
sliding_window(Chicago) %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1/1]> Slice0001
#> 2 <split [1/1]> Slice0002
If you want to create sliding windows of your data based on a
specific variable, you can use the sliding_index()
function:
sliding_index(Chicago, date) %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [1/1]> Slice0001
#> 2 <split [1/1]> Slice0002
And if you want to set the size of windows based on units of time,
for instance to have each window contain a year of data, you can use
sliding_period()
:
sliding_period(Chicago, date, "year") %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [344/365]> Slice01
#> 2 <split [365/365]> Slice02
All of these functions produce analysis sets of the same size, with
the start and end of the analysis set “sliding” down your data frame. If
you’d rather have your analysis set get progressively larger, so that
you’re predicting new data based upon a growing set of older
observations, you can use the rolling_origin()
function:
rolling_origin(Chicago) %>%
head(2)
#> # A tibble: 2 × 2
#> splits id
#> <list> <chr>
#> 1 <split [5/1]> Slice0001
#> 2 <split [6/1]> Slice0002
Note that all of these time-based resampling functions are deterministic: unlike the rest of the package, running these functions repeatedly under different random seeds will always return the same results.