library(simpr)
set.seed(2001)
This vignette provides an overview of each step in the simpr
workflow:
specify()
your data-generating process
define()
parameters that you want to systematically vary across your simulation design (e.g. n, effect size)
generate()
the simulation data
fit()
models to your data (e.g. lm()
)
tidy_fits()
for further processing, such as computing power or Type I Error rates
specify()
your data-generating processspecify()
takes arbitrary R expressions that can be used for generating data. Each argument should have a name and be prefixed with ~
, the tilde operator. Order matters: later arguments can refer to earlier arguments, but not the other way around.
Good—b
specification refers to a
specification, and comes after a
:
specify(a = ~ runif(6),
b = ~ a + rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.532 0.600
#> 2 0.946 0.978
#> 3 0.451 1.86
#> 4 0.249 -1.02
#> 5 0.376 0.409
#> 6 0.824 0.433
Error—b
specification refers to a
specification, but comes before a
, so generate()
doesn’t know what a
is:
specify(b = ~ a + rnorm(6),
a = ~ runif(6)) %>%
generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "Error in eval_fun(): object 'a' not found\n"
All arguments must imply the same number of rows. Arguments that imply 1 row are recycled.
OK—both a
and b
imply 6 rows:
specify(a = ~ runif(6),
b = ~ rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.376 1.41
#> 2 0.824 -1.27
#> 3 0.527 0.0321
#> 4 0.725 -0.392
#> 5 0.513 0.205
#> 6 0.149 -0.186
OK—a
implies 1 row and b
implies 6 rows, so a
is recycled:
specify(a = ~ runif(1),
b = ~ rnorm(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.527 0.597
#> 2 0.527 -1.04
#> 3 0.527 -0.0124
#> 4 0.527 -0.0558
#> 5 0.527 0.457
#> 6 0.527 -0.578
Error—a
implies 2 rows and b
implies 6 rows:
specify(a = ~ runif(2),
b = ~ rnorm(6)) %>%
generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "Error in `stop_vctrs()`:\n! Can't recycle `..1` (size 2…
Using x
as an argument to specify()
is not recommended, because for technical reasons x
is always placed as the first argument. This means that if x
refers to prior variables it will return an error:
specify(y = ~ runif(6),
x = ~ y + runif(6)) %>%
generate(1)
#> Formula specification for 'x' detected. Assuming 'x' is the first formula.
#>
#> To hide this message, or to avoid moving this formula first, use a different variable name.
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#> .sim_id rep sim .sim_error
#> <int> <int> <list> <chr>
#> 1 1 1 <NULL> "Error in eval_fun(): object 'y' not found\n"
The same specification works fine when x
is renamed:
specify(y = ~ runif(6),
a = ~ y + runif(6)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> y a
#> <dbl> <dbl>
#> 1 0.102 0.683
#> 2 0.478 0.663
#> 3 0.513 0.939
#> 4 0.676 1.28
#> 5 0.348 0.976
#> 6 0.282 0.686
specify()
accepts expressions that generate multiple columns simultaneously in a matrix
, data.frame
, or tibble
. By default, the column names in the output append a number to the variable name.
Here’s an example using MASS::mvrnorm()
, which returns draws from the multivariate normal distribution as a matrix. MASS::mvrnorm()
determines the number of columns for the output data from the length of mu
and the dimensions of the variance-covariance matrix Sigma
.
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a_1 a_2 a_3
#> <dbl> <dbl> <dbl>
#> 1 0.477 -0.445 0.0321
#> 2 0.236 -0.280 -0.392
#> 3 -0.656 0.841 0.205
#> 4 1.80 1.51 -0.186
#> 5 -0.458 -0.597 0.327
#> 6 0.130 0.549 0.199
The argument was named a
in specify()
, so generate()
creates three variables named a_1
, a_2
, and a_3
.
You can change the separator between the argument name and the number via .sep
. Here, we change it from the default "_"
to "."
:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
.sep = ".") %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a.1 a.2 a.3
#> <dbl> <dbl> <dbl>
#> 1 0.236 -0.280 -0.392
#> 2 -0.656 0.841 0.205
#> 3 1.80 1.51 -0.186
#> 4 -0.458 -0.597 0.327
#> 5 0.130 0.549 0.199
#> 6 0.155 0.477 -0.445
Alternatively, you can give a two-sided formula to set names. The argument name is ignored, and the left hand side must use c
or cbind
.
specify(y = c(a, b, c) ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3)))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> a b c
#> <dbl> <dbl> <dbl>
#> 1 -0.656 0.841 0.205
#> 2 1.80 1.51 -0.186
#> 3 -0.458 -0.597 0.327
#> 4 0.130 0.549 0.199
#> 5 0.155 0.477 -0.445
#> 6 0.201 0.236 -0.280
If your expression already produces column names, those are used by default. The argument name is again ignored:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))) %>%
::set_colnames(c("d", "e", "f"))) %>%
magrittrgenerate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> d e f
#> <dbl> <dbl> <dbl>
#> 1 1.80 1.51 -0.186
#> 2 -0.458 -0.597 0.327
#> 3 0.130 0.549 0.199
#> 4 0.155 0.477 -0.445
#> 5 0.201 0.236 -0.280
#> 6 0.153 -0.656 0.841
This is useful for dealing with functions from other packages that already provide informative names (e.g., lavaan::simulateData()
). You can turn this behavior off with .use_names = FALSE
.
Whatever method you use, you can still refer to these generated variables in subsequent arguments to specify()
:
specify(a = ~ MASS::mvrnorm(6,
mu = rep(0, 3),
Sigma = diag(rep(1, 3))),
b = ~ a_1 - a_2) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 4]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 4
#> a_1 a_2 a_3 b
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.458 -0.597 0.327 0.139
#> 2 0.130 0.549 0.199 -0.420
#> 3 0.155 0.477 -0.445 -0.323
#> 4 0.201 0.236 -0.280 -0.0347
#> 5 0.153 -0.656 0.841 0.810
#> 6 1.32 1.80 1.51 -0.480
define()
parameters that you want to systematically varydefine()
creates metaparameters (also called simulation factors): values that you want to systematically vary. An obvious choice is sample size.
Instead of writing a number for the n
argument of rnorm
, we write a placeholder value samp_size
(this can be any valid R name), and we write a corresponding argument in define()
that contains the possible values that samp_size
can take on:
specify(a = ~ rnorm(samp_size)) %>%
define(samp_size = c(10, 20)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 4
#> .sim_id samp_size rep sim
#> <int> <dbl> <int> <list>
#> 1 1 10 1 <tibble [10 × 1]>
#> 2 2 20 1 <tibble [20 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#> a
#> <dbl>
#> 1 0.199
#> 2 -0.445
#> 3 -0.280
#> 4 0.841
#> 5 1.51
#> 6 -0.597
#> 7 0.549
#> 8 0.477
#> 9 0.236
#> 10 -0.656
Each argument to define()
is a vector or list with the desired values to be used in the expressions in specify()
. specify()
expressions can refer to the names of define arguments, and generate()
will substitute the possible values that argument when generating data.
When define()
has multiple arguments, each possible combination is generated:
specify(a = ~ rnorm(samp_size, mu)) %>%
define(samp_size = c(10, 20),
mu = c(0, 10)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 4 × 5
#> .sim_id samp_size mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 10 0 1 <tibble [10 × 1]>
#> 2 2 20 0 1 <tibble [20 × 1]>
#> 3 3 10 10 1 <tibble [10 × 1]>
#> 4 4 20 10 1 <tibble [20 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#> a
#> <dbl>
#> 1 -0.445
#> 2 -0.280
#> 3 0.841
#> 4 1.51
#> 5 -0.597
#> 6 0.549
#> 7 0.477
#> 8 0.236
#> 9 -0.656
#> 10 1.80
(If not all combinations are desired, see options for filtering at the generate()
step, below.)
define()
can also take lists for any type of value used in specify
. For instance, the argument s
is defined as a list with two variables representing two different possible correlation matrices, and we use that placeholder value s
in the specify()
statement:
specify(a = ~ MASS::mvrnorm(6, rep(0, 2), Sigma = s)) %>%
define(s = list(independent = diag(rep(1, 2)),
dependent = matrix(c(1, 0.5, 0.5, 1), nrow = 2))) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id s_index rep s sim
#> <int> <chr> <int> <list> <list>
#> 1 1 independent 1 <dbl [2 × 2]> <tibble [6 × 2]>
#> 2 2 dependent 1 <dbl [2 × 2]> <tibble [6 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a_1 a_2
#> <dbl> <dbl>
#> 1 -0.236 -0.280
#> 2 0.656 0.841
#> 3 -1.80 1.51
#> 4 0.458 -0.597
#> 5 -0.130 0.549
#> 6 -0.155 0.477
In the output, simpr
creates a column s_index
using the names of the list elements of s
to make the results easier to view and filter.
define()
also supports lists of functions. Here, the specify
command has a placeholder function distribution
, where distribution
is defined to be either rnorm()
or rlnorm()
(the lognormal distribution):
specify(y = ~ distribution(6)) %>%
define(distribution = list(normal = rnorm,
lognormal = rlnorm)) %>%
generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id distribution_index rep distribution sim
#> <int> <chr> <int> <list> <list>
#> 1 1 normal 1 <fn> <tibble [6 × 1]>
#> 2 2 lognormal 1 <fn> <tibble [6 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> y
#> <dbl>
#> 1 0.841
#> 2 1.51
#> 3 -0.597
#> 4 0.549
#> 5 0.477
#> 6 0.236
generate()
the simulation dataThe main argument to generate()
is .reps
, the number of repetitions for each combination of metaparameters in define()
.
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2)
#> full tibble
#> --------------------------
#> # A tibble: 8 × 5
#> .sim_id n mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 6 0 1 <tibble [6 × 1]>
#> 2 2 12 0 1 <tibble [12 × 1]>
#> 3 3 6 10 1 <tibble [6 × 1]>
#> 4 4 12 10 1 <tibble [12 × 1]>
#> 5 5 6 0 2 <tibble [6 × 1]>
#> 6 6 12 0 2 <tibble [12 × 1]>
#> 7 7 6 10 2 <tibble [6 × 1]>
#> 8 8 12 10 2 <tibble [12 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 1.51
#> 2 -0.597
#> 3 0.549
#> 4 0.477
#> 5 0.236
#> 6 -0.656
Since there are 4 possible combinations of n
and mu
, there are a total of 4 * 2 = 8 simulations generated, 2 for each possible combination.
If some combination of variables is not desired, add filtering criteria to generate()
using the same syntax as dplyr::filter()
. Here we arbitrarily filter to all combinations of n
and mu
where n
is greater than mu
, but any valid filtering criteria can be applied.
specify(a = ~ rnorm(n, mu)) %>%
define(n = c(6, 12),
mu = c(0, 10)) %>%
generate(2, n > mu)
#> full tibble
#> --------------------------
#> # A tibble: 6 × 5
#> .sim_id n mu rep sim
#> <int> <dbl> <dbl> <int> <list>
#> 1 1 6 0 1 <tibble [6 × 1]>
#> 2 2 12 0 1 <tibble [12 × 1]>
#> 3 4 12 10 1 <tibble [12 × 1]>
#> 4 5 6 0 2 <tibble [6 × 1]>
#> 5 6 12 0 2 <tibble [12 × 1]>
#> 6 8 12 10 2 <tibble [12 × 1]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 -0.597
#> 2 0.549
#> 3 0.477
#> 4 0.236
#> 5 -0.656
#> 6 1.80
To preserve reproducibility, a given simulation in the filtered version of generate
is still the same as if all possible combinations were generated. This can be tracked using the .sim_id
that generate()
includes in the output data, which uniquely identifies the simulation run given the same inputs to specify
, define
, and .reps
. Above, note that .sim_id
skips 3 and 7 See vignette("Reproducing simulations")
for more information on using generate()
to filter.
generate()
by default continues with the next iteration if an error is produced, and returns a column .sim_error
with the text of the error. Alternative error handling mechanisms are available, see vignette("Managing simulation errors")
.
fit()
models to your datafit()
uses similar syntax to generate()
: you can write arbitrary R expressions to fit models. Again, each argument should have a name and be prefixed with ~
, the tilde operator.
Below, we fit both a t-test and a linear model.
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
fit(t_test = ~ t.test(a, b),
lm = ~ lm(b ~ a))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim t_test lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 2]> <htest> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.549 0.679
#> 2 0.477 0.632
#> 3 0.236 0.437
#> 4 -0.656 -0.503
#> 5 1.80 3.13
#> 6 -0.458 0.412
#>
#> t_test[[1]]
#> --------------------------
#>
#> Welch Two Sample t-test
#>
#> data: a and b
#> t = -0.76979, df = 9.0768, p-value = 0.461
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -1.8583360 0.9137887
#> sample estimates:
#> mean of x mean of y
#> 0.3255341 0.7978077
#>
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = b ~ a)
#>
#> Coefficients:
#> (Intercept) a
#> 0.3739 1.3021
fit()
adds columns to the overall tibble to contain each type of fit. Printing the object displays a preview of the first fit object in each column.
Although the function is named fit
, any arbitrary R expression can be used. Below, one fit column will include the mean of a
, while the second will include vectors that are five larger than a
. The result is always a list-column, so any type of return object is allowed:
specify(a = ~ rnorm(6)) %>%
generate(1) %>%
fit(mean = ~ mean(a),
why_not = ~ a + 5)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim mean why_not
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 1]> <dbl [1]> <dbl [6]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#> a
#> <dbl>
#> 1 0.477
#> 2 0.236
#> 3 -0.656
#> 4 1.80
#> 5 -0.458
#> 6 0.130
#>
#> mean[[1]]
#> --------------------------
#> [1] 0.2555629
#>
#> why_not[[1]]
#> --------------------------
#> [1] 5.477269 5.236078 4.343804 6.804318 4.542280 5.129628
fit()
is computed for each individual simulated dataset, so usually you do not need to refer to the dataset itself. If a reference to the dataset is needed, use .
.
The below code is equivalent to the previous example, but explicitly referencing the dataset using .
and .$
:
specify(a = ~ rnorm(6),
b = ~ a + rnorm(6)) %>%
generate(1) %>%
## .$ and data = . not actually required here
fit(t_test = ~ t.test(.$a, .$b),
lm = ~ lm(b ~ a, data = .))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#> .sim_id rep sim t_test lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [6 × 2]> <htest> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#> a b
#> <dbl> <dbl>
#> 1 0.236 0.437
#> 2 -0.656 -0.503
#> 3 1.80 3.13
#> 4 -0.458 0.412
#> 5 0.130 1.77
#> 6 0.155 0.740
#>
#> t_test[[1]]
#> --------------------------
#>
#> Welch Two Sample t-test
#>
#> data: .$a and .$b
#> t = -1.2662, df = 8.8051, p-value = 0.2379
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -2.2239947 0.6312032
#> sample estimates:
#> mean of x mean of y
#> 0.2018126 0.9982084
#>
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = b ~ a, data = .)
#>
#> Coefficients:
#> (Intercept) a
#> 0.7276 1.3411
Sometimes data manipulation is required between generate()
and fit()
. After generate()
, run per_sim()
and then chain any dplyr
or tidyr
verbs that work on data.frame
s or tibble
s. These verbs will be applied to every individual simulated dataset.
A common use-case is needing to reshape wide to long. Consider an intervention study with a control group, an intervention group that does slightly better than the control, and a second intervention group that does much better. This is easiest to specify in wide format, with separate variables for each group and differing means by group:
= specify(control = ~ rnorm(6, mean = 0),
wide_gen intervention_1 = ~ rnorm(6, mean = 0.2),
intervention_2 = ~ rnorm(6, mean = 2)) %>%
generate(2)
wide_gen#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [6 × 3]>
#> 2 2 2 <tibble [6 × 3]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#> control intervention_1 intervention_2
#> <dbl> <dbl> <dbl>
#> 1 -0.656 0.353 2.94
#> 2 1.80 1.52 2.03
#> 3 -0.458 1.07 2.30
#> 4 0.130 1.84 1.50
#> 5 0.155 0.785 1.70
#> 6 0.201 -0.539 1.65
But to run an ANOVA, we need the outcome in one column and the group name in another column. We first run per_sim()
to indicate we want to compute on individual simulated datasets, and then we can use tidyr::pivot_longer()
to reshape each dataset into a format ready for analysis:
= wide_gen %>%
long_gen per_sim() %>%
pivot_longer(cols = everything(),
names_to = "group",
values_to = "response")
long_gen#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#> .sim_id rep sim
#> <int> <int> <list>
#> 1 1 1 <tibble [18 × 2]>
#> 2 2 2 <tibble [18 × 2]>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#> group response
#> <chr> <dbl>
#> 1 control -0.656
#> 2 intervention_1 0.353
#> 3 intervention_2 2.94
#> 4 control 1.80
#> 5 intervention_1 1.52
#> 6 intervention_2 2.03
#> 7 control -0.458
#> 8 intervention_1 1.07
#> 9 intervention_2 2.30
#> 10 control 0.130
#> 11 intervention_1 1.84
#> 12 intervention_2 1.50
#> 13 control 0.155
#> 14 intervention_1 0.785
#> 15 intervention_2 1.70
#> 16 control 0.201
#> 17 intervention_1 -0.539
#> 18 intervention_2 1.65
Each simulation is now reshaped and ready for fitting:
= long_gen %>%
long_fit fit(aov = ~ aov(response ~ group),
lm = ~ lm(response ~ group))
long_fit#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#> .sim_id rep sim aov lm
#> <int> <int> <list> <list> <list>
#> 1 1 1 <tibble [18 × 2]> <aov> <lm>
#> 2 2 2 <tibble [18 × 2]> <aov> <lm>
#>
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#> group response
#> <chr> <dbl>
#> 1 control -0.656
#> 2 intervention_1 0.353
#> 3 intervention_2 2.94
#> 4 control 1.80
#> 5 intervention_1 1.52
#> 6 intervention_2 2.03
#> 7 control -0.458
#> 8 intervention_1 1.07
#> 9 intervention_2 2.30
#> 10 control 0.130
#> 11 intervention_1 1.84
#> 12 intervention_2 1.50
#> 13 control 0.155
#> 14 intervention_1 0.785
#> 15 intervention_2 1.70
#> 16 control 0.201
#> 17 intervention_1 -0.539
#> 18 intervention_2 1.65
#>
#> aov[[1]]
#> --------------------------
#> Call:
#> aov(formula = response ~ group)
#>
#> Terms:
#> group Residuals
#> Sum of Squares 10.270758 8.853192
#> Deg. of Freedom 2 15
#>
#> Residual standard error: 0.7682531
#> Estimated effects may be unbalanced
#>
#> lm[[1]]
#> --------------------------
#>
#> Call:
#> lm(formula = response ~ group)
#>
#> Coefficients:
#> (Intercept) groupintervention_1 groupintervention_2
#> 0.1960 0.6437 1.8242
tidy_fits()
for further processingThe output of fit()
is not yet amenable to plotting or analysis. tidy_fits()
applies broom::tidy()
to each fit object and binds them together in a single tibble
:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits()
#> # A tibble: 8 × 9
#> .sim_id n rep Source term estimate std.error statistic p.value
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercept) 0.684 0.406 1.69 0.167
#> 2 1 6 1 lm a 1.26 0.526 2.40 0.0745
#> 3 2 12 1 lm (Intercept) -0.117 0.288 -0.407 0.693
#> 4 2 12 1 lm a 0.739 0.255 2.90 0.0160
#> 5 3 6 2 lm (Intercept) -0.713 0.715 -0.997 0.375
#> 6 3 6 2 lm a 0.503 0.687 0.731 0.505
#> 7 4 12 2 lm (Intercept) 0.335 0.238 1.41 0.190
#> 8 4 12 2 lm a 1.30 0.317 4.11 0.00210
All the same metaparameter information appears in the left-hand column, and all the model information from broom::tidy()
is provided. This is a convenient format for filtering, plotting, and calculating diagnostics.
If more than one kind of fit is present, tidy_fits()
simply brings them together in the same tibble
using bind_rows
; this means there may be many NA
values where one type of model has no values. In the example below, t.test
returns the column estimate1
, but lm
does not, so for the lm
rows there are NA
values.
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a),
t_test = ~ t.test(a, b)) %>%
tidy_fits()
#> # A tibble: 12 × 16
#> .sim_id n rep Source term estimate std.error statistic p.value
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Intercept) 0.688 0.385 1.79 0.149
#> 2 1 6 1 lm a 0.471 0.656 0.718 0.513
#> 3 1 6 1 t_test <NA> -0.555 NA -1.36 0.206
#> 4 2 12 1 lm (Intercept) -0.161 0.279 -0.578 0.576
#> 5 2 12 1 lm a 0.803 0.248 3.23 0.00897
#> 6 2 12 1 t_test <NA> 0.173 NA 0.340 0.737
#> 7 3 6 2 lm (Intercept) -0.939 0.610 -1.54 0.199
#> 8 3 6 2 lm a 0.361 0.579 0.623 0.567
#> 9 3 6 2 t_test <NA> 0.633 NA 0.958 0.362
#> 10 4 12 2 lm (Intercept) 0.258 0.251 1.03 0.329
#> 11 4 12 2 lm a 1.26 0.329 3.82 0.00338
#> 12 4 12 2 t_test <NA> -0.300 NA -0.697 0.495
#> # … with 7 more variables: estimate1 <dbl>, estimate2 <dbl>, parameter <dbl>,
#> # conf.low <dbl>, conf.high <dbl>, method <chr>, alternative <chr>
Any option taken by the tidier can be passed through tidy_fits()
. Below, we specify the conf.level
and conf.int
options for broom::tidy.lm()
:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
tidy_fits(conf.level = 0.99, conf.int = TRUE)
#> # A tibble: 8 × 11
#> .sim_id n rep Source term estimate std.error statistic p.value conf.low
#> <int> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm (Int… 0.720 0.498 1.45 0.222 -1.57
#> 2 1 6 1 lm a 0.450 0.754 0.596 0.583 -3.02
#> 3 2 12 1 lm (Int… -0.250 0.234 -1.07 0.310 -0.993
#> 4 2 12 1 lm a 0.664 0.217 3.06 0.0120 -0.0230
#> 5 3 6 2 lm (Int… -0.690 0.701 -0.985 0.380 -3.92
#> 6 3 6 2 lm a 0.274 0.559 0.490 0.650 -2.30
#> 7 4 12 2 lm (Int… 0.283 0.236 1.20 0.258 -0.465
#> 8 4 12 2 lm a 1.38 0.329 4.21 0.00181 0.341
#> # … with 1 more variable: conf.high <dbl>
glance_fits()
analogously provides the one-row summary provided by broom::glance()
for each simulation:
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
glance_fits()
#> # A tibble: 4 × 16
#> .sim_id n rep Source r.squared adj.r.squared sigma statistic p.value
#> <int> <dbl> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 6 1 lm 0.307 0.134 0.650 1.77 0.254
#> 2 2 12 1 lm 0.559 0.515 0.769 12.7 0.00516
#> 3 3 6 2 lm 0.146 -0.0674 1.56 0.684 0.455
#> 4 4 12 2 lm 0.574 0.531 1.01 13.5 0.00433
#> # … with 7 more variables: df <dbl>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> # deviance <dbl>, df.residual <int>, nobs <int>
apply_fits
can take any arbitrary expression (preceded by ~
) or function and apply it to each fit object. The special value .
indicates the current fit.
Below, the maximum Cook’s Distance for each simulated model fit is computed using cooks.distance()
.
specify(a = ~ rnorm(n),
b = ~ a + rnorm(n)) %>%
define(n = c(6, 12)) %>%
generate(2) %>%
fit(lm = ~ lm(b ~ a)) %>%
apply_fits(~ max(cooks.distance(.)))
#> # A tibble: 4 × 5
#> .sim_id n rep Source value
#> <int> <dbl> <int> <chr> <dbl>
#> 1 1 6 1 lm 0.826
#> 2 2 12 1 lm 0.648
#> 3 3 6 2 lm 0.705
#> 4 4 12 2 lm 0.486