simpr

`specify()` your data-generating process

specify() takes arbitrary R expressions that can be used for generating data. Each argument should have a name and be prefixed with ~, the tilde operator. Order matters: later arguments can refer to earlier arguments, but not the other way around.

Good—b specification refers to a specification, and comes after a:

specify(a = ~ runif(6), 
        b = ~ a + rnorm(6)) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>       a      b
#>   <dbl>  <dbl>
#> 1 0.532  0.600
#> 2 0.946  0.978
#> 3 0.451  1.86 
#> 4 0.249 -1.02 
#> 5 0.376  0.409
#> 6 0.824  0.433

Error—b specification refers to a specification, but comes before a, so generate() doesn’t know what a is:

specify(b = ~ a + rnorm(6),
        a = ~ runif(6)) %>% 
  generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#>   .sim_id   rep sim    .sim_error                                   
#>     <int> <int> <list> <chr>                                        
#> 1       1     1 <NULL> "Error in eval_fun(): object 'a' not found\n"

All arguments must imply the same number of rows. Arguments that imply 1 row are recycled.

OK—both a and b imply 6 rows:

specify(a = ~ runif(6), 
        b = ~ rnorm(6)) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>       a       b
#>   <dbl>   <dbl>
#> 1 0.376  1.41  
#> 2 0.824 -1.27  
#> 3 0.527  0.0321
#> 4 0.725 -0.392 
#> 5 0.513  0.205 
#> 6 0.149 -0.186

OK—a implies 1 row and b implies 6 rows, so a is recycled:

specify(a = ~ runif(1), 
        b = ~ rnorm(6)) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>       a       b
#>   <dbl>   <dbl>
#> 1 0.527  0.597 
#> 2 0.527 -1.04  
#> 3 0.527 -0.0124
#> 4 0.527 -0.0558
#> 5 0.527  0.457 
#> 6 0.527 -0.578

Error—a implies 2 rows and b implies 6 rows:

specify(a = ~ runif(2), 
        b = ~ rnorm(6)) %>% 
  generate(1)
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#>   .sim_id   rep sim    .sim_error                                               
#>     <int> <int> <list> <chr>                                                    
#> 1       1     1 <NULL> "Error in `stop_vctrs()`:\n! Can't recycle `..1` (size 2…

Using x as an argument to specify() is not recommended, because for technical reasons x is always placed as the first argument. This means that if x refers to prior variables it will return an error:

specify(y = ~ runif(6),
        x = ~ y + runif(6)) %>% 
  generate(1)
#> Formula specification for 'x' detected. Assuming 'x' is the first formula.
#> 
#> To hide this message, or to avoid moving this formula first, use a different variable name.
#> Warning in create_sim_results(specs = specs, x = x[c("meta_info", "specify", :
#> Simulation produced errors. See column '.sim_error'.
#> tibble
#> --------------------------
#> # A tibble: 1 × 4
#>   .sim_id   rep sim    .sim_error                                   
#>     <int> <int> <list> <chr>                                        
#> 1       1     1 <NULL> "Error in eval_fun(): object 'y' not found\n"

The same specification works fine when x is renamed:

specify(y = ~ runif(6),
        a = ~ y + runif(6)) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>       y     a
#>   <dbl> <dbl>
#> 1 0.102 0.683
#> 2 0.478 0.663
#> 3 0.513 0.939
#> 4 0.676 1.28 
#> 5 0.348 0.976
#> 6 0.282 0.686

Advanced: expressions that generate multiple columns

specify() accepts expressions that generate multiple columns simultaneously in a matrix, data.frame, or tibble. By default, the column names in the output append a number to the variable name.

Here’s an example using MASS::mvrnorm(), which returns draws from the multivariate normal distribution as a matrix. MASS::mvrnorm() determines the number of columns for the output data from the length of mu and the dimensions of the variance-covariance matrix Sigma.

specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3)))) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 3]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#>      a_1    a_2     a_3
#>    <dbl>  <dbl>   <dbl>
#> 1  0.477 -0.445  0.0321
#> 2  0.236 -0.280 -0.392 
#> 3 -0.656  0.841  0.205 
#> 4  1.80   1.51  -0.186 
#> 5 -0.458 -0.597  0.327 
#> 6  0.130  0.549  0.199

The argument was named a in specify(), so generate() creates three variables named a_1, a_2, and a_3.

You can change the separator between the argument name and the number via .sep. Here, we change it from the default "_" to ".":

specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3))),
        .sep = ".") %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 3]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#>      a.1    a.2    a.3
#>    <dbl>  <dbl>  <dbl>
#> 1  0.236 -0.280 -0.392
#> 2 -0.656  0.841  0.205
#> 3  1.80   1.51  -0.186
#> 4 -0.458 -0.597  0.327
#> 5  0.130  0.549  0.199
#> 6  0.155  0.477 -0.445

Alternatively, you can give a two-sided formula to set names. The argument name is ignored, and the left hand side must use c or cbind.

specify(y = c(a, b, c) ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3)))) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 3]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#>        a      b      c
#>    <dbl>  <dbl>  <dbl>
#> 1 -0.656  0.841  0.205
#> 2  1.80   1.51  -0.186
#> 3 -0.458 -0.597  0.327
#> 4  0.130  0.549  0.199
#> 5  0.155  0.477 -0.445
#> 6  0.201  0.236 -0.280

If your expression already produces column names, those are used by default. The argument name is again ignored:

specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3))) %>% 
          magrittr::set_colnames(c("d", "e", "f"))) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 3]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#>        d      e      f
#>    <dbl>  <dbl>  <dbl>
#> 1  1.80   1.51  -0.186
#> 2 -0.458 -0.597  0.327
#> 3  0.130  0.549  0.199
#> 4  0.155  0.477 -0.445
#> 5  0.201  0.236 -0.280
#> 6  0.153 -0.656  0.841

This is useful for dealing with functions from other packages that already provide informative names (e.g., lavaan::simulateData()). You can turn this behavior off with .use_names = FALSE.

Whatever method you use, you can still refer to these generated variables in subsequent arguments to specify():

specify(a = ~ MASS::mvrnorm(6, 
                            mu = rep(0, 3),
                            Sigma = diag(rep(1, 3))),
        b = ~ a_1 - a_2) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 4]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 4
#>      a_1    a_2    a_3       b
#>    <dbl>  <dbl>  <dbl>   <dbl>
#> 1 -0.458 -0.597  0.327  0.139 
#> 2  0.130  0.549  0.199 -0.420 
#> 3  0.155  0.477 -0.445 -0.323 
#> 4  0.201  0.236 -0.280 -0.0347
#> 5  0.153 -0.656  0.841  0.810 
#> 6  1.32   1.80   1.51  -0.480

`define()` parameters that you want to systematically vary

define() creates metaparameters (also called simulation factors): values that you want to systematically vary. An obvious choice is sample size.

Instead of writing a number for the n argument of rnorm, we write a placeholder value samp_size (this can be any valid R name), and we write a corresponding argument in define() that contains the possible values that samp_size can take on:

specify(a = ~ rnorm(samp_size)) %>% 
  define(samp_size = c(10, 20)) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 4
#>   .sim_id samp_size   rep sim              
#>     <int>     <dbl> <int> <list>           
#> 1       1        10     1 <tibble [10 × 1]>
#> 2       2        20     1 <tibble [20 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#>         a
#>     <dbl>
#>  1  0.199
#>  2 -0.445
#>  3 -0.280
#>  4  0.841
#>  5  1.51 
#>  6 -0.597
#>  7  0.549
#>  8  0.477
#>  9  0.236
#> 10 -0.656

Each argument to define() is a vector or list with the desired values to be used in the expressions in specify(). specify() expressions can refer to the names of define arguments, and generate() will substitute the possible values that argument when generating data.

When define() has multiple arguments, each possible combination is generated:

specify(a = ~ rnorm(samp_size, mu)) %>% 
  define(samp_size = c(10, 20),
         mu = c(0, 10)) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 4 × 5
#>   .sim_id samp_size    mu   rep sim              
#>     <int>     <dbl> <dbl> <int> <list>           
#> 1       1        10     0     1 <tibble [10 × 1]>
#> 2       2        20     0     1 <tibble [20 × 1]>
#> 3       3        10    10     1 <tibble [10 × 1]>
#> 4       4        20    10     1 <tibble [20 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 10 × 1
#>         a
#>     <dbl>
#>  1 -0.445
#>  2 -0.280
#>  3  0.841
#>  4  1.51 
#>  5 -0.597
#>  6  0.549
#>  7  0.477
#>  8  0.236
#>  9 -0.656
#> 10  1.80

(If not all combinations are desired, see options for filtering at the generate() step, below.)

define() can also take lists for any type of value used in specify. For instance, the argument s is defined as a list with two variables representing two different possible correlation matrices, and we use that placeholder value s in the specify() statement:


specify(a = ~ MASS::mvrnorm(6, rep(0, 2), Sigma = s)) %>% 
  define(s = list(independent = diag(rep(1, 2)),
                  dependent = matrix(c(1, 0.5, 0.5, 1), nrow = 2))) %>% 
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#>   .sim_id s_index       rep s             sim             
#>     <int> <chr>       <int> <list>        <list>          
#> 1       1 independent     1 <dbl [2 × 2]> <tibble [6 × 2]>
#> 2       2 dependent       1 <dbl [2 × 2]> <tibble [6 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>      a_1    a_2
#>    <dbl>  <dbl>
#> 1 -0.236 -0.280
#> 2  0.656  0.841
#> 3 -1.80   1.51 
#> 4  0.458 -0.597
#> 5 -0.130  0.549
#> 6 -0.155  0.477

In the output, simpr creates a column s_index using the names of the list elements of s to make the results easier to view and filter.

define() also supports lists of functions. Here, the specify command has a placeholder function distribution, where distribution is defined to be either rnorm() or rlnorm() (the lognormal distribution):

specify(y = ~ distribution(6)) %>%
  define(distribution = list(normal = rnorm,
                             lognormal = rlnorm)) %>%
  generate(1)
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#>   .sim_id distribution_index   rep distribution sim             
#>     <int> <chr>              <int> <list>       <list>          
#> 1       1 normal                 1 <fn>         <tibble [6 × 1]>
#> 2       2 lognormal              1 <fn>         <tibble [6 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        y
#>    <dbl>
#> 1  0.841
#> 2  1.51 
#> 3 -0.597
#> 4  0.549
#> 5  0.477
#> 6  0.236

`generate()` the simulation data

The main argument to generate() is .reps, the number of repetitions for each combination of metaparameters in define().

specify(a = ~ rnorm(n, mu)) %>% 
  define(n = c(6, 12),
         mu = c(0, 10)) %>% 
  generate(2)
#> full tibble
#> --------------------------
#> # A tibble: 8 × 5
#>   .sim_id     n    mu   rep sim              
#>     <int> <dbl> <dbl> <int> <list>           
#> 1       1     6     0     1 <tibble [6 × 1]> 
#> 2       2    12     0     1 <tibble [12 × 1]>
#> 3       3     6    10     1 <tibble [6 × 1]> 
#> 4       4    12    10     1 <tibble [12 × 1]>
#> 5       5     6     0     2 <tibble [6 × 1]> 
#> 6       6    12     0     2 <tibble [12 × 1]>
#> 7       7     6    10     2 <tibble [6 × 1]> 
#> 8       8    12    10     2 <tibble [12 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1  1.51 
#> 2 -0.597
#> 3  0.549
#> 4  0.477
#> 5  0.236
#> 6 -0.656

Since there are 4 possible combinations of n and mu, there are a total of 4 * 2 = 8 simulations generated, 2 for each possible combination.

If some combination of variables is not desired, add filtering criteria to generate() using the same syntax as dplyr::filter(). Here we arbitrarily filter to all combinations of n and mu where n is greater than mu, but any valid filtering criteria can be applied.

specify(a = ~ rnorm(n, mu)) %>% 
  define(n = c(6, 12),
         mu = c(0, 10)) %>% 
  generate(2, n > mu)
#> full tibble
#> --------------------------
#> # A tibble: 6 × 5
#>   .sim_id     n    mu   rep sim              
#>     <int> <dbl> <dbl> <int> <list>           
#> 1       1     6     0     1 <tibble [6 × 1]> 
#> 2       2    12     0     1 <tibble [12 × 1]>
#> 3       4    12    10     1 <tibble [12 × 1]>
#> 4       5     6     0     2 <tibble [6 × 1]> 
#> 5       6    12     0     2 <tibble [12 × 1]>
#> 6       8    12    10     2 <tibble [12 × 1]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1 -0.597
#> 2  0.549
#> 3  0.477
#> 4  0.236
#> 5 -0.656
#> 6  1.80

To preserve reproducibility, a given simulation in the filtered version of generate is still the same as if all possible combinations were generated. This can be tracked using the .sim_id that generate() includes in the output data, which uniquely identifies the simulation run given the same inputs to specify, define, and .reps. Above, note that .sim_id skips 3 and 7 See vignette("Reproducing simulations") for more information on using generate() to filter.

generate() by default continues with the next iteration if an error is produced, and returns a column .sim_error with the text of the error. Alternative error handling mechanisms are available, see vignette("Managing simulation errors").

`fit()` models to your data

fit() uses similar syntax to generate(): you can write arbitrary R expressions to fit models. Again, each argument should have a name and be prefixed with ~, the tilde operator.

Below, we fit both a t-test and a linear model.

specify(a = ~ rnorm(6),
        b = ~ a + rnorm(6)) %>% 
  generate(1) %>% 
  fit(t_test = ~ t.test(a, b),
      lm = ~ lm(b ~ a))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#>   .sim_id   rep sim              t_test  lm    
#>     <int> <int> <list>           <list>  <list>
#> 1       1     1 <tibble [6 × 2]> <htest> <lm>  
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>        a      b
#>    <dbl>  <dbl>
#> 1  0.549  0.679
#> 2  0.477  0.632
#> 3  0.236  0.437
#> 4 -0.656 -0.503
#> 5  1.80   3.13 
#> 6 -0.458  0.412
#> 
#> t_test[[1]]
#> --------------------------
#> 
#>  Welch Two Sample t-test
#> 
#> data:  a and b
#> t = -0.76979, df = 9.0768, p-value = 0.461
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -1.8583360  0.9137887
#> sample estimates:
#> mean of x mean of y 
#> 0.3255341 0.7978077 
#> 
#> 
#> lm[[1]]
#> --------------------------
#> 
#> Call:
#> lm(formula = b ~ a)
#> 
#> Coefficients:
#> (Intercept)            a  
#>      0.3739       1.3021

fit() adds columns to the overall tibble to contain each type of fit. Printing the object displays a preview of the first fit object in each column.

Although the function is named fit, any arbitrary R expression can be used. Below, one fit column will include the mean of a, while the second will include vectors that are five larger than a. The result is always a list-column, so any type of return object is allowed:

specify(a = ~ rnorm(6)) %>% 
  generate(1) %>% 
  fit(mean = ~ mean(a),
      why_not = ~ a + 5)
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#>   .sim_id   rep sim              mean      why_not  
#>     <int> <int> <list>           <list>    <list>   
#> 1       1     1 <tibble [6 × 1]> <dbl [1]> <dbl [6]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 1
#>        a
#>    <dbl>
#> 1  0.477
#> 2  0.236
#> 3 -0.656
#> 4  1.80 
#> 5 -0.458
#> 6  0.130
#> 
#> mean[[1]]
#> --------------------------
#> [1] 0.2555629
#> 
#> why_not[[1]]
#> --------------------------
#> [1] 5.477269 5.236078 4.343804 6.804318 4.542280 5.129628

fit() is computed for each individual simulated dataset, so usually you do not need to refer to the dataset itself. If a reference to the dataset is needed, use ..

The below code is equivalent to the previous example, but explicitly referencing the dataset using . and .$:

specify(a = ~ rnorm(6),
        b = ~ a + rnorm(6)) %>% 
  generate(1) %>% 
  ## .$ and data = . not actually required here
  fit(t_test = ~ t.test(.$a, .$b),
      lm = ~ lm(b ~ a, data = .))
#> full tibble
#> --------------------------
#> # A tibble: 1 × 5
#>   .sim_id   rep sim              t_test  lm    
#>     <int> <int> <list>           <list>  <list>
#> 1       1     1 <tibble [6 × 2]> <htest> <lm>  
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 2
#>        a      b
#>    <dbl>  <dbl>
#> 1  0.236  0.437
#> 2 -0.656 -0.503
#> 3  1.80   3.13 
#> 4 -0.458  0.412
#> 5  0.130  1.77 
#> 6  0.155  0.740
#> 
#> t_test[[1]]
#> --------------------------
#> 
#>  Welch Two Sample t-test
#> 
#> data:  .$a and .$b
#> t = -1.2662, df = 8.8051, p-value = 0.2379
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#>  -2.2239947  0.6312032
#> sample estimates:
#> mean of x mean of y 
#> 0.2018126 0.9982084 
#> 
#> 
#> lm[[1]]
#> --------------------------
#> 
#> Call:
#> lm(formula = b ~ a, data = .)
#> 
#> Coefficients:
#> (Intercept)            a  
#>      0.7276       1.3411

Advanced: pre-fit data munging with `per_sim()

Sometimes data manipulation is required between generate() and fit(). After generate(), run per_sim() and then chain any dplyr or tidyr verbs that work on data.frames or tibbles. These verbs will be applied to every individual simulated dataset.

A common use-case is needing to reshape wide to long. Consider an intervention study with a control group, an intervention group that does slightly better than the control, and a second intervention group that does much better. This is easiest to specify in wide format, with separate variables for each group and differing means by group:

wide_gen = specify(control = ~ rnorm(6, mean = 0),
        intervention_1 = ~ rnorm(6, mean = 0.2),
        intervention_2 = ~ rnorm(6, mean = 2)) %>% 
  generate(2) 

wide_gen
#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#>   .sim_id   rep sim             
#>     <int> <int> <list>          
#> 1       1     1 <tibble [6 × 3]>
#> 2       2     2 <tibble [6 × 3]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 6 × 3
#>   control intervention_1 intervention_2
#>     <dbl>          <dbl>          <dbl>
#> 1  -0.656          0.353           2.94
#> 2   1.80           1.52            2.03
#> 3  -0.458          1.07            2.30
#> 4   0.130          1.84            1.50
#> 5   0.155          0.785           1.70
#> 6   0.201         -0.539           1.65

But to run an ANOVA, we need the outcome in one column and the group name in another column. We first run per_sim() to indicate we want to compute on individual simulated datasets, and then we can use tidyr::pivot_longer() to reshape each dataset into a format ready for analysis:

long_gen = wide_gen %>%  
  per_sim() %>% 
  pivot_longer(cols = everything(),
               names_to = "group", 
               values_to = "response")

long_gen
#> full tibble
#> --------------------------
#> # A tibble: 2 × 3
#>   .sim_id   rep sim              
#>     <int> <int> <list>           
#> 1       1     1 <tibble [18 × 2]>
#> 2       2     2 <tibble [18 × 2]>
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#>    group          response
#>    <chr>             <dbl>
#>  1 control          -0.656
#>  2 intervention_1    0.353
#>  3 intervention_2    2.94 
#>  4 control           1.80 
#>  5 intervention_1    1.52 
#>  6 intervention_2    2.03 
#>  7 control          -0.458
#>  8 intervention_1    1.07 
#>  9 intervention_2    2.30 
#> 10 control           0.130
#> 11 intervention_1    1.84 
#> 12 intervention_2    1.50 
#> 13 control           0.155
#> 14 intervention_1    0.785
#> 15 intervention_2    1.70 
#> 16 control           0.201
#> 17 intervention_1   -0.539
#> 18 intervention_2    1.65

Each simulation is now reshaped and ready for fitting:

long_fit = long_gen %>% 
  fit(aov = ~ aov(response ~ group),
      lm = ~ lm(response ~ group))

long_fit
#> full tibble
#> --------------------------
#> # A tibble: 2 × 5
#>   .sim_id   rep sim               aov    lm    
#>     <int> <int> <list>            <list> <list>
#> 1       1     1 <tibble [18 × 2]> <aov>  <lm>  
#> 2       2     2 <tibble [18 × 2]> <aov>  <lm>  
#> 
#> sim[[1]]
#> --------------------------
#> # A tibble: 18 × 2
#>    group          response
#>    <chr>             <dbl>
#>  1 control          -0.656
#>  2 intervention_1    0.353
#>  3 intervention_2    2.94 
#>  4 control           1.80 
#>  5 intervention_1    1.52 
#>  6 intervention_2    2.03 
#>  7 control          -0.458
#>  8 intervention_1    1.07 
#>  9 intervention_2    2.30 
#> 10 control           0.130
#> 11 intervention_1    1.84 
#> 12 intervention_2    1.50 
#> 13 control           0.155
#> 14 intervention_1    0.785
#> 15 intervention_2    1.70 
#> 16 control           0.201
#> 17 intervention_1   -0.539
#> 18 intervention_2    1.65 
#> 
#> aov[[1]]
#> --------------------------
#> Call:
#>    aov(formula = response ~ group)
#> 
#> Terms:
#>                     group Residuals
#> Sum of Squares  10.270758  8.853192
#> Deg. of Freedom         2        15
#> 
#> Residual standard error: 0.7682531
#> Estimated effects may be unbalanced
#> 
#> lm[[1]]
#> --------------------------
#> 
#> Call:
#> lm(formula = response ~ group)
#> 
#> Coefficients:
#>         (Intercept)  groupintervention_1  groupintervention_2  
#>              0.1960               0.6437               1.8242

`tidy_fits()` for further processing

The output of fit() is not yet amenable to plotting or analysis. tidy_fits() applies broom::tidy() to each fit object and binds them together in a single tibble:

specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  tidy_fits()
#> # A tibble: 8 × 9
#>   .sim_id     n   rep Source term        estimate std.error statistic p.value
#>     <int> <dbl> <int> <chr>  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1       1     6     1 lm     (Intercept)    0.684     0.406     1.69  0.167  
#> 2       1     6     1 lm     a              1.26      0.526     2.40  0.0745 
#> 3       2    12     1 lm     (Intercept)   -0.117     0.288    -0.407 0.693  
#> 4       2    12     1 lm     a              0.739     0.255     2.90  0.0160 
#> 5       3     6     2 lm     (Intercept)   -0.713     0.715    -0.997 0.375  
#> 6       3     6     2 lm     a              0.503     0.687     0.731 0.505  
#> 7       4    12     2 lm     (Intercept)    0.335     0.238     1.41  0.190  
#> 8       4    12     2 lm     a              1.30      0.317     4.11  0.00210

All the same metaparameter information appears in the left-hand column, and all the model information from broom::tidy() is provided. This is a convenient format for filtering, plotting, and calculating diagnostics.

If more than one kind of fit is present, tidy_fits() simply brings them together in the same tibble using bind_rows; this means there may be many NA values where one type of model has no values. In the example below, t.test returns the column estimate1, but lm does not, so for the lm rows there are NA values.

specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a),
      t_test = ~ t.test(a, b)) %>% 
  tidy_fits()
#> # A tibble: 12 × 16
#>    .sim_id     n   rep Source term        estimate std.error statistic p.value
#>      <int> <dbl> <int> <chr>  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#>  1       1     6     1 lm     (Intercept)    0.688     0.385     1.79  0.149  
#>  2       1     6     1 lm     a              0.471     0.656     0.718 0.513  
#>  3       1     6     1 t_test <NA>          -0.555    NA        -1.36  0.206  
#>  4       2    12     1 lm     (Intercept)   -0.161     0.279    -0.578 0.576  
#>  5       2    12     1 lm     a              0.803     0.248     3.23  0.00897
#>  6       2    12     1 t_test <NA>           0.173    NA         0.340 0.737  
#>  7       3     6     2 lm     (Intercept)   -0.939     0.610    -1.54  0.199  
#>  8       3     6     2 lm     a              0.361     0.579     0.623 0.567  
#>  9       3     6     2 t_test <NA>           0.633    NA         0.958 0.362  
#> 10       4    12     2 lm     (Intercept)    0.258     0.251     1.03  0.329  
#> 11       4    12     2 lm     a              1.26      0.329     3.82  0.00338
#> 12       4    12     2 t_test <NA>          -0.300    NA        -0.697 0.495  
#> # … with 7 more variables: estimate1 <dbl>, estimate2 <dbl>, parameter <dbl>,
#> #   conf.low <dbl>, conf.high <dbl>, method <chr>, alternative <chr>

Any option taken by the tidier can be passed through tidy_fits(). Below, we specify the conf.level and conf.int options for broom::tidy.lm():

specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  tidy_fits(conf.level = 0.99, conf.int = TRUE)
#> # A tibble: 8 × 11
#>   .sim_id     n   rep Source term  estimate std.error statistic p.value conf.low
#>     <int> <dbl> <int> <chr>  <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>
#> 1       1     6     1 lm     (Int…    0.720     0.498     1.45  0.222    -1.57  
#> 2       1     6     1 lm     a        0.450     0.754     0.596 0.583    -3.02  
#> 3       2    12     1 lm     (Int…   -0.250     0.234    -1.07  0.310    -0.993 
#> 4       2    12     1 lm     a        0.664     0.217     3.06  0.0120   -0.0230
#> 5       3     6     2 lm     (Int…   -0.690     0.701    -0.985 0.380    -3.92  
#> 6       3     6     2 lm     a        0.274     0.559     0.490 0.650    -2.30  
#> 7       4    12     2 lm     (Int…    0.283     0.236     1.20  0.258    -0.465 
#> 8       4    12     2 lm     a        1.38      0.329     4.21  0.00181   0.341 
#> # … with 1 more variable: conf.high <dbl>

glance_fits() analogously provides the one-row summary provided by broom::glance() for each simulation:

specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  glance_fits()
#> # A tibble: 4 × 16
#>   .sim_id     n   rep Source r.squared adj.r.squared sigma statistic p.value
#>     <int> <dbl> <int> <chr>      <dbl>         <dbl> <dbl>     <dbl>   <dbl>
#> 1       1     6     1 lm         0.307        0.134  0.650     1.77  0.254  
#> 2       2    12     1 lm         0.559        0.515  0.769    12.7   0.00516
#> 3       3     6     2 lm         0.146       -0.0674 1.56      0.684 0.455  
#> 4       4    12     2 lm         0.574        0.531  1.01     13.5   0.00433
#> # … with 7 more variables: df <dbl>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>, nobs <int>

apply_fits can take any arbitrary expression (preceded by ~) or function and apply it to each fit object. The special value . indicates the current fit.

Below, the maximum Cook’s Distance for each simulated model fit is computed using cooks.distance().

specify(a = ~ rnorm(n),
        b = ~ a + rnorm(n)) %>% 
  define(n = c(6, 12)) %>% 
  generate(2) %>% 
  fit(lm = ~ lm(b ~ a)) %>% 
  apply_fits(~ max(cooks.distance(.)))
#> # A tibble: 4 × 5
#>   .sim_id     n   rep Source value
#>     <int> <dbl> <int> <chr>  <dbl>
#> 1       1     6     1 lm     0.826
#> 2       2    12     1 lm     0.648
#> 3       3     6     2 lm     0.705
#> 4       4    12     2 lm     0.486

simpr

specify() your data-generating process

Advanced: expressions that generate multiple columns

define() parameters that you want to systematically vary

generate() the simulation data