library(lsasim)
packageVersion("lsasim")
[1] '2.1.2'
questionnaire_gen(n_obs, cat_prop = NULL, n_vars = NULL, n_X = NULL, n_W = NULL, cor_matrix = NULL,
cov_matrix = NULL, c_mean = NULL, c_sd = NULL, theta = FALSE, family = NULL, full_output = FALSE,
verbose = TRUE)
The function questionnaire_gen generates correlated continuous and ordinal data that resemble background questionnaire data. The only required argument is n_obs, the number of observations (e.g., test takers). The optional arguments include
cat_prop: a list of vectors, where each vector contains the cumulative proportions for the categories of a given item.
n_vars: the number of variables, including the continuous (X) and the ordinal (W) covariates as well as the latent trait (theta).
n_X: the number of continuous (X) variables.
n_W: the number of ordinal (W) variables.
cor_matrix: a possibly heterogeneous correlation matrix, consisting of polyserial correlations between continuous and ordinal variables, and polychoric correlations between ordinal variables.
cov_matrix: a covariance matrix, formatted like cor_matrix.
The arguments c_mean and c_sd are scaling parameters for continuous variables. If the logical argument theta is TRUE, the latent trait is generated as the first continuous variable and labeled "theta". If family is "gaussian", the data are generated from a multivariate normal distribution; otherwise, they are generated from the polychoric correlation matrix.
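To make the cumulative-proportion format of cat_prop concrete, the sketch below converts a hypothetical three-category item into per-category probabilities and cuts a standard normal draw at the matching thresholds. It uses base R only, and all variable names are illustrative. It shows the usual latent-threshold construction for ordinal data and is not a description of lsasim's internal code.

```r
# Cumulative proportions for a hypothetical 3-category item
# (same format as one element of cat_prop).
cum_prop <- c(0.2, 0.8, 1)

# Per-category probabilities are the successive differences.
cat_prob <- diff(c(0, cum_prop))

# Thresholds on the standard normal scale; the final cumulative
# proportion (1) corresponds to +Inf and is dropped.
thresholds <- qnorm(cum_prop[-length(cum_prop)])

# Discretize a latent standard normal draw at those thresholds.
set.seed(1)
z <- rnorm(1000)
w <- cut(z, breaks = c(-Inf, thresholds, Inf), labels = seq_along(cum_prop))

# Observed category proportions, to compare with cat_prob.
observed <- as.numeric(prop.table(table(w)))
rbind(expected = cat_prob, observed = observed)
```

A vector consisting of the single value 1, as in list(1, c(0.25, 1)), has only one category, which is how the examples in this tutorial denote a continuous variable.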
If the logical argument full_output is TRUE, the output is a list containing the questionnaire data as well as several objects that might be of interest for further analysis of the data. The output of full_output will be addressed in future tutorials.
We specify only n_obs = 100 and use a multivariate normal distribution. The generated data turn out to contain one continuous variable and four ordinal covariates with 2, 4, 3, and 5 categories, respectively.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 6 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : num -0.6178 1.0299 -0.12 0.0624 1.4585 ...
$ q2 : Factor w/ 2 levels "1","2": 2 1 1 2 1 2 2 2 2 2 ...
$ q3 : Factor w/ 4 levels "1","2","3","4": 2 4 2 2 4 2 4 1 2 4 ...
$ q4 : Factor w/ 3 levels "1","2","3": 2 1 2 2 1 1 3 3 2 2 ...
$ q5 : Factor w/ 5 levels "1","2","3","4",..: 2 1 3 2 2 1 5 4 1 3 ...
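Because the number and types of variables are drawn at random when only n_obs is given, it can help to summarize the result programmatically rather than reading the str() output. A minimal sketch, assuming bg has the structure shown above (a subject column followed by numeric and factor columns); the names n_levels, continuous_vars, and ordinal_vars are illustrative:

```r
library(lsasim)

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, family = "gaussian")

# Drop the subject identifier and count factor levels per variable.
# nlevels() returns 0 for numeric (continuous) columns.
n_levels <- vapply(bg[-1], nlevels, integer(1))
n_levels

# Split the variable names by type.
continuous_vars <- names(n_levels)[n_levels == 0]
ordinal_vars <- names(n_levels)[n_levels > 0]
```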
In addition to n_obs = 100, we specify the logical argument theta = TRUE. An additional continuous variable is generated and labeled theta. The latent trait is always placed first among the generated variables, directly after the subject column.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 7 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -1.611 -0.388 1.105 1.618 1.663 ...
$ q1 : num -0.8859 -0.0742 0.9164 -0.7751 -0.3396 ...
$ q2 : Factor w/ 2 levels "1","2": 2 2 2 2 1 1 2 2 2 2 ...
$ q3 : Factor w/ 4 levels "1","2","3","4": 4 1 1 2 4 4 1 4 4 3 ...
$ q4 : Factor w/ 3 levels "1","2","3": 1 3 2 3 1 2 2 1 1 2 ...
$ q5 : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 3 1 4 2 1 1 5 ...
We specify n_vars = 4 without fixing the item types. Four different item types are generated: one continuous (1-category) item, one 2-category item, one 4-category item, and one 5-category item.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : num 0.146 0.83 1.137 0.271 1.115 ...
$ q2 : Factor w/ 5 levels "1","2","3","4",..: 5 1 5 1 4 5 5 3 1 5 ...
$ q3 : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 2 1 1 1 ...
$ q4 : Factor w/ 4 levels "1","2","3","4": 4 4 3 4 2 4 4 4 4 1 ...
In addition to n_vars = 4, we specify the logical argument theta = TRUE. Three different item types are generated: two continuous (1-category) items (the latent trait and one continuous covariate), one 2-category item, and one 5-category item. Note that when theta = TRUE, the first continuous variable generated is always labeled theta.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_vars = 4, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -0.666 -0.937 -2.229 0.931 -1.438 ...
$ q1 : num -0.353 1.405 1.17 -0.91 0.352 ...
$ q2 : Factor w/ 5 levels "1","2","3","4",..: 4 1 4 4 4 2 5 2 5 5 ...
$ q3 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 1 1 1 ...
We generate one latent trait and three continuous variables by specifying theta = TRUE and n_X = 3. We also set n_W = 0; otherwise, a random number of ordinal variables is generated, as the second call below shows.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -0.763 -0.822 -0.404 -1.955 0.981 ...
$ q1 : num 0.444 -0.513 2.046 1.441 -0.733 ...
$ q2 : num 0.0349 0.7822 -0.1954 0.9954 -0.203 ...
$ q3 : num -0.3048 -0.3757 1.8951 1.1954 0.0676 ...
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 3, theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 10 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num 0.2258 -0.1851 -0.0877 0.0436 0.05 ...
$ q1 : num -0.609 -0.356 0.308 -1.88 -1.009 ...
$ q2 : num 0.954 0.161 1.266 -1.268 0.797 ...
$ q3 : num 0.444 0.229 -0.285 -0.659 1.169 ...
$ q4 : Factor w/ 2 levels "1","2": 1 1 1 2 1 2 1 2 2 1 ...
$ q5 : Factor w/ 4 levels "1","2","3","4": 2 1 2 1 2 4 1 3 2 1 ...
$ q6 : Factor w/ 3 levels "1","2","3": 2 2 2 1 2 3 2 1 1 2 ...
$ q7 : Factor w/ 5 levels "1","2","3","4",..: 1 1 2 5 2 2 3 5 5 3 ...
$ q8 : Factor w/ 4 levels "1","2","3","4": 4 3 3 2 3 1 4 3 3 4 ...
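The effect of leaving n_W unspecified can be checked by counting column types in the two results. The sketch below repeats both calls and tabulates them; count_types is a hypothetical helper written for this illustration and is not part of lsasim.

```r
library(lsasim)

# Hypothetical helper: count continuous (numeric) and ordinal (factor)
# columns, ignoring the subject identifier.
count_types <- function(bg) {
  vars <- bg[names(bg) != "subject"]
  c(continuous = sum(vapply(vars, is.numeric, logical(1))),
    ordinal    = sum(vapply(vars, is.factor, logical(1))))
}

set.seed(4388)
fixed_w <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE,
                             family = "gaussian")
set.seed(4388)
random_w <- questionnaire_gen(n_obs = 100, n_X = 3, theta = TRUE,
                              family = "gaussian")

count_types(fixed_w)   # theta plus three X variables, no ordinal variables
count_types(random_w)  # the number of ordinal variables depends on the seed
```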
We can also specify cat_prop = list(1, 1, 1, 1) to generate one latent trait and three continuous covariates. The length of cat_prop corresponds to the number of generated variables (including the latent trait and the continuous variables in this case).
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, cat_prop = list(1, 1, 1, 1), theta = TRUE, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num -0.763 -0.822 -0.404 -1.955 0.981 ...
$ q1 : num 0.444 -0.513 2.046 1.441 -0.733 ...
$ q2 : num 0.0349 0.7822 -0.1954 0.9954 -0.203 ...
$ q3 : num -0.3048 -0.3757 1.8951 1.1954 0.0676 ...
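The printed values here match those from the earlier call with n_X = 3 and n_W = 0, which suggests the two specifications are interchangeable for purely continuous data. A short sketch for checking this directly under the same seed (the object names bg_nx and bg_cp are illustrative):

```r
library(lsasim)

set.seed(4388)
bg_nx <- questionnaire_gen(n_obs = 100, n_X = 3, n_W = 0, theta = TRUE,
                           family = "gaussian")
set.seed(4388)
bg_cp <- questionnaire_gen(n_obs = 100, cat_prop = list(1, 1, 1, 1),
                           theta = TRUE, family = "gaussian")

# Same column layout in both data frames?
identical(names(bg_nx), names(bg_cp))

# all.equal() returns TRUE or a description of the differences.
all.equal(bg_nx, bg_cp)
```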
We generate two ordinal variables without fixing the number of categories. In this run, one 2-category variable and one 5-category variable are generated.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = 2, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 3 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : Factor w/ 2 levels "1","2": 2 1 1 1 1 1 1 2 2 1 ...
$ q2 : Factor w/ 5 levels "1","2","3","4",..: 1 4 5 3 5 4 2 1 1 1 ...
We generate one binary variable and three four-category variables by passing the number of categories of each item as a list to n_W.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = list(2, 4, 4, 4), family = "gaussian")
str(bg)
'data.frame': 100 obs. of 5 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 1 1 1 2 ...
$ q2 : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 4 4 1 4 ...
$ q3 : Factor w/ 4 levels "1","2","3","4": 3 2 3 1 1 1 3 4 1 1 ...
$ q4 : Factor w/ 4 levels "1","2","3","4": 2 1 1 2 4 4 4 4 3 4 ...
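When n_W is given as a list, each element sets the number of categories of one ordinal item. The sketch below compares the requested category counts with the factor levels of the generated columns; requested and observed are illustrative names, and the comparison assumes every category appears at least once in the sample.

```r
library(lsasim)

set.seed(4388)
requested <- list(2, 4, 4, 4)  # categories per ordinal item
bg <- questionnaire_gen(n_obs = 100, n_X = 0, n_W = requested,
                        family = "gaussian")

# Number of factor levels per generated item, excluding the subject column.
observed <- vapply(bg[-1], nlevels, integer(1))
observed

# Compare against the request, item by item.
observed == unlist(requested)
```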
We generate five variables: one latent trait, two continuous covariates, and two binary covariates. The latent trait is scaled to a mean of 500 and a standard deviation of 100, while the two continuous covariates keep a mean of 0 and a standard deviation of 1.
set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 2, n_W = list(2, 2), theta = TRUE, c_mean = c(500,
    0, 0), c_sd = c(100, 1, 1), family = "gaussian")
str(bg)
'data.frame': 100 obs. of 6 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ theta : num 515 612 578 476 437 ...
$ q1 : num 0.0731 -0.8194 -0.8648 -0.1415 0.7484 ...
$ q2 : num -0.0166 1.4975 0.596 0.4905 0.482 ...
$ q3 : Factor w/ 2 levels "1","2": 2 2 2 1 1 1 1 2 1 2 ...
$ q4 : Factor w/ 2 levels "1","2": 2 2 1 1 2 1 1 1 2 1 ...
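The scaling can be verified from the sample moments of the continuous columns. In this sketch the elements of c_mean and c_sd are taken to apply to theta first and then to the X variables, as the example above implies; with only 100 observations the sample values approximate the requested ones.

```r
library(lsasim)

set.seed(4388)
bg <- questionnaire_gen(n_obs = 100, n_X = 2, n_W = list(2, 2), theta = TRUE,
                        c_mean = c(500, 0, 0), c_sd = c(100, 1, 1),
                        family = "gaussian")

# Sample means and standard deviations of the continuous variables,
# to compare with c_mean = c(500, 0, 0) and c_sd = c(100, 1, 1).
continuous <- bg[c("theta", "q1", "q2")]
round(sapply(continuous, mean), 2)
round(sapply(continuous, sd), 2)
```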
We generate one continuous and two ordinal covariates. We specify the covariance matrix among the continuous and ordinal variables, and we set the mean of the continuous covariate to 2 with c_mean = 2. When cov_matrix is provided, c_sd is ignored.
set.seed(4388)
props <- list(1, c(0.25, 1), c(0.2, 0.8, 1))
yw_cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.8, 0.5, 0.8, 1), nrow = 3)
bg <- questionnaire_gen(n_obs = 100, cat_prop = props, cov_matrix = yw_cov, c_mean = 2, family = "gaussian")
str(bg)
'data.frame': 100 obs. of 4 variables:
$ subject: int 1 2 3 4 5 6 7 8 9 10 ...
$ q1 : num 1.878 3.746 2.938 2.386 0.768 ...
$ q2 : Factor w/ 2 levels "1","2": 1 2 2 2 1 2 2 2 1 1 ...
$ q3 : Factor w/ 3 levels "1","2","3": 1 2 2 2 2 3 1 3 2 2 ...
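Before passing a hand-written matrix as cov_matrix, it is worth confirming that it is a usable covariance matrix. The base R sketch below checks symmetry and positive definiteness for the matrix from this example and derives the implied standard deviations and correlations. The variable names other than yw_cov are illustrative, and whether questionnaire_gen performs similar checks internally is not assumed here.

```r
# Covariance matrix from the example above:
# one continuous variable followed by two ordinal variables.
yw_cov <- matrix(c(1, 0.5, 0.5, 0.5, 1, 0.8, 0.5, 0.8, 1), nrow = 3)

# A covariance matrix must be symmetric.
is_symmetric <- isSymmetric(yw_cov)

# Positive definiteness: all eigenvalues strictly positive.
eigenvalues <- eigen(yw_cov, symmetric = TRUE, only.values = TRUE)$values
is_pos_def <- all(eigenvalues > 0)

# The diagonal holds the variances; cov2cor() rescales to correlations.
implied_sd <- sqrt(diag(yw_cov))
implied_cor <- cov2cor(yw_cov)

c(symmetric = is_symmetric, positive_definite = is_pos_def)
```

Because every diagonal entry of yw_cov equals 1, the implied correlation matrix coincides with the covariance matrix in this particular example.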