In the figure below, you can see a hypothetical structural model with its standardized loadings and path coefficients.
Suppose you need to simulate multivariate normal data based on this model, but you do not know the error variances and the latent disturbance variances needed to make your model produce standardized data. It is often difficult to find such values algebraically, and instead they must be found iteratively.
The simstandard package finds the standardized variances and creates standardized multivariate normal data using lavaan syntax. It can also create latent variable scores, error terms, disturbance terms, estimated factor scores, and equally weighted composite scores for each latent variable.
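To make the iterative search concrete, here is a minimal base-R sketch for the model in this tutorial. It is not simstandard's actual implementation, and the object names (v, A, total, s) are assumptions made for illustration. It starts every residual variance at 1 and repeatedly resets each one to 1 minus the variance currently explained by the other variables.

```r
# Variable order: six indicators, then the two latent variables
v <- c("A1", "A2", "A3", "B1", "B2", "B3", "A", "B")

# Asymmetric paths: rows are effects, columns are causes
A <- matrix(0, 8, 8, dimnames = list(v, v))
A[c("A1", "A2", "A3", "B1"), "A"] <- c(0.7, 0.8, 0.9, 0.3)
A[c("B1", "B2", "B3"), "B"] <- c(0.7, 0.8, 0.9)
A["B", "A"] <- 0.6

# Total effects of each variable (column) on each variable (row)
total <- solve(diag(8) - A)

# Residual variances, all started at 1 (assumes no residual covariances)
s <- setNames(rep(1, 8), v)
for (i in 1:20) {
  # Implied total variance of each variable given the current residuals
  implied <- rowSums(sweep(total^2, 2, s, `*`))
  # Keep the explained part, reset the residual so the total equals 1
  s <- 1 - (implied - s)
}
round(s, 3)
```

For a recursive model like this one, the values settle after a few passes. The residual variance for B1 should come out as 0.168, the same value that appears in the completed model syntax later in this tutorial.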
library(simstandard)
library(lavaan)
library(knitr)
library(dplyr)
library(ggplot2)
library(tibble)
library(tidyr)
# lavaan syntax for model
m <- "
A =~ 0.7 * A1 + 0.8 * A2 + 0.9 * A3 + 0.3 * B1
B =~ 0.7 * B1 + 0.8 * B2 + 0.9 * B3
B ~ 0.6 * A
"
# Simulate data
d <- sim_standardized(m, n = 100000)
# Display First 6 rows
head(d)
A1 | A2 | A3 | B1 | B2 | B3 | A | B | e_A1 | e_A2 | e_A3 | e_B1 | e_B2 | e_B3 | d_B |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.51 | 0.88 | 1.03 | 2.38 | 3.58 | 2.87 | 1.31 | 2.79 | 0.60 | -0.17 | -0.15 | 0.04 | 1.35 | 0.36 | 2.00 |
1.57 | 0.59 | 0.51 | 0.33 | 0.51 | 1.12 | 1.05 | 0.68 | 0.83 | -0.26 | -0.43 | -0.46 | -0.03 | 0.51 | 0.05 |
-0.50 | 0.59 | 0.76 | 0.36 | -0.31 | 0.70 | 0.04 | 0.17 | -0.52 | 0.56 | 0.73 | 0.23 | -0.45 | 0.55 | 0.15 |
-0.27 | -0.94 | -0.76 | -0.29 | 0.22 | -0.68 | -0.86 | -0.46 | 0.33 | -0.26 | 0.01 | 0.29 | 0.58 | -0.27 | 0.06 |
-0.02 | -0.81 | 0.45 | -0.23 | 0.72 | 0.20 | 1.04 | -0.27 | -0.75 | -1.65 | -0.49 | -0.35 | 0.94 | 0.44 | -0.89 |
-1.44 | -0.19 | -0.79 | -1.13 | -0.44 | -0.82 | -0.97 | -0.67 | -0.77 | 0.58 | 0.07 | -0.37 | 0.09 | -0.22 | -0.09 |
Let’s make a function to display correlation and covariance matrices:
ggcor <- function(d) {
  require(ggplot2)
  as.data.frame(d) %>%
    tibble::rownames_to_column("rowname") %>%
    tidyr::pivot_longer(-rowname, names_to = "colname", values_to = "r") %>%
    dplyr::mutate(rowname = forcats::fct_inorder(rowname) %>% forcats::fct_rev()) %>%
    dplyr::mutate(colname = factor(colname,
                                   levels = rev(levels(rowname)))) %>%
    ggplot(aes(colname, rowname, fill = r)) +
    geom_tile(color = "gray90") +
    geom_text(aes(label = formatC(r, digits = 2, format = "f") %>%
                    stringr::str_replace_all("0\\.",".") %>%
                    stringr::str_replace_all("1.00","1")),
              color = "white",
              fontface = "bold",
              family = "serif") +
    scale_fill_gradient2(NULL,
                         na.value = "gray20",
                         limits = c(-1.01, 1.01),
                         high = "#924552",
                         low = "#293999") +
    coord_equal() +
    scale_x_discrete(NULL,position = "top") +
    scale_y_discrete(NULL) +
    theme_light(base_family = "serif", base_size = 14)
}
Because the data are standardized, the covariance matrix of the observed and latent variables should be nearly identical to a correlation matrix. The error and disturbance terms are not standardized.
cov(d) %>%
ggcor
To return only the observed variables, set latent and errors to FALSE:
d <- sim_standardized(m,
                      n = 100000,
                      latent = FALSE,
                      errors = FALSE)
# Display First 6 rows
head(d)
A1 | A2 | A3 | B1 | B2 | B3 |
---|---|---|---|---|---|
1.61 | 1.15 | 2.06 | 0.90 | -0.05 | -0.08 |
1.46 | 1.50 | 0.68 | 0.12 | -0.48 | 0.34 |
-0.89 | -1.19 | -1.22 | -0.52 | 0.40 | -0.73 |
0.13 | -1.55 | -0.79 | -1.00 | -0.17 | -1.18 |
-0.93 | -1.64 | -0.96 | -1.23 | -1.51 | -1.01 |
-0.43 | -0.24 | -0.10 | -0.87 | -0.78 | -1.21 |
lavaan::simulateData
I love the lavaan package. However, one aspect of one function in lavaan is not quite right yet. lavaan’s simulateData function is known to generate non-standardized data, even when the standardized parameter is set to TRUE. See how it creates variable Y with a variance higher than 1.
test_model <- "
Y ~ -.75 * X_1 + .25 * X_2
X =~ .75 * X_1 + .75 * X_2
"
library(lavaan)
d_lavaan <- simulateData(
model = test_model,
sample.nobs = 100000,
standardized = TRUE)
cov(d_lavaan) %>%
ggcor
With the same test model, simstandard will calculate variables with variances of 1.
sim_standardized(test_model,
n = 100000,
errors = FALSE) %>%
cov %>%
ggcor()
You can inspect the matrices that simstandard uses to create the data by calling sim_standardized_matrices.
matrices <- sim_standardized_matrices(m)
The A matrix contains all the asymmetric path coefficients (i.e., the loadings and the structural coefficients). These coefficients are specified in the lavaan model syntax.
matrices$RAM_matrices$A %>%
ggcor()
The S matrix contains all the symmetric path coefficients (i.e., the variances and correlations of the observed and latent variables). For endogenous variables, the variances and correlations refer to the variance and correlations of the variable’s associated error or disturbance term. In this case, A is the only exogenous variable, and thus its variance on the diagonal of the S matrix is 1.
matrices$RAM_matrices$S %>%
ggcor()
Thus, we can use these results to fill in the missing values in the path diagram at the beginning of this tutorial.
If you want to estimate factor scores using the regression method (i.e., Thurstone’s method), set factor_scores to TRUE. All scores ending in FS are factor score estimates.
m <- "
A =~ 0.9 * A1 + 0.8 * A2 + 0.7 * A3
"
sim_standardized(
  m,
  n = 100000,
  factor_scores = TRUE
) %>%
  head()
A1 | A2 | A3 | A | e_A1 | e_A2 | e_A3 | A_FS |
---|---|---|---|---|---|---|---|
-0.19 | -0.19 | -0.29 | 0.04 | -0.22 | -0.22 | -0.32 | -0.21 |
-1.75 | -0.50 | -0.41 | -1.50 | -0.40 | 0.70 | 0.64 | -1.24 |
-0.96 | -0.85 | -2.47 | -1.10 | 0.03 | 0.03 | -1.70 | -1.23 |
0.84 | 1.37 | -0.98 | 1.30 | -0.34 | 0.32 | -1.89 | 0.71 |
-0.78 | -0.41 | -1.96 | -1.31 | 0.40 | 0.63 | -1.05 | -0.91 |
1.03 | -0.24 | 0.50 | 0.47 | 0.61 | -0.61 | 0.18 | 0.63 |
Suppose you have some new data and wish to add estimated factor scores to it. The add_factor_scores function will take your data and return your data with the estimated factor scores added to it.
d <- tibble::tribble(
  ~A1, ~A2, ~A3,
  2L, 2.5, 1.3,
  -1L, -1.5, -2.1
)
add_factor_scores(d, m)
A1 | A2 | A3 | A_FS |
---|---|---|---|
2 | 2.5 | 1.3 | 2.1 |
-1 | -1.5 | -2.1 | -1.4 |
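The factor scores in the table above can be reproduced by hand. The following base-R sketch assumes that the regression-method weights are the inverse of the model-implied indicator correlation matrix multiplied by the loadings. The names lambda, Sigma, w, and X are illustrative and are not objects created by simstandard.

```r
# Standardized loadings of A1, A2, and A3 on A
lambda <- c(A1 = 0.9, A2 = 0.8, A3 = 0.7)

# Model-implied correlation matrix of the indicators
Sigma <- tcrossprod(lambda)
diag(Sigma) <- 1

# Regression-method weights: solve Sigma %*% w = lambda
w <- solve(Sigma, lambda)

# New data, with the same values as in the example above
X <- rbind(c(2, 2.5, 1.3),
           c(-1, -1.5, -2.1))

# Estimated factor scores are weighted sums of the indicators
fs <- drop(X %*% w)
round(fs, 1)
```

Rounded to one decimal place, these are the two A_FS values shown in the table.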
If you want to calculate equally weighted composite scores based on the indicators of each latent variable, set `composites = TRUE`.
m <- "
A =~ 0.9 * A1 + 0.8 * A2 + 0.7 * A3
"
sim_standardized(
  m,
  n = 100000,
  composites = TRUE
) %>%
  head()
A1 | A2 | A3 | A | e_A1 | e_A2 | e_A3 | A_Composite |
---|---|---|---|---|---|---|---|
0.14 | -0.70 | -0.23 | -0.45 | 0.54 | -0.34 | 0.08 | -0.30 |
-0.60 | -0.79 | 0.12 | -0.47 | -0.18 | -0.42 | 0.45 | -0.49 |
-0.34 | 0.61 | 0.24 | 0.15 | -0.47 | 0.49 | 0.13 | 0.20 |
1.56 | 1.38 | 1.39 | 1.27 | 0.42 | 0.37 | 0.50 | 1.66 |
-1.51 | -0.98 | 0.42 | -1.46 | -0.20 | 0.18 | 1.44 | -0.80 |
0.53 | 1.07 | 2.20 | 0.42 | 0.15 | 0.74 | 1.91 | 1.45 |
Composite scores with equal weights can also be added to new data:
add_composite_scores(d, m )
A1 | A2 | A3 | A_Composite |
---|---|---|---|
2 | 2.5 | 1.3 | 2.2 |
-1 | -1.5 | -2.1 | -1.8 |
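The composite scores in the table above appear to be the sum of the indicators divided by the model-implied standard deviation of that sum. Here is a minimal base-R sketch under that assumption. The names lambda, Sigma, sd_sum, and X are illustrative only.

```r
# Standardized loadings of A1, A2, and A3 on A
lambda <- c(A1 = 0.9, A2 = 0.8, A3 = 0.7)

# Model-implied correlation matrix of the indicators
Sigma <- tcrossprod(lambda)
diag(Sigma) <- 1

# The variance of a unit-weighted sum is the sum of all elements of Sigma
sd_sum <- sqrt(sum(Sigma))

# New data, with the same values as in the example above
X <- rbind(c(2, 2.5, 1.3),
           c(-1, -1.5, -2.1))

# Standardized, equally weighted composite scores
comp <- rowSums(X) / sd_sum
round(comp, 1)
```

Rounded to one decimal place, these are the two A_Composite values shown in the table.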
Suppose that we want to verify that the data generated by the sim_standardized function are correct. We will need an analogous model, but with all the fixed parameters set free. We could manually remove the fixed parameter values, but with large models the process is tedious and introduces a risk of error. The fixed2free function painlessly removes the fixed parameter values from the model.
# lavaan syntax for model
m <- "
A =~ 0.7 * A1 + 0.8 * A2 + 0.9 * A3 + 0.3 * B1
B =~ 0.7 * B1 + 0.8 * B2 + 0.9 * B3
B ~ 0.6 * A
"
# Make model m free
m_free <- fixed2free(m)
# Display model m_free
cat(m_free)
#> B ~ A
#> A =~ A1 + A2 + A3 + B1
#> B =~ B1 + B2 + B3
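To show roughly what freeing the parameters involves, here is a crude base-R sketch that strips numeric multipliers from lavaan syntax with a regular expression. It is not how fixed2free is implemented, strip_fixed is a hypothetical helper name, and the sketch only handles simple numeric coefficients such as those used in this tutorial. Unlike fixed2free, it leaves the lines in their original order.

```r
# lavaan syntax with fixed standardized coefficients
m <- "
A =~ 0.7 * A1 + 0.8 * A2 + 0.9 * A3 + 0.3 * B1
B =~ 0.7 * B1 + 0.8 * B2 + 0.9 * B3
B ~ 0.6 * A
"

# Hypothetical helper: delete every "number *" prefix.
# The pattern matches an optional minus sign, a decimal number,
# and the multiplication sign with any surrounding spaces.
strip_fixed <- function(model) {
  gsub("-?\\d*\\.?\\d+\\s*\\*\\s*", "", model, perl = TRUE)
}

m_stripped <- strip_fixed(m)
cat(m_stripped)
```

For real work, fixed2free is the safer choice because it works from lavaan's parsed model rather than from the raw text.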
Now let’s use lavaan to see if the observed data in d conform to the model in m_free.
# Set the random number generator for reproducible results
set.seed(12)
# Generate data based on model m
d <- sim_standardized(
  m,
  n = 100000,
  latent = FALSE,
  errors = FALSE)
# Evaluate the fit of model m_free on data d
library(lavaan)
lav_results <- sem(
  model = m_free,
  data = d)
# Display summary of model
summary(
lav_results, standardized = TRUE,
fit.measures = TRUE)
#> lavaan 0.6-8 ended normally after 27 iterations
#>
#> Estimator ML
#> Optimization method NLMINB
#> Number of model parameters 14
#>
#> Number of observations 100000
#>
#> Model Test User Model:
#>
#> Test statistic 7.493
#> Degrees of freedom 7
#> P-value (Chi-square) 0.379
#>
#> Model Test Baseline Model:
#>
#> Test statistic 371352.125
#> Degrees of freedom 15
#> P-value 0.000
#>
#> User Model versus Baseline Model:
#>
#> Comparative Fit Index (CFI) 1.000
#> Tucker-Lewis Index (TLI) 1.000
#>
#> Loglikelihood and Information Criteria:
#>
#> Loglikelihood user model (H0) -666610.982
#> Loglikelihood unrestricted model (H1) -666607.236
#>
#> Akaike (AIC) 1333249.965
#> Bayesian (BIC) 1333383.146
#> Sample-size adjusted Bayesian (BIC) 1333338.653
#>
#> Root Mean Square Error of Approximation:
#>
#> RMSEA 0.001
#> 90 Percent confidence interval - lower 0.000
#> 90 Percent confidence interval - upper 0.004
#> P-value RMSEA <= 0.05 1.000
#>
#> Standardized Root Mean Square Residual:
#>
#> SRMR 0.001
#>
#> Parameter Estimates:
#>
#> Standard errors Standard
#> Information Expected
#> Information saturated (h1) model Structured
#>
#> Latent Variables:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> A =~
#> A1 1.000 0.703 0.703
#> A2 1.142 0.005 231.116 0.000 0.803 0.800
#> A3 1.284 0.005 247.676 0.000 0.903 0.901
#> B1 0.427 0.004 114.547 0.000 0.300 0.300
#> B =~
#> B1 1.000 0.701 0.700
#> B2 1.139 0.005 238.139 0.000 0.798 0.798
#> B3 1.288 0.005 247.366 0.000 0.902 0.901
#>
#> Regressions:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> B ~
#> A 0.597 0.004 141.196 0.000 0.599 0.599
#>
#> Variances:
#> Estimate Std.Err z-value P(>|z|) Std.lv Std.all
#> .A1 0.507 0.003 194.579 0.000 0.507 0.506
#> .A2 0.362 0.002 165.904 0.000 0.362 0.359
#> .A3 0.188 0.002 97.276 0.000 0.188 0.188
#> .B1 0.170 0.001 130.982 0.000 0.170 0.169
#> .B2 0.362 0.002 178.775 0.000 0.362 0.362
#> .B3 0.189 0.002 106.863 0.000 0.189 0.189
#> A 0.495 0.004 121.519 0.000 1.000 1.000
#> .B 0.315 0.003 124.248 0.000 0.641 0.641
# Extract RAM paths
RAM <- lav2ram(lav_results)
# Display asymmetric paths (i.e., single-headed arrows for
# loadings and structure coefficients)
RAM$A %>% ggcor()
# Display symmetric paths (i.e., curved double-headed arrows for
# exogenous variances, error variances, disturbance variances,
# and any covariances among these)
RAM$S %>% ggcor()
As can be seen, all the fit measures indicate a near-perfect fit, and the parameter estimates are within rounding error of the fixed parameters in model m.
Although the sim_standardized function will generate data for you, you might want to use a function from a different package instead, such as lavaan::simulateData or simsem::sim. In this case, you can use the model_complete function to output the lavaan syntax for a standardized model with all standardized variances specified.
# Specify model
m <- "
A =~ 0.7 * A1 + 0.8 * A2 + 0.9 * A3 + 0.3 * B1
B =~ 0.7 * B1 + 0.8 * B2 + 0.9 * B3
B ~ 0.6 * A
"
m_complete <- model_complete(m)
# Display complete model
cat(m_complete)
#>
#> A =~ 0.7 * A1 + 0.8 * A2 + 0.9 * A3 + 0.3 * B1
#> B =~ 0.7 * B1 + 0.8 * B2 + 0.9 * B3
#> B ~ 0.6 * A
#>
#> # Variances
#> A1 ~~ 0.51 * A1
#> A2 ~~ 0.36 * A2
#> A3 ~~ 0.19 * A3
#> B1 ~~ 0.168 * B1
#> B2 ~~ 0.36 * B2
#> B3 ~~ 0.19 * B3
#> A ~~ 1 * A
#> B ~~ 0.64 * B
Suppose that a research article provides model coefficients in a table. We could spend time creating lavaan syntax by hand, but such work can be tedious. The matrix2lavaan function can help save time when the models are already specified in matrix form.
The measurement model can be specified with a matrix in which the column names are latent variables and the row names are indicator variables.
Here we have three latent variables, Vocabulary, Working Memory Capacity, and Reading, each defined by three indicator variables.
m_meas <- matrix(c(
0.8,0,0, # VC1
0.9,0,0, # VC2
0.7,0,0, # VC3
0,0.6,0, # WM1
0,0.7,0, # WM2
0,0.8,0, # WM3
0,0,0.9, # RD1
0,0,0.7, # RD2
0,0,0.8), # RD3
nrow = 9,
byrow = TRUE,
dimnames = list(
c("VC1", "VC2", "VC3",
"WM1", "WM2", "WM3",
"RD1", "RD2", "RD3"),
c("Vocabulary", "WorkingMemory", "Reading")))
The structural model can be specified with a matrix in which the predictors are the column names and the criterion variables are the row names.
Here we have Vocabulary and Working Memory Capacity predicting Reading Scores.
m_struct <- matrix(
c(0.4,0.3),
ncol = 2,
dimnames = list(
"Reading",
c("Vocabulary", "WorkingMemory")))
This could have been a 3 by 3 matrix with zeroes (which are ignored).
m_struct <- matrix(c(
0, 0, 0, # Vocabulary
0, 0, 0, # WorkingMemory
0.4, 0.3, 0), # Reading
nrow = 3,
byrow = TRUE)
rownames(m_struct) <- c("Vocabulary", "WorkingMemory", "Reading")
colnames(m_struct) <- c("Vocabulary", "WorkingMemory", "Reading")
The variances and covariances must be specified as a symmetric matrix, though variables can be omitted.
Here we specify that the latent variables Vocabulary and Working Memory Capacity are correlated.
m_cov <- matrix(c(
1, 0.5,
0.5, 1),
nrow = 2,
dimnames = list(
c("Vocabulary", "WorkingMemory"),
c("Vocabulary", "WorkingMemory")))
matrix2lavaan function
The matrix2lavaan function takes arguments for the measurement model, structural model, and covariances. Any of the three matrices can be omitted.
model <- matrix2lavaan(measurement_model = m_meas,
structural_model = m_struct,
covariances = m_cov)
cat(model)
#> Vocabulary =~ 0.8 * VC1 + 0.9 * VC2 + 0.7 * VC3
#> WorkingMemory =~ 0.6 * WM1 + 0.7 * WM2 + 0.8 * WM3
#> Reading =~ 0.9 * RD1 + 0.7 * RD2 + 0.8 * RD3
#> Reading ~ 0.4 * Vocabulary + 0.3 * WorkingMemory
#> Vocabulary ~~ 0.5 * WorkingMemory
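As a rough illustration of the measurement-model part of this conversion, the base-R sketch below turns a small loading matrix into lavaan syntax. The helper loadings_to_lavaan and the example matrix L are hypothetical and are not part of simstandard. The real matrix2lavaan function also handles structural paths, covariances, and data.frame input.

```r
# A small loading matrix: rows are indicators, columns are latent variables
L <- matrix(c(
  0.8, 0,
  0.9, 0,
  0,   0.6,
  0,   0.7),
  nrow = 4,
  byrow = TRUE,
  dimnames = list(c("x1", "x2", "x3", "x4"), c("F1", "F2")))

# Hypothetical helper: one "latent =~ loading * indicator + ..." line
# per column, skipping loadings that are exactly zero
loadings_to_lavaan <- function(L) {
  vapply(colnames(L), function(f) {
    nz <- L[, f] != 0
    terms <- paste0(L[nz, f], " * ", rownames(L)[nz], collapse = " + ")
    paste0(f, " =~ ", terms)
  }, character(1))
}

syntax <- loadings_to_lavaan(L)
cat(syntax, sep = "\n")
```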
As an alternative, the matrix2lavaan function can take data.frames (or tibbles) with either rownames or the first column as a character vector.
# A tibble with indicator variables listed in the first column
m_meas <- tibble::tribble(
~Test, ~Vocabulary, ~WorkingMemory, ~Reading,
"VC1", 0.8, 0, 0,
"VC2", 0.9, 0, 0,
"VC3", 0.7, 0, 0,
"WM1", 0, 0.6, 0,
"WM2", 0, 0.7, 0,
"WM3", 0, 0.8, 0,
"RD1", 0, 0, 0.9,
"RD2", 0, 0, 0.7,
"RD3", 0, 0, 0.8)
# A data.frame with criterion variable specified as a row name
m_struct <- data.frame(Vocabulary = 0.4,
WorkingMemory = 0.3,
row.names = "Reading")
# A data.frame with variable names specified as row names
m_cov <- data.frame(Vocabulary = c(1, 0.5),
WorkingMemory = c(0.5, 1))
rownames(m_cov) <- c("Vocabulary", "WorkingMemory")
model <- matrix2lavaan(measurement_model = m_meas,
structural_model = m_struct,
covariances = m_cov)
After specifying a standardized model with lavaan syntax, we can extract a model-implied correlation matrix. By default, we extract just the correlations among the observed variables.
get_model_implied_correlations(m) %>%
ggcor()
It is possible to extract the model-implied correlations among the observed variables, latent variables, error terms, factor scores, and composite variables. For example, here we extract correlations among the observed and latent variables:
get_model_implied_correlations(m,
latent = TRUE) %>%
ggcor()
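For readers who want to see where such correlations come from, the following base-R sketch computes them from the asymmetric (A) and symmetric (S) path matrices of the two-factor model used in this tutorial, using the standard RAM expression (I - A)^-1 S (I - A)^-T. The object names are illustrative, the residual variances are typed in by hand from the completed model syntax shown earlier, and this is not a description of the package's internals.

```r
# Variable order: six indicators, then the two latent variables
v <- c("A1", "A2", "A3", "B1", "B2", "B3", "A", "B")

# Asymmetric paths: rows are effects, columns are causes
A <- matrix(0, 8, 8, dimnames = list(v, v))
A[c("A1", "A2", "A3", "B1"), "A"] <- c(0.7, 0.8, 0.9, 0.3)
A[c("B1", "B2", "B3"), "B"] <- c(0.7, 0.8, 0.9)
A["B", "A"] <- 0.6

# Symmetric paths: error, disturbance, and exogenous variances
S <- diag(c(0.51, 0.36, 0.19, 0.168, 0.36, 0.19, 1, 0.64))
dimnames(S) <- list(v, v)

# Model-implied covariance matrix of all observed and latent variables
total <- solve(diag(8) - A)
Sigma <- total %*% S %*% t(total)

# Every variance is 1, so Sigma is also a correlation matrix
round(diag(Sigma), 3)
round(Sigma[c("A1", "B1", "B3"), c("A1", "B1", "B3")], 3)
```

As a check on the result, the implied correlation between A1 and B3 is 0.7 × 0.6 × 0.9 = 0.378, the product of the coefficients along the only path connecting them.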