This vignette is intended to provide a set of examples that nearly exhaustively demonstrate the functionalities provided by `infer`. Commentary on these examples is limited; for more discussion of the intuition behind the package, see the “Getting to Know infer” vignette, accessible by calling `vignette("infer")`.
Throughout this vignette, we’ll make use of the `gss` dataset supplied by `infer`, which contains a sample of data from the General Social Survey. See `?gss` for more information on the variables included and their source. Note that this data (and our examples on it) are for demonstration purposes only, and will not necessarily provide accurate estimates unless weighted properly. For these examples, let’s suppose that this dataset is a representative sample of a population we want to learn about: American adults. The data looks like this:
```r
# load in the dataset
data(gss)

# take a look at its structure
dplyr::glimpse(gss)
```

```
## Rows: 500
## Columns: 11
## $ year    <dbl> 2014, 1994, 1998, 1996, 1994, 1996, 1990, 2016, 2000, 1998, 20…
## $ age     <dbl> 36, 34, 24, 42, 31, 32, 48, 36, 30, 33, 21, 30, 38, 49, 25, 56…
## $ sex     <fct> male, female, male, male, male, female, female, female, female…
## $ college <fct> degree, no degree, degree, no degree, degree, no degree, no de…
## $ partyid <fct> ind, rep, ind, ind, rep, rep, dem, ind, rep, dem, dem, ind, de…
## $ hompop  <dbl> 3, 4, 1, 4, 2, 4, 2, 1, 5, 2, 4, 3, 4, 4, 2, 2, 3, 2, 1, 2, 5,…
## $ hours   <dbl> 50, 31, 40, 40, 40, 53, 32, 20, 40, 40, 23, 52, 38, 72, 48, 40…
## $ income  <ord> $25000 or more, $20000 - 24999, $25000 or more, $25000 or more…
## $ class   <fct> middle class, working class, working class, working class, mid…
## $ finrela <fct> below average, below average, below average, above average, ab…
## $ weight  <dbl> 0.8960, 1.0825, 0.5501, 1.0864, 1.0825, 1.0864, 1.0627, 0.4785…
```
Calculating the observed statistic,

```r
x_bar <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
x_bar <- gss %>%
  observe(response = hours, stat = "mean")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "mean")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = x_bar, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = x_bar, direction = "two-sided")
```

| p_value |
|---------|
| 0.038   |
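The `direction` argument to `get_p_value()` and `shade_p_value()` also accepts the one-sided alternatives `"less"` and `"greater"`. As a quick illustration, reusing the `null_dist` and `x_bar` objects from above:

```r
# a one-sided p-value against the alternative that the
# true mean is greater than 40
null_dist %>%
  get_p_value(obs_stat = x_bar, direction = "greater")
```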
Calculating the observed statistic,

```r
t_bar <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
t_bar <- gss %>%
  observe(response = hours, null = "point", mu = 40, stat = "t")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "t")
```

Alternatively, finding the null distribution using theoretical methods with the `assume()` verb,

```r
null_dist_theory <- gss %>%
  specify(response = hours) %>%
  assume("t")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = t_bar, direction = "two-sided")
```

Alternatively, visualizing the observed statistic using the theory-based null distribution,

```r
visualize(null_dist_theory) +
  shade_p_value(obs_stat = t_bar, direction = "two-sided")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```r
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = t_bar, direction = "two-sided")
```
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = t_bar, direction = "two-sided")
```

| p_value |
|---------|
| 0.028   |

Alternatively, using the `t_test` wrapper:

```r
gss %>%
  t_test(response = hours, mu = 40)
```

| statistic | t_df | p_value | alternative | estimate | lower_ci | upper_ci |
|-----------|------|---------|-------------|----------|----------|----------|
| 2.085     | 499  | 0.0376  | two.sided   | 41.38    | 40.08    | 42.68    |
`infer` does not support testing on one numerical variable via the `z` distribution.
Calculating the observed statistic,

```r
x_tilde <- gss %>%
  specify(response = age) %>%
  calculate(stat = "median")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
x_tilde <- gss %>%
  observe(response = age, stat = "median")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(response = age) %>%
  hypothesize(null = "point", med = 40) %>%
  generate(reps = 1000) %>%
  calculate(stat = "median")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = x_tilde, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = x_tilde, direction = "two-sided")
```

| p_value |
|---------|
| 0.008   |
Calculating the observed statistic,

```r
p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  calculate(stat = "prop")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
p_hat <- gss %>%
  observe(response = sex, success = "female", stat = "prop")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = p_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = p_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.254   |
Note that logical variables will be coerced to factors:

```r
null_dist <- gss %>%
  dplyr::mutate(is_female = (sex == "female")) %>%
  specify(response = is_female, success = "TRUE") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000) %>%
  calculate(stat = "prop")
```
Calculating the observed statistic,

```r
p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  calculate(stat = "z")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
p_hat <- gss %>%
  observe(response = sex, success = "female", null = "point", p = .5, stat = "z")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  hypothesize(null = "point", p = .5) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "z")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = p_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = p_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.252   |
The package also supplies a wrapper around `prop.test` for tests of a single proportion on tidy data.

```r
prop_test(gss,
          college ~ NULL,
          p = .2)
```

| statistic | chisq_df | p_value | alternative |
|-----------|----------|---------|-------------|
| 635.6     | 1        | 0       | two.sided   |
`infer` does not support testing two means via the `z` distribution.

The `infer` package provides several statistics to work with data of this type. One of them is the statistic for difference in proportions.
Calculating the observed statistic,

```r
d_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "diff in props", order = c("female", "male"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
d_hat <- gss %>%
  observe(college ~ sex, success = "no degree",
          stat = "diff in props", order = c("female", "male"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "diff in props", order = c("female", "male"))
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.994   |
`infer` also provides functionality to calculate ratios of proportions. The workflow looks similar to that for `diff in props`.

Calculating the observed statistic,

```r
r_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "ratio of props", order = c("female", "male"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
r_hat <- gss %>%
  observe(college ~ sex, success = "no degree",
          stat = "ratio of props", order = c("female", "male"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "ratio of props", order = c("female", "male"))
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = r_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = r_hat, direction = "two-sided")
```

| p_value |
|---------|
| 1       |
In addition, the package provides functionality to calculate odds ratios. The workflow also looks similar to that for `diff in props`.

Calculating the observed statistic,

```r
or_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  calculate(stat = "odds ratio", order = c("female", "male"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "odds ratio", order = c("female", "male"))
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = or_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = or_hat, direction = "two-sided")
```

| p_value |
|---------|
| 1       |
Finding the standardized observed statistic,

```r
z_hat <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "z", order = c("female", "male"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
z_hat <- gss %>%
  observe(college ~ sex, success = "no degree",
          stat = "z", order = c("female", "male"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000) %>%
  calculate(stat = "z", order = c("female", "male"))
```

Alternatively, finding the null distribution using theoretical methods with the `assume()` verb,

```r
null_dist_theory <- gss %>%
  specify(college ~ sex, success = "no degree") %>%
  assume("z")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = z_hat, direction = "two-sided")
```

Alternatively, visualizing the observed statistic using the theory-based null distribution,

```r
visualize(null_dist_theory) +
  shade_p_value(obs_stat = z_hat, direction = "two-sided")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```r
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = z_hat, direction = "two-sided")
```
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = z_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.992   |
Note the similarities in this plot and the previous one.
The package also supplies a wrapper around `prop.test` to allow for tests of equality of proportions on tidy data.

```r
prop_test(gss,
          college ~ sex,
          order = c("female", "male"))
```

| statistic | chisq_df | p_value | alternative | lower_ci | upper_ci |
|-----------|----------|---------|-------------|----------|----------|
| 0         | 1        | 0.9964  | two.sided   | -0.1009  | 0.0917   |
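`prop_test()` also takes a `z` argument; when the explanatory variable has exactly two levels, setting `z = TRUE` reports a z-statistic in place of the default chi-square statistic. A minimal sketch, assuming this argument behaves as in recent `infer` releases:

```r
# request a z-statistic rather than the default chi-square statistic
prop_test(gss,
          college ~ sex,
          order = c("female", "male"),
          z = TRUE)
```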
Calculating the observed statistic,

Note the need to add in the hypothesized values here to compute the observed statistic.

```r
Chisq_hat <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  calculate(stat = "Chisq")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
Chisq_hat <- gss %>%
  observe(response = finrela,
          null = "point",
          p = c("far below average" = 1/6,
                "below average" = 1/6,
                "average" = 1/6,
                "above average" = 1/6,
                "far above average" = 1/6,
                "DK" = 1/6),
          stat = "Chisq")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(response = finrela) %>%
  hypothesize(null = "point",
              p = c("far below average" = 1/6,
                    "below average" = 1/6,
                    "average" = 1/6,
                    "above average" = 1/6,
                    "far above average" = 1/6,
                    "DK" = 1/6)) %>%
  generate(reps = 1000, type = "draw") %>%
  calculate(stat = "Chisq")
```

Alternatively, finding the null distribution using theoretical methods with the `assume()` verb,

```r
null_dist_theory <- gss %>%
  specify(response = finrela) %>%
  assume("Chisq")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using the theory-based null distribution,

```r
visualize(null_dist_theory) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```r
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
```

| p_value |
|---------|
| 0       |

Alternatively, using the `chisq_test` wrapper:

```r
chisq_test(gss,
           response = finrela,
           p = c("far below average" = 1/6,
                 "below average" = 1/6,
                 "average" = 1/6,
                 "above average" = 1/6,
                 "far above average" = 1/6,
                 "DK" = 1/6))
```

| statistic | chisq_df | p_value |
|-----------|----------|---------|
| 488       | 5        | 0       |
Calculating the observed statistic,

```r
Chisq_hat <- gss %>%
  specify(formula = finrela ~ sex) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "Chisq")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
Chisq_hat <- gss %>%
  observe(formula = finrela ~ sex, stat = "Chisq")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(finrela ~ sex) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "Chisq")
```

Alternatively, finding the null distribution using theoretical methods with the `assume()` verb,

```r
null_dist_theory <- gss %>%
  specify(finrela ~ sex) %>%
  assume(distribution = "Chisq")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using the theory-based null distribution,

```r
visualize(null_dist_theory) +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```r
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = Chisq_hat, direction = "greater")
```
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = Chisq_hat, direction = "greater")
```

| p_value |
|---------|
| 0.119   |

Alternatively, using the wrapper to carry out the test,

```r
gss %>%
  chisq_test(formula = finrela ~ sex)
```

| statistic | chisq_df | p_value |
|-----------|----------|---------|
| 9.105     | 5        | 0.1049  |
Calculating the observed statistic,

```r
d_hat <- gss %>%
  specify(age ~ college) %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
d_hat <- gss %>%
  observe(age ~ college,
          stat = "diff in means", order = c("degree", "no degree"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.414   |
Finding the standardized observed statistic,

```r
t_hat <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "t", order = c("degree", "no degree"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
t_hat <- gss %>%
  observe(age ~ college,
          stat = "t", order = c("degree", "no degree"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(age ~ college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("degree", "no degree"))
```

Alternatively, finding the null distribution using theoretical methods with the `assume()` verb,

```r
null_dist_theory <- gss %>%
  specify(age ~ college) %>%
  assume("t")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")
```

Alternatively, visualizing the observed statistic using the theory-based null distribution,

```r
visualize(null_dist_theory) +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```r
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = t_hat, direction = "two-sided")
```
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = t_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.404   |
Note the similarities in this plot and the previous one.
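The `t_test()` wrapper also handles this two-sample case directly. A minimal sketch, where the `order` argument works as in `calculate()`:

```r
# two-sample t-test via the wrapper; order sets the direction
# of the subtraction, as in calculate()
gss %>%
  t_test(formula = age ~ college,
         order = c("degree", "no degree"),
         alternative = "two-sided")
```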
Calculating the observed statistic,

```r
d_hat <- gss %>%
  specify(age ~ college) %>%
  calculate(stat = "diff in medians", order = c("degree", "no degree"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
d_hat <- gss %>%
  observe(age ~ college,
          stat = "diff in medians", order = c("degree", "no degree"))
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(age ~ college) %>% # alt: response = age, explanatory = college
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in medians", order = c("degree", "no degree"))
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = d_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = d_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.136   |
Calculating the observed statistic,

```r
F_hat <- gss %>%
  specify(age ~ partyid) %>%
  calculate(stat = "F")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
F_hat <- gss %>%
  observe(age ~ partyid, stat = "F")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")
```

Alternatively, finding the null distribution using theoretical methods with the `assume()` verb,

```r
null_dist_theory <- gss %>%
  specify(age ~ partyid) %>%
  hypothesize(null = "independence") %>%
  assume(distribution = "F")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = F_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using the theory-based null distribution,

```r
visualize(null_dist_theory) +
  shade_p_value(obs_stat = F_hat, direction = "greater")
```

Alternatively, visualizing the observed statistic using both of the null distributions,

```r
visualize(null_dist, method = "both") +
  shade_p_value(obs_stat = F_hat, direction = "greater")
```
Note that the above code makes use of the randomization-based null distribution.
Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = F_hat, direction = "greater")
```

| p_value |
|---------|
| 0.066   |
Calculating the observed statistic,

```r
slope_hat <- gss %>%
  specify(hours ~ age) %>%
  calculate(stat = "slope")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
slope_hat <- gss %>%
  observe(hours ~ age, stat = "slope")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(hours ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = slope_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = slope_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.842   |
Calculating the observed statistic,

```r
correlation_hat <- gss %>%
  specify(hours ~ age) %>%
  calculate(stat = "correlation")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
correlation_hat <- gss %>%
  observe(hours ~ age, stat = "correlation")
```

Then, generating the null distribution,

```r
null_dist <- gss %>%
  specify(hours ~ age) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "correlation")
```

Visualizing the observed statistic alongside the null distribution,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = correlation_hat, direction = "two-sided")
```

Calculating the p-value from the null distribution and observed statistic,

```r
null_dist %>%
  get_p_value(obs_stat = correlation_hat, direction = "two-sided")
```

| p_value |
|---------|
| 0.854   |
Not currently implemented, since `t` could refer to either the standardized slope or the standardized correlation.
Calculating the observed fit,

```r
obs_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()
```

Generating a distribution of fits with the response variable permuted,

```r
null_dist <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()
```

Generating a distribution of fits where each explanatory variable is permuted independently,

```r
null_dist2 <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute", variables = c(age, college)) %>%
  fit()
```

Visualizing the observed fit alongside the null fits,

```r
visualize(null_dist) +
  shade_p_value(obs_stat = obs_fit, direction = "two-sided")
```

Calculating p-values from the null distribution and observed fit,

```r
null_dist %>%
  get_p_value(obs_stat = obs_fit, direction = "two-sided")
```

| term          | p_value |
|---------------|---------|
| age           | 0.858   |
| collegedegree | 0.348   |
| intercept     | 0.700   |
Note that this `fit()`-based workflow can be applied to use cases with differing numbers of explanatory variables and explanatory variable types, as sketched below.
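For instance, the same pipeline runs unchanged with a different right-hand side. This sketch, using other columns from `gss`, is illustrative rather than part of the vignette's analysis; it fits a model with one numeric and two categorical explanatory variables:

```r
# same workflow, different explanatory variables
obs_fit_2 <- gss %>%
  specify(hours ~ age + college + sex) %>%
  fit()
```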
Finding the observed statistic,

```r
x_bar <- gss %>%
  specify(response = hours) %>%
  calculate(stat = "mean")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
x_bar <- gss %>%
  observe(response = hours, stat = "mean")
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```
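By default, `get_ci()` returns a 95% percentile interval; its `level` argument adjusts the confidence level. For example:

```r
# a 90% percentile interval from the same bootstrap distribution
get_ci(boot_dist, level = 0.90)
```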
Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- get_ci(boot_dist, type = "se", point_estimate = x_bar)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```

Instead of a simulation-based bootstrap distribution, we can also define a theory-based sampling distribution,

```r
sampling_dist <- gss %>%
  specify(response = hours) %>%
  assume(distribution = "t")
```

Visualization and calculation of confidence intervals interface in the same way as with the simulation-based distribution,

```r
theor_ci <- get_ci(sampling_dist, point_estimate = x_bar)

theor_ci
```

| lower_ci | upper_ci |
|----------|----------|
| 40.08    | 42.68    |

```r
visualize(sampling_dist) +
  shade_confidence_interval(endpoints = theor_ci)
```

Note that the `t` distribution is recentered and rescaled to lie on the scale of the observed data. `infer` does not support confidence intervals on means via the `z` distribution.
Finding the observed statistic,

```r
t_hat <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
t_hat <- gss %>%
  observe(response = hours,
          null = "point", mu = 40,
          stat = "t")
```

Then, generating the bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(response = hours) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t")
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = t_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```

See the above subsection (one mean) for a theory-based approach. Note that `infer` does not support confidence intervals on means via the `z` distribution.
Finding the observed statistic,

```r
p_hat <- gss %>%
  specify(response = sex, success = "female") %>%
  calculate(stat = "prop")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
p_hat <- gss %>%
  observe(response = sex, success = "female", stat = "prop")
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "prop")
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = p_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```

Instead of a simulation-based bootstrap distribution, we can also define a theory-based sampling distribution,

```r
sampling_dist <- gss %>%
  specify(response = sex, success = "female") %>%
  assume(distribution = "z")
```

Visualization and calculation of confidence intervals interface in the same way as with the simulation-based distribution,

```r
theor_ci <- get_ci(sampling_dist, point_estimate = p_hat)

theor_ci
```

| lower_ci | upper_ci |
|----------|----------|
| 0.4302   | 0.5178   |

```r
visualize(sampling_dist) +
  shade_confidence_interval(endpoints = theor_ci)
```

Note that the `z` distribution is recentered and rescaled to lie on the scale of the observed data. `infer` does not support confidence intervals on means via the `z` distribution.
See the above subsection (one proportion) for a theory-based approach.
Finding the observed statistic,

```r
d_hat <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
d_hat <- gss %>%
  observe(hours ~ college,
          stat = "diff in means", order = c("degree", "no degree"))
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(hours ~ college) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("degree", "no degree"))
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = d_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```

Instead of a simulation-based bootstrap distribution, we can also define a theory-based sampling distribution,

```r
sampling_dist <- gss %>%
  specify(hours ~ college) %>%
  assume(distribution = "t")
```

Visualization and calculation of confidence intervals interface in the same way as with the simulation-based distribution,

```r
theor_ci <- get_ci(sampling_dist, point_estimate = d_hat)

theor_ci
```

| lower_ci | upper_ci |
|----------|----------|
| -1.164   | 4.241    |

```r
visualize(sampling_dist) +
  shade_confidence_interval(endpoints = theor_ci)
```

Note that the `t` distribution is recentered and rescaled to lie on the scale of the observed data.
`infer` also provides functionality to calculate ratios of means. The workflow looks similar to that for `diff in means`.

Finding the observed statistic,

```r
d_hat <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "ratio of means", order = c("degree", "no degree"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
d_hat <- gss %>%
  observe(hours ~ college,
          stat = "ratio of means", order = c("degree", "no degree"))
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(hours ~ college) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "ratio of means", order = c("degree", "no degree"))
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = d_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```
Finding the standardized point estimate,

```r
t_hat <- gss %>%
  specify(hours ~ college) %>%
  calculate(stat = "t", order = c("degree", "no degree"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
t_hat <- gss %>%
  observe(hours ~ college,
          stat = "t", order = c("degree", "no degree"))
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(hours ~ college) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t", order = c("degree", "no degree"))
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = t_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```

See the above subsection (diff in means) for a theory-based approach. `infer` does not support confidence intervals on means via the `z` distribution.
Finding the observed statistic,

```r
d_hat <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  calculate(stat = "diff in props", order = c("female", "male"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
d_hat <- gss %>%
  observe(college ~ sex, success = "degree",
          stat = "diff in props", order = c("female", "male"))
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "diff in props", order = c("female", "male"))
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = d_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```

Instead of a simulation-based bootstrap distribution, we can also define a theory-based sampling distribution,

```r
sampling_dist <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  assume(distribution = "z")
```

Visualization and calculation of confidence intervals interface in the same way as with the simulation-based distribution,

```r
theor_ci <- get_ci(sampling_dist, point_estimate = d_hat)

theor_ci
```

| lower_ci | upper_ci |
|----------|----------|
| -0.0794  | 0.0878   |

```r
visualize(sampling_dist) +
  shade_confidence_interval(endpoints = theor_ci)
```

Note that the `z` distribution is recentered and rescaled to lie on the scale of the observed data.
Finding the standardized point estimate,

```r
z_hat <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  calculate(stat = "z", order = c("female", "male"))
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
z_hat <- gss %>%
  observe(college ~ sex, success = "degree",
          stat = "z", order = c("female", "male"))
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(college ~ sex, success = "degree") %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "z", order = c("female", "male"))
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = z_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```
See the above subsection (diff in props) for a theory-based approach.
Finding the observed statistic,

```r
slope_hat <- gss %>%
  specify(hours ~ age) %>%
  calculate(stat = "slope")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
slope_hat <- gss %>%
  observe(hours ~ age, stat = "slope")
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(hours ~ age) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "slope")
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = slope_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```
Finding the observed statistic,

```r
correlation_hat <- gss %>%
  specify(hours ~ age) %>%
  calculate(stat = "correlation")
```

Alternatively, using the `observe()` wrapper to calculate the observed statistic,

```r
correlation_hat <- gss %>%
  observe(hours ~ age, stat = "correlation")
```

Then, generating a bootstrap distribution,

```r
boot_dist <- gss %>%
  specify(hours ~ age) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "correlation")
```

Use the bootstrap distribution to find a confidence interval,

```r
percentile_ci <- get_ci(boot_dist)
```

Visualizing the observed statistic alongside the distribution,

```r
visualize(boot_dist) +
  shade_confidence_interval(endpoints = percentile_ci)
```

Alternatively, use the bootstrap distribution to find a confidence interval using the standard error,

```r
standard_error_ci <- boot_dist %>%
  get_ci(type = "se", point_estimate = correlation_hat)

visualize(boot_dist) +
  shade_confidence_interval(endpoints = standard_error_ci)
```
Not currently implemented, since `t` could refer to either the standardized slope or the standardized correlation.
Calculating the observed fit,

```r
obs_fit <- gss %>%
  specify(hours ~ age + college) %>%
  fit()
```

Generating a distribution of fits with the response variable permuted,

```r
null_dist <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  fit()
```

Alternatively, generating a distribution of fits where each explanatory variable is permuted independently,

```r
null_dist2 <- gss %>%
  specify(hours ~ age + college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute", variables = c(age, college)) %>%
  fit()
```

Calculating confidence intervals from the null fits,

```r
conf_ints <- get_confidence_interval(
  null_dist,
  level = .95,
  point_estimate = obs_fit
)
```

Visualizing the observed fit alongside the null fits,

```r
visualize(null_dist) +
  shade_confidence_interval(endpoints = conf_ints)
```

Note that this `fit()`-based workflow can be applied to use cases with differing numbers of explanatory variables and explanatory variable types.
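The alternative `null_dist2` object interfaces in the same way. A sketch, under the assumption that you instead want intervals computed against the independently-permuted fits:

```r
# same call, but against the fits with each explanatory
# variable permuted independently
conf_ints2 <- get_confidence_interval(
  null_dist2,
  level = .95,
  point_estimate = obs_fit
)
```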