Table of contents:
Univariate percentages and 95% logit transformed confidence intervals
Univariate percentages and 95% Wald confidence intervals
Bivariate percentages and 95% logit transformed confidence intervals
Interpretation of confidence intervals
The freqtables package is designed to quickly make tables of descriptive statistics for categorical variables (i.e., counts, percentages, confidence intervals). It is designed to work in a Tidyverse pipeline, and consideration has been given to getting results from R into Microsoft Word® with minimal pain.
The package currently consists of the following functions:
1. freq_table(): Estimate Percentages and 95 Percent Confidence Intervals in dplyr Pipelines.
2. freq_test(): Hypothesis Testing For Frequency Tables.
3. freq_format(): Format freq_table Output for Publication and Dissemination.
4. freq_group_n(): Formatted Group Sample Sizes for Tables.
This vignette is not intended to be representative of every possible descriptive analysis that one may want to carry out on a given data set. Rather, it is intended to be representative of descriptive analyses that are commonly used when conducting epidemiologic research.
library(dplyr)
library(freqtables)
data(mtcars)
In this section we provide an example of calculating common univariate descriptive statistics for a single categorical variable. We assume that we are working in a dplyr pipeline and that we are passing a data frame to the freq_table() function.
The default confidence intervals are logit transformed - matching the method used by Stata: https://www.stata.com/manuals13/rproportion.pdf
mtcars %>%
  freq_table(am)
## var cat n n_total percent se t_crit lcl ucl
## 1 am 0 19 32 59.375 8.820997 2.039513 40.94225 75.49765
## 2 am 1 13 32 40.625 8.820997 2.039513 24.50235 59.05775
Interpretation of results:
var contains the name of the variable passed to the freq_table() function.
cat contains the unique levels (values) of the variable in var.
n contains a count of the number of rows in the data frame that have the value cat for the variable var.
n_total contains the sum of n.
percent = n / n_total, multiplied by 100.
se = \(\sqrt{proportion * (1 - proportion) / (n_{total} - 1)}\), where proportion = n / n_total. It is reported on the percent scale (multiplied by 100).
t_crit is the critical value from Student’s t distribution with n_total - 1 degrees of freedom. The default probability value is 0.975, which corresponds to an alpha of 0.05.
lcl is the lower bound of the confidence interval. By default, it is a 95% confidence interval.
ucl is the upper bound of the confidence interval. By default, it is a 95% confidence interval.
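The columns described above can be reproduced by hand. The following is a minimal base R sketch, not the package's implementation. It assumes the logit interval is built on the log-odds scale with a delta-method standard error and then back-transformed. The intermediate names (p, se_logit, univariate) are illustrative only. Under these assumptions the result agrees with the freq_table(am) output shown earlier.

```r
# Base R sketch of the univariate columns for mtcars$am.
# Assumption: delta-method standard error on the logit scale.
n       <- table(mtcars$am)                    # counts per level of am
n_total <- sum(n)                              # total number of rows
p       <- as.numeric(n) / n_total             # proportions
se      <- sqrt(p * (1 - p) / (n_total - 1))   # note n_total - 1, not n - 1
t_crit  <- qt(0.975, df = n_total - 1)         # two-sided 95% critical value

# Build the interval on the log-odds scale, then back-transform with plogis()
logit_p  <- log(p / (1 - p))
se_logit <- se / (p * (1 - p))
lcl <- plogis(logit_p - t_crit * se_logit)
ucl <- plogis(logit_p + t_crit * se_logit)

univariate <- data.frame(
  cat     = names(n),
  n       = as.integer(n),
  n_total = as.integer(n_total),
  percent = p * 100,
  se      = se * 100,
  t_crit  = t_crit,
  lcl     = lcl * 100,
  ucl     = ucl * 100
)
print(univariate)
```

Because the interval is symmetric on the log-odds scale rather than on the percent scale, the limits always stay between 0 and 100.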
Compare to Stata:
Optionally, the ci_type = "wald" argument can be used to calculate Wald confidence intervals that match those returned by SAS.
The exact methods are documented here:
mtcars %>%
  freq_table(am, ci_type = "wald")
## var cat n n_total percent se t_crit lcl ucl
## 1 am 0 19 32 59.375 8.820997 2.039513 41.38446 77.36554
## 2 am 1 13 32 40.625 8.820997 2.039513 22.63446 58.61554
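As a rough check on the output above, the Wald limits can be computed in base R as the percent plus or minus t_crit times se. This is a sketch under that assumption, not the package's code, and the data frame name wald is illustrative.

```r
# Base R sketch of Wald limits for mtcars$am: estimate +/- t_crit * se
n_total <- nrow(mtcars)
p       <- as.numeric(table(mtcars$am)) / n_total
se      <- sqrt(p * (1 - p) / (n_total - 1))
t_crit  <- qt(0.975, df = n_total - 1)

wald <- data.frame(
  cat     = c("0", "1"),
  percent = p * 100,
  lcl     = (p - t_crit * se) * 100,
  ucl     = (p + t_crit * se) * 100
)
print(wald)
```

Unlike the logit-transformed interval, this interval is symmetric around the point estimate and can extend below 0 or above 100 when the percent is near a boundary.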
Compare to SAS:
The default behavior of freq_table() is to return two-sided 95% confidence intervals. However, this behavior can be adjusted to return intervals at any confidence level. For example, to return 99% confidence intervals instead, pass 99 to the percent_ci parameter of freq_table() as demonstrated below.
mtcars %>%
  freq_table(am, percent_ci = 99)
## var cat n n_total percent se t_crit lcl ucl
## 1 am 0 19 32 59.375 8.820997 2.744042 34.88730 79.94685
## 2 am 1 13 32 40.625 8.820997 2.744042 20.05315 65.11270
Notice that the lower bounds of the 99% confidence limits (34.88730 and 20.05315) are less than the lower bounds of the 95% confidence limits (40.94225 and 24.50235). Likewise, the upper bounds of the 99% confidence limits (79.94685 and 65.11270) are greater than the upper bounds of the 95% confidence limits (75.49765 and 59.05775).
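The wider interval follows from a larger critical value. The base R sketch below shows how a confidence level maps to alpha and then to t_crit for the 32 rows of mtcars. Here percent_ci is just a plain vector that borrows the parameter's name, and levels_tbl is an illustrative name.

```r
# How the confidence level determines the t critical value (df = 32 - 1)
percent_ci <- c(90, 95, 99)
alpha      <- 1 - percent_ci / 100
t_crit     <- qt(1 - alpha / 2, df = nrow(mtcars) - 1)

levels_tbl <- data.frame(percent_ci, alpha, t_crit)
print(levels_tbl)
```

The 95% and 99% rows give the t_crit values printed in the two univariate tables above (2.039513 and 2.744042).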
In this section we provide an example of calculating common bivariate descriptive statistics for categorical variables. Currently, all confidence intervals for (grouped) row percentages are logit transformed, matching the method used by Stata: https://www.stata.com/manuals13/rproportion.pdf
At this time, you may pass two variables to the ... argument of the freq_table() function. The first variable is labeled row_var in the resulting frequency table. The second variable passed to freq_table() is labeled col_var in the resulting frequency table. These labels are somewhat arbitrary and uninformative, but are used to match common naming conventions. Having said that, they may change in the future.
The resulting frequency table is organized so that the n and percent of observations where the value of col_var equals col_cat are calculated within levels of row_cat. For example, the frequency table below tells us that there are 11 rows (n_row) with a value of 4 (row_cat) for the variable cyl (row_var). Among those 11 rows only, there are 3 rows (n) with a value of 0 (col_cat) for the variable am (col_var), and 8 rows (n) with a value of 1 (col_cat) for the variable am (col_var).
mtcars %>%
  freq_table(cyl, am)
## # A tibble: 6 × 17
## row_var row_cat col_var col_cat n n_row n_total percent_total se_total
## <chr> <chr> <chr> <chr> <int> <int> <int> <dbl> <dbl>
## 1 cyl 4 am 0 3 11 32 9.38 5.24
## 2 cyl 4 am 1 8 11 32 25 7.78
## 3 cyl 6 am 0 4 7 32 12.5 5.94
## 4 cyl 6 am 1 3 7 32 9.38 5.24
## 5 cyl 8 am 0 12 14 32 37.5 8.70
## 6 cyl 8 am 1 2 14 32 6.25 4.35
## # … with 8 more variables: t_crit_total <dbl>, lcl_total <dbl>,
## # ucl_total <dbl>, percent_row <dbl>, se_row <dbl>, t_crit_row <dbl>,
## # lcl_row <dbl>, ucl_row <dbl>
Interpretation of results:
row_var contains the name of the first variable passed to the ... argument of the freq_table() function.
row_cat contains the levels (values) of the variable in row_var.
col_var contains the name of the second variable passed to the ... argument of the freq_table() function.
col_cat contains the levels (values) of the variable in col_var.
n contains a count of the number of rows in the data frame that have the value row_cat for the variable row_var AND the value col_cat for the variable col_var.
n_row contains the sum of n for each level of row_cat.
n_total contains the sum of n.
percent_total = n / n_total, multiplied by 100.
se_total = \(\sqrt{proportion_{overall} * (1 - proportion_{overall}) / (n_{overall} - 1)}\), where \(proportion_{overall}\) = n / n_total and \(n_{overall}\) = n_total. It is reported on the percent scale.
t_crit_total is the critical value from Student’s t distribution with n_total - 1 degrees of freedom. The default probability value is 0.975, which corresponds to an alpha of 0.05.
lcl_total is the lower bound of the confidence interval around percent_total. By default, it is a 95% confidence interval.
ucl_total is the upper bound of the confidence interval around percent_total. By default, it is a 95% confidence interval.
percent_row = n / n_row, multiplied by 100.
se_row = \(\sqrt{proportion_{row} * (1 - proportion_{row}) / (n_{row} - 1)}\), where \(proportion_{row}\) = n / n_row. It is reported on the percent scale.
t_crit_row is the critical value from Student’s t distribution with n_total - 1 degrees of freedom. The default probability value is 0.975, which corresponds to an alpha of 0.05.
lcl_row is the lower bound of the confidence interval around percent_row. By default, it is a 95% confidence interval.
ucl_row is the upper bound of the confidence interval around percent_row. By default, it is a 95% confidence interval.
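The counts, percentages, and standard errors described above can be checked with a short base R sketch. It is not the package's implementation. It reproduces only n, n_row, n_total, and the percent and se columns, and it leaves the confidence limits out. The name bivariate and the plain column names cyl and am are illustrative.

```r
# Base R sketch of the bivariate counts, percents, and standard errors
counts  <- table(cyl = mtcars$cyl, am = mtcars$am)   # 3 x 2 contingency table
n_total <- sum(counts)
n_row   <- rowSums(counts)                           # named by cyl level

# One row per cyl/am combination; Freq holds the cell count
cell <- as.data.frame(counts, stringsAsFactors = FALSE)
cell <- cell[order(cell$cyl, cell$am), ]
cell$n_row <- as.numeric(n_row[cell$cyl])

prop_total <- cell$Freq / n_total      # share of all rows
prop_row   <- cell$Freq / cell$n_row   # share within the cyl level

bivariate <- data.frame(
  cyl           = cell$cyl,
  am            = cell$am,
  n             = cell$Freq,
  n_row         = cell$n_row,
  n_total       = as.integer(n_total),
  percent_total = prop_total * 100,
  se_total      = sqrt(prop_total * (1 - prop_total) / (n_total - 1)) * 100,
  percent_row   = prop_row * 100,
  se_row        = sqrt(prop_row * (1 - prop_row) / (cell$n_row - 1)) * 100
)
print(bivariate)
```

For example, the row for cyl = 4 and am = 1 shows n = 8 out of n_row = 11, which is 25% of all 32 rows but about 72.7% of the four-cylinder cars.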
Compare to Stata:
These estimates do not match those generated by SAS, which uses a different variance estimation method (https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_surveyfreq_a0000000217.htm).
The following are frequentist interpretations for 95% confidence intervals taken from relevant texts and peer-reviewed journal articles.
Biostatistics: A foundation for analysis in the health sciences
In repeated sampling, from a normally distributed population with a known standard deviation, 95% of all intervals will in the long run include the population mean.
Daniel, W. W., & Cross, C. L. (2013). Biostatistics: A foundation for analysis in the health sciences (Tenth). Hoboken, NJ: Wiley.
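The repeated-sampling statement can be illustrated with a small simulation. This is a sketch with arbitrary, assumed values: a normal population with mean 50 and known standard deviation 10, samples of size 25, and 10,000 repetitions. None of these numbers come from the quoted text.

```r
# Simulated long-run coverage of a 95% interval for a normal mean, known SD
set.seed(2024)
mu    <- 50       # assumed true population mean
sigma <- 10       # assumed known population standard deviation
n     <- 25       # sample size per repetition
reps  <- 10000    # number of repeated samples
half  <- qnorm(0.975) * sigma / sqrt(n)   # half-width is the same every time

covered <- replicate(reps, {
  x_bar <- mean(rnorm(n, mean = mu, sd = sigma))
  (x_bar - half) <= mu && mu <= (x_bar + half)
})

coverage <- mean(covered)   # proportion of intervals that contain mu
print(coverage)
```

The proportion should land close to 0.95. Each individual interval either contains mu or does not, which is the distinction the later quotations emphasize.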
Fundamentals of biostatistics
You may be puzzled at this point as to what a CI is. The parameter µ is a fixed unknown constant. How can we state that the probability that it lies within some specific interval is, for example, 95%? The important point to understand is that the boundaries of the interval depend on the sample mean and sample variance and vary from sample to sample. Furthermore, 95% of such intervals that could be constructed from repeated random samples of size n contain the parameter µ.
The idea is that over a large number of hypothetical samples of size 10, 95% of such intervals contain the parameter µ. Any one interval from a particular sample may or may not contain the parameter µ. In Figure 6.7, by chance all five intervals contain the parameter µ. However, with additional random samples this need not be the case. Therefore, we cannot say there is a 95% chance that the parameter µ will fall within a particular 95% CI. However, we can say the following: The length of the CI gives some idea of the precision of the point estimate x̄. In this particular case, the length of each CI ranges from 20 to 47 oz, which makes the precision of the point estimate x̄ doubtful and implies that a larger sample size is needed to get a more precise estimate of µ.
Rosner, B. (2015). Fundamentals of biostatistics (Eighth). MA: Cengage Learning.
Statistical modeling: A fresh approach
Treat the confidence interval just as an indication of the precision of the measurement. If you do a study that finds a statistic of 17 ± 6 and someone else does a study that gives 23 ± 5, then there is little reason to think that the two studies are inconsistent. On the other hand, if your study gives 17 ± 2 and the other study is 23 ± 1, then something seems to be going on; you have a genuine disagreement on your hands.
Kaplan, D. T. (2017). Statistical modeling: A fresh approach (Second). Project MOSAIC Books.
Modern epidemiology
If the underlying statistical model is correct and there is no bias, a confidence interval derived from a valid test will, over unlimited repetitions of the study, contain the true parameter with a frequency no less than its confidence level. This definition specifies the coverage property of the method used to generate the interval, not the probability that the true parameter value lies within the interval. For example, if the confidence level of a valid confidence interval is 90%, the frequency with which the interval will contain the true parameter will be at least 90%, if there is no bias. Consequently, under the assumed model for random variability (e.g., a binomial model, as described in Chapter 14) and with no bias, we should expect the confidence interval to include the true parameter value in at least 90% of replications of the process of obtaining the data. Unfortunately, this interpretation for the confidence interval is based on probability models and sampling properties that are seldom realized in epidemiologic studies; consequently, it is preferable to view the confidence limits as only a rough estimate of the uncertainty in an epidemiologic result due to random error alone. Even with this limited interpretation, the estimate depends on the correctness of the statistical model, which may be incorrect in many epidemiologic settings (Greenland, 1990).
Furthermore, exact 95% confidence limits for the true rate ratio are 0.7–13. The fact that the null value (which, for the rate ratio, is 1.0) is within the interval tells us the outcome of the significance test: The estimate would not be statistically significant at the 1 - 0.95 = 0.05 alpha level. The confidence limits, however, indicate that these data, although statistically compatible with no association, are even more compatible with a strong association — assuming that the statistical model used to construct the limits is correct. Stating the latter assumption is important because confidence intervals, like P-values, do nothing to address biases that may be present.
Indeed, because statistical hypothesis testing promotes so much misinterpretation, we recommend avoiding its use in epidemiologic presentations and research reports. Such avoidance requires that P-values (when used) be presented without reference to alpha levels or “statistical significance,” and that careful attention be paid to the confidence interval, especially its width and its endpoints (the confidence limits) (Altman et al., 2000; Poole, 2001c).
An astute investigator may properly ask what frequency interpretations have to do with the single study under analysis. It is all very well to say that an interval estimation procedure will, in 95% of repetitions, produce limits that contain the true parameter. But in analyzing a given study, the relevant scientific question is this: Does the single pair of limits produced from this one study contain the true parameter? The ordinary (frequentist) theory of confidence intervals does not answer this question. The question is so important that many (perhaps most) users of confidence intervals mistakenly interpret the confidence level of the interval as the probability that the answer to the question is “yes.” It is quite tempting to say that the 95% confidence limits computed from a study contain the true parameter with 95% probability. Unfortunately, this interpretation can be correct only for Bayesian interval estimates (discussed later and in Chapter 18), which often diverge from ordinary confidence intervals.
Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Modern epidemiology (Third). Philadelphia, PA: Lippincott Williams & Wilkins.
Greenland, 2016
The specific 95% confidence interval presented by a study has a 95% chance of containing the true effect size. No! A reported confidence interval is a range between two numbers. The frequency with which an observed interval (e.g., 0.72–2.88) contains the true effect is either 100% if the true effect is within the interval or 0% if not; the 95% refers only to how often 95% confidence intervals computed from very many studies would contain the true size if all the assumptions used to compute the intervals were correct.
The 95% confidence intervals from two subgroups or studies may overlap substantially and yet the test for difference between them may still produce P < 0.05. Suppose for example, two 95% confidence intervals for means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap, yet the test of the hypothesis of no difference in effect across studies gives P = 0.03. As with P values, comparison between groups requires statistics that directly test and estimate the differences across groups. It can, however, be noted that if the two 95% confidence intervals fail to overlap, then when using the same assumptions used to compute the confidence intervals we will find P < 0.05 for the difference; and if one of the 95% intervals contains the point estimate from the other group or study, we will find P > 0.05 for the difference.
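The arithmetic behind the overlapping-interval example can be worked through in base R. This sketch assumes, as the quotation does, normal means with known variances, so each standard error can be recovered from the interval width and the difference can be tested with a z statistic. The variable names are illustrative.

```r
# Two overlapping 95% intervals whose difference still gives P < 0.05
ci_a <- c(1.04, 4.96)
ci_b <- c(4.16, 19.84)
z    <- qnorm(0.975)

# Recover each point estimate and standard error from its interval
est_a <- mean(ci_a); se_a <- diff(ci_a) / (2 * z)
est_b <- mean(ci_b); se_b <- diff(ci_b) / (2 * z)

# z test for the difference between two independent estimates
z_diff <- (est_b - est_a) / sqrt(se_a^2 + se_b^2)
p_diff <- 2 * pnorm(-abs(z_diff))

overlap <- ci_a[2] >= ci_b[1] && ci_b[2] >= ci_a[1]
print(c(overlap = overlap, p_diff = round(p_diff, 2)))
```

The intervals overlap between 4.16 and 4.96, yet the P value for the difference rounds to 0.03, because the standard error of a difference is smaller than the sum of the two individual standard errors.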
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Bottom Line
Give the point estimate along with the 95% confidence interval. Say NOTHING about statistical significance. Write some kind of statement about the data’s compatibility with the model. For example, “the confidence limits, however, indicate that these data, although statistically compatible with no association, are even more compatible with a strong association — assuming that the statistical model used to construct the limits is correct.”