2 Review of Summary Statistic Functions and Formatting

2.1 Means and Standard Deviations

mean_sd returns the (arithmetic) mean and standard deviation for a numeric vector as a formatted character string. For example, mean_sd(mtcars2$mpg) returns the formatted string 20.09 ± 6.03. There are other options for formatting the character string:

mean_sd(mtcars2$mpg)
## [1] "20.09 &plusmn; 6.03"
mean_sd(mtcars2$mpg, denote_sd = "paren")
## [1] "20.09 (6.03)"
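As a rough illustration of what the formatted string contains, the following base R sketch builds both formats by hand. It assumes that mtcars2$mpg holds the same values as mtcars$mpg from base R. The helper name fmt_mean_sd is hypothetical and is not part of qwraps2.

```r
# Hypothetical helper (not part of qwraps2): format mean and sd by hand.
fmt_mean_sd <- function(x, digits = 2, paren = FALSE) {
  m <- formatC(mean(x), digits = digits, format = "f")
  s <- formatC(sd(x),   digits = digits, format = "f")
  if (paren) paste0(m, " (", s, ")") else paste0(m, " &plusmn; ", s)
}

# Assumption: mtcars$mpg matches mtcars2$mpg.
fmt_mean_sd(mtcars$mpg)
fmt_mean_sd(mtcars$mpg, paren = TRUE)
```

The &plusmn; entity is kept in the string on purpose, since the result is meant to be rendered as markup rather than read in the console.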

2.2 Means and Confidence Intervals

If you need the mean and a confidence interval, there is the function mean_ci, which returns a qwraps2_mean_ci object: a named vector with the mean, lower confidence limit, and upper confidence limit. The printing method for qwraps2_mean_ci objects is a call to the frmtci function. You can modify the formatting of the printed result by adjusting the arguments passed to frmtci.

mci <- mean_ci(mtcars2$mpg)
str(mci)
##  'qwraps2_mean_ci' Named num [1:3] 20.1 18 22.2
##  - attr(*, "names")= chr [1:3] "mean" "lcl" "ucl"
##  - attr(*, "alpha")= num 0.05
mci
## [1] "20.09 (18.00, 22.18)"
print(mci, show_level = TRUE)
## [1] "20.09 (95% CI: 18.00, 22.18)"
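The printed limits above are consistent with a normal-quantile interval around the mean. The base R sketch below reconstructs such an interval by hand. Whether mean_ci uses exactly this construction by default is an assumption here, as is the equality of mtcars$mpg and mtcars2$mpg.

```r
# Sketch: mean with a normal-quantile confidence interval, by hand.
x     <- mtcars$mpg          # assumption: same values as mtcars2$mpg
alpha <- 0.05
m     <- mean(x)
se    <- sd(x) / sqrt(length(x))   # standard error of the mean

ci <- c(mean = m,
        lcl  = m - qnorm(1 - alpha / 2) * se,
        ucl  = m + qnorm(1 - alpha / 2) * se)

ci_string <- sprintf("%.2f (%.2f, %.2f)", ci[["mean"]], ci[["lcl"]], ci[["ucl"]])
ci_string
```

Swapping qnorm for qt(1 - alpha / 2, df = length(x) - 1) would give a slightly wider t-based interval.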

2.3 Median and Interquartile Range

Similar to the mean_sd function, median_iqr returns the median and the interquartile range (IQR) of a data vector. The range is reported as the first and third quartiles.

median_iqr(mtcars2$mpg)
## [1] "19.20 (15.43, 22.80)"
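For reference, the three numbers in that string can be obtained in base R with quantile. This sketch assumes the default quantile type in R. Whether median_iqr relies on the same type is not stated here, although the printed values above agree with it.

```r
# Sketch: first quartile, median, and third quartile in base R.
x <- mtcars$mpg   # assumption: same values as mtcars2$mpg
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
q
```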

2.4 Count and Percentages

The n_perc function is the workhorse. n_perc0 is also provided for ease of use, in the same way that base R has paste and paste0. n_perc returns the n (%) with the percentage sign in the string; n_perc0 omits the percentage sign from the string and, as the output below shows, reports the percentage as a whole number. The latter is good for tables, the former for in-line text.

n_perc(mtcars2$cyl == 4)
## [1] "11 (34.38%)"
n_perc0(mtcars2$cyl == 4)
## [1] "11 (34)"

n_perc(mtcars2$cyl_factor == 4)  # this returns 0 (0.00%)
## [1] "0 (0.00%)"
n_perc(mtcars2$cyl_factor == "4 cylinders")
## [1] "11 (34.38%)"
n_perc(mtcars2$cyl_factor == levels(mtcars2$cyl_factor)[2])
## [1] "11 (34.38%)"

# The count and percentage of 4 or 6 cylinder vehicles in the data set is
n_perc(mtcars2$cyl %in% c(4, 6))
## [1] "18 (56.25%)"
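The calls above all take a logical vector, and the factor example shows why the comparison matters. A minimal base R sketch of the same idea follows. The helper count_pct is hypothetical, and cyl_factor is rebuilt from mtcars on the assumption that it mirrors mtcars2$cyl_factor.

```r
# Hypothetical helper (not part of qwraps2): count and percent of TRUE values.
count_pct <- function(x, digits = 2, show_symbol = TRUE) {
  pct <- formatC(100 * mean(x), digits = digits, format = "f")
  paste0(sum(x), " (", pct, if (show_symbol) "%" else "", ")")
}

count_pct(mtcars$cyl == 4)
count_pct(mtcars$cyl == 4, digits = 0, show_symbol = FALSE)

# Assumed reconstruction of the factor version of cyl.
cyl_factor <- factor(mtcars$cyl, levels = c(6, 4, 8),
                     labels = paste(c(6, 4, 8), "cylinders"))

# A factor is compared by its labels, so `== 4` matches nothing.
count_pct(cyl_factor == 4)
count_pct(cyl_factor == "4 cylinders")
```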

2.5 Geometric Means and Standard Deviations

Let $\left\{x_1, x_2, x_3, \ldots, x_n \right\}$ be a sample of size $n$ with $x_i > 0$ for all $i.$ Then the geometric mean, $\mu_g,$ and geometric standard deviation are

\[ \mu_g = \left( \prod_{i = 1}^{n} x_i \right)^{\frac{1}{n}} = b^{ \frac{1}{n} \sum_{i = 1}^{n} \log_{b} x_i }, \]
and
\[ \sigma_g = b ^ { \sqrt{ \frac{\sum_{i = 1}^{n} \left( \log_{b} \frac{x_i}{\mu_g} \right)^2}{n} } } \]
or, for clarity,
\[ \log_{b} \sigma_g = \sqrt{ \frac{\sum_{i = 1}^{n} \left( \log_{b} \frac{x_i}{\mu_g} \right)^2}{n}} \]

When looking for the geometric standard deviation in R, the simple exp(sd(log(x))) is not exactly correct. The geometric standard deviation uses $n,$ the full sample size, in the denominator, whereas the sd and var functions in R use the denominator $n - 1.$ To get the geometric standard deviation, adjust the result by multiplying the variance of the logs by $(n - 1) / n$ or the standard deviation of the logs by $\sqrt{(n - 1) / n}.$ See the example below.
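The adjustment factor follows directly from the definitions. As a short derivation, write $y_i = \log_b x_i,$ let $\bar{y}$ be the mean of the $y_i,$ and let $s$ be the usual sample standard deviation of the $y_i$ (the $n - 1$ denominator version). These symbols are introduced here only for the derivation.

```latex
\begin{align}
\log_b \mu_g &= \frac{1}{n} \sum_{i = 1}^{n} \log_b x_i = \bar{y}, \\
\log_b \frac{x_i}{\mu_g} &= y_i - \bar{y}, \\
\sum_{i = 1}^{n} \left( y_i - \bar{y} \right)^2 &= (n - 1) s^2, \\
\log_b \sigma_g &= \sqrt{\frac{(n - 1) s^2}{n}} = s \sqrt{\frac{n - 1}{n}}.
\end{align}
```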

x <- runif(6, min = 4, max = 70)

# geometric mean
mu_g <- prod(x) ** (1 / length(x))
mu_g
## [1] 46.50714
exp(mean(log(x)))
## [1] 46.50714
1.2 ** mean(log(x, base = 1.2))
## [1] 46.50714

# geometric standard deviation
exp(sd(log(x)))  ## This is wrong
## [1] 1.500247

# these equations are correct
sigma_g <- exp(sqrt(sum(log(x / mu_g) ** 2) / length(x)))
sigma_g
## [1] 1.448151

exp(sqrt((length(x) - 1) / length(x)) * sd(log(x)))
## [1] 1.448151

The functions gmean, gvar, and gsd provide the geometric mean, variance, and standard deviation for a numeric vector.

gmean(x)
## [1] 46.50714
all.equal(gmean(x), mu_g)
## [1] TRUE

gvar(x)
## [1] 1.146958
all.equal(gvar(x), sigma_g^2)  # This is supposed to be FALSE
## [1] "Mean relative difference: 0.8284385"
all.equal(gvar(x), exp(log(sigma_g)^2))
## [1] TRUE

gsd(x)
## [1] 1.448151
all.equal(gsd(x), sigma_g)
## [1] TRUE
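The checks above suggest that the geometric variance is the exponential of the variance of the logs (with $n$ in the denominator), not the square of the geometric standard deviation. The base R sketch below shows that relationship on a small fixed vector. The values in x are arbitrary, and the variable names are chosen only for this illustration.

```r
# Sketch: geometric sd and variance from the log-scale variance.
x <- c(4.2, 11.5, 27, 38.9, 52.3, 69.1)   # arbitrary positive values
n <- length(x)

# Variance of log(x) with denominator n rather than n - 1.
log_var_n <- sum((log(x) - mean(log(x)))^2) / n

g_sd  <- exp(sqrt(log_var_n))   # geometric standard deviation
g_var <- exp(log_var_n)         # geometric variance

all.equal(g_var, exp(log(g_sd)^2))   # same quantity written two ways
isTRUE(all.equal(g_var, g_sd^2))     # squaring the geometric sd differs
```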

gmean_sd will provide a quick way for reporting the geometric mean and geometric standard deviation in the same way that mean_sd does for the arithmetic mean and arithmetic standard deviation:

gmean_sd(x)
## [1] "46.51 &plusmn; 1.45"

3 Building a Data Summary Table

Objective: build a table reporting summary statistics for some of the variables in the mtcars2 data.frame overall and within subgroups. We’ll start with something very simple and build up to something bigger.

Let’s report the min, max, and mean (sd) for continuous variables and n (%) for categorical variables. We will report mpg, displacement (disp), weight (wt), and gear overall and by number of cylinders and transmission type.

END USER VISIBLE CHANGE: for qwraps2 versions before 0.4.2, the summary_table method relied on *[dplyr](https://cran.r-project.org/package=dplyr)* verbs for the implementation and end user specifications. This created several limitations and required what could be considered a non-intuitive API due to the use of the rlang data pronoun .data. Building a table with the summary based on a grouping, e.g., mpg by number of cylinders, was achieved by using dplyr::group_by to specify the group. Further, only one grouping variable was supported. Starting with version 0.5.0, the implementation of summary_table and qsummary is based on base R methods. The change in the implementation will make it easier for all users, as the use of the tidyverse is no longer required or assumed. The use of dplyr::group_by is still supported, and has been improved.

There are two changes to the API:
1. Use of the data pronoun .data is no longer recommended. In fact, it is now discouraged. There is a test in place in version 0.5.0 which will provide a message to this effect.
2. A new function argument, by, has been added to the summary_table method such that the use of dplyr::group_by is no longer needed.

To use summary_table, first define a summary, that is, a list-of-lists of formulas for summarizing the data.frame.

The inner lists are named formulae defining the wanted summary. The names are important, as they are used to label row groups and row names in the table.

our_summary1 <-
  list("Miles Per Gallon" =
       list("min"       = ~ min(mpg),
            "max"       = ~ max(mpg),
            "mean (sd)" = ~ qwraps2::mean_sd(mpg)),
       "Displacement" =
       list("min"       = ~ min(disp),
            "median"    = ~ median(disp),
            "max"       = ~ max(disp),
            "mean (sd)" = ~ qwraps2::mean_sd(disp)),
       "Weight (1000 lbs)" =
       list("min"       = ~ min(wt),
            "max"       = ~ max(wt),
            "mean (sd)" = ~ qwraps2::mean_sd(wt)),
       "Forward Gears" =
       list("Three" = ~ qwraps2::n_perc0(gear == 3),
            "Four"  = ~ qwraps2::n_perc0(gear == 4),
            "Five"  = ~ qwraps2::n_perc0(gear == 5))
       )
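To make the list-of-lists structure more concrete, here is a small base R sketch of how one-sided formulas like these can be evaluated against a data.frame. It is only an illustration of the idea and not a description of how summary_table is implemented. The helper eval_formula and the reduced spec are hypothetical, and mtcars stands in for mtcars2.

```r
# A reduced summary specification in the same list-of-lists shape.
spec <- list(
  "Miles Per Gallon"  = list("min" = ~ min(mpg), "max" = ~ max(mpg)),
  "Weight (1000 lbs)" = list("min" = ~ min(wt),  "max" = ~ max(wt))
)

# Hypothetical helper: evaluate the right-hand side of a one-sided formula
# with the columns of `data` in scope. f[[2]] is the expression after `~`.
eval_formula <- function(f, data) {
  eval(f[[2]], envir = data, enclos = environment(f))
}

# Outer names become row groups, inner names become row labels.
result <- lapply(spec, function(grp) lapply(grp, eval_formula, data = mtcars))
str(result)
```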

Building the table is done with a call to summary_table:

### Overall
whole <- summary_table(mtcars2, our_summary1)
whole

	mtcars2 (N = 32)
Miles Per Gallon
min	10.4
max	33.9
mean (sd)	20.09 ± 6.03
Displacement
min	71.1
median	196.3
max	472
mean (sd)	230.72 ± 123.94
Weight (1000 lbs)
min	1.513
max	5.424
mean (sd)	3.22 ± 0.98
Forward Gears
Three	15 (47)
Four	12 (38)
Five	5 (16)

If a grouped_df created by a call to dplyr::group_by is passed to summary_table then the resulting table will have one column for each group.

### By number of Cylinders
by_cyl <- summary_table(dplyr::group_by(mtcars2, cyl_factor), our_summary1)
by_cyl

	6 cylinders (N = 7)	4 cylinders (N = 11)	8 cylinders (N = 14)
Miles Per Gallon
min	17.8	21.4	10.4
max	21.4	33.9	19.2
mean (sd)	19.74 ± 1.45	26.66 ± 4.51	15.10 ± 2.56
Displacement
min	145	71.1	275.8
median	167.6	108	350.5
max	258	146.7	472
mean (sd)	183.31 ± 41.56	105.14 ± 26.87	353.10 ± 67.77
Weight (1000 lbs)
min	2.62	1.513	3.17
max	3.46	3.19	5.424
mean (sd)	3.12 ± 0.36	2.29 ± 0.57	4.00 ± 0.76
Forward Gears
Three	2 (29)	1 (9)	12 (86)
Four	4 (57)	8 (73)	0 (0)
Five	1 (14)	2 (18)	2 (14)

If you are not working in the tidyverse you can explicitly define the variables in the data.frame to group by, e.g.,

summary_table(mtcars2, summaries = our_summary1, by = c("cyl_factor"))

	6 cylinders (N = 7)	4 cylinders (N = 11)	8 cylinders (N = 14)
Miles Per Gallon
min	17.8	21.4	10.4
max	21.4	33.9	19.2
mean (sd)	19.74 ± 1.45	26.66 ± 4.51	15.10 ± 2.56
Displacement
min	145	71.1	275.8
median	167.6	108	350.5
max	258	146.7	472
mean (sd)	183.31 ± 41.56	105.14 ± 26.87	353.10 ± 67.77
Weight (1000 lbs)
min	2.62	1.513	3.17
max	3.46	3.19	5.424
mean (sd)	3.12 ± 0.36	2.29 ± 0.57	4.00 ± 0.76
Forward Gears
Three	2 (29)	1 (9)	12 (86)
Four	4 (57)	8 (73)	0 (0)
Five	1 (14)	2 (18)	2 (14)
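Conceptually, grouping by a variable amounts to splitting the data.frame and computing the same summaries within each piece. The base R sketch below shows that idea for mpg. It uses mtcars$cyl in place of mtcars2$cyl_factor, so the columns come out in numeric order (4, 6, 8) rather than in the factor order used in the table above, and it is not meant to represent the internals of summary_table.

```r
# Sketch: per-group summaries via split() in base R.
groups <- split(mtcars, mtcars$cyl)

by_cyl_sketch <- sapply(groups, function(d) {
  c(N = nrow(d), min = min(d$mpg), max = max(d$mpg))
})
by_cyl_sketch
```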

With the refactor of the summary_table method in version 0.5.0, it is easier to group by multiple variables. For example, to get a column for each combination of cylinders and transmission type:

by_cyl_am <- summary_table(mtcars2, summaries = our_summary1, by = c("cyl_factor", "am"))
by_cyl_am
## 
## 
## |                       |6 cylinders.0 (N = 4) |4 cylinders.0 (N = 3) |8 cylinders.0 (N = 12) |6 cylinders.1 (N = 3) |4 cylinders.1 (N = 8) |8 cylinders.1 (N = 2) |
## |:----------------------|:---------------------|:---------------------|:----------------------|:---------------------|:---------------------|:---------------------|
## |**Miles Per Gallon**   |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;           |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;          |
## |&nbsp;&nbsp; min       |17.8                  |21.5                  |10.4                   |19.7                  |21.4                  |15                    |
## |&nbsp;&nbsp; max       |21.4                  |24.4                  |19.2                   |21                    |33.9                  |15.8                  |
## |&nbsp;&nbsp; mean (sd) |19.12 &plusmn; 1.63   |22.90 &plusmn; 1.45   |15.05 &plusmn; 2.77    |20.57 &plusmn; 0.75   |28.07 &plusmn; 4.48   |15.40 &plusmn; 0.57   |
## |**Displacement**       |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;           |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;          |
## |&nbsp;&nbsp; min       |167.6                 |120.1                 |275.8                  |145                   |71.1                  |301                   |
## |&nbsp;&nbsp; median    |196.3                 |140.8                 |355                    |160                   |87.05                 |326                   |
## |&nbsp;&nbsp; max       |258                   |146.7                 |472                    |160                   |121                   |351                   |
## |&nbsp;&nbsp; mean (sd) |204.55 &plusmn; 44.74 |135.87 &plusmn; 13.97 |357.62 &plusmn; 71.82  |155.00 &plusmn; 8.66  |93.61 &plusmn; 20.48  |326.00 &plusmn; 35.36 |
## |**Weight (1000 lbs)**  |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;           |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;          |
## |&nbsp;&nbsp; min       |3.215                 |2.465                 |3.435                  |2.62                  |1.513                 |3.17                  |
## |&nbsp;&nbsp; max       |3.46                  |3.19                  |5.424                  |2.875                 |2.78                  |3.57                  |
## |&nbsp;&nbsp; mean (sd) |3.39 &plusmn; 0.12    |2.94 &plusmn; 0.41    |4.10 &plusmn; 0.77     |2.75 &plusmn; 0.13    |2.04 &plusmn; 0.41    |3.37 &plusmn; 0.28    |
## |**Forward Gears**      |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;           |&nbsp;&nbsp;          |&nbsp;&nbsp;          |&nbsp;&nbsp;          |
## |&nbsp;&nbsp; Three     |2 (50)                |1 (33)                |12 (100)               |0 (0)                 |0 (0)                 |0 (0)                 |
## |&nbsp;&nbsp; Four      |2 (50)                |2 (67)                |0 (0)                  |2 (67)                |6 (75)                |0 (0)                 |
## |&nbsp;&nbsp; Five      |0 (0)                 |0 (0)                 |0 (0)                  |1 (33)                |2 (25)                |2 (100)               |

A quick check here shows that the multiple grouping via dplyr is the same as above.

all.equal(summary_table(dplyr::group_by(mtcars2, cyl_factor, am), summaries = our_summary1),
          by_cyl_am)
## [1] TRUE
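The column headers in the multi-variable table have the form "level.level", which is the naming that interaction produces in base R. As a quick sanity check of the group sizes, the sketch below counts the cylinder and transmission combinations directly, using mtcars$cyl and mtcars$am as stand-ins for the mtcars2 columns.

```r
# Sketch: count rows in each cylinder-by-transmission combination.
combo  <- interaction(mtcars$cyl, mtcars$am)
counts <- table(combo)
counts
```

The counts can be compared against the N values in the column headers of the table above.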

One more note: if you pass a grouped_df to summary_table while also specifying the by argument, a warning will be thrown and the grouped_df groupings will take precedence.

summary_table(dplyr::group_by(mtcars2, carb), summaries = our_summary1, by = c("cyl_factor", "am"))
## Warning in summary_table.grouped_df(dplyr::group_by(mtcars2, carb), summaries =
## our_summary1, : You've passed a grouped_df to summary_table and specified the by
## argument. The by argument will be ignored.

	1 (N = 7)	2 (N = 10)	3 (N = 3)	4 (N = 10)	6 (N = 1)	8 (N = 1)
Miles Per Gallon
min	18.1	15.2	15.2	10.4	19.7	15
max	33.9	30.4	17.3	21	19.7	15
mean (sd)	25.34 ± 6.00	22.40 ± 5.47	16.30 ± 1.05	15.79 ± 3.91	19.70 ± NA	15.00 ± NA
Displacement
min	71.1	75.7	275.8	160	145	301
median	108	143.75	275.8	350.5	145	301
max	258	400	275.8	472	145	301
mean (sd)	134.27 ± 75.90	208.16 ± 122.50	275.80 ± 0.00	308.82 ± 132.06	145.00 ± NA	301.00 ± NA
Weight (1000 lbs)
min	1.835	1.513	3.73	2.62	2.77	3.57
max	3.46	3.845	4.07	5.424	2.77	3.57
mean (sd)	2.49 ± 0.62	2.86 ± 0.83	3.86 ± 0.18	3.90 ± 1.05	2.77 ± NA	3.57 ± NA
Forward Gears
Three	3 (43)	4 (40)	3 (100)	5 (50)	0 (0)	0 (0)
Four	4 (57)	4 (40)	0 (0)	4 (40)	0 (0)	0 (0)
Five	0 (0)	2 (20)	0 (0)	1 (10)	1 (100)	1 (100)

To report a table with both the whole sample summary and conditional columns together:

both <- cbind(whole, by_cyl)
both

	mtcars2 (N = 32)	6 cylinders (N = 7)	4 cylinders (N = 11)	8 cylinders (N = 14)
Miles Per Gallon
min	10.4	17.8	21.4	10.4
max	33.9	21.4	33.9	19.2
mean (sd)	20.09 ± 6.03	19.74 ± 1.45	26.66 ± 4.51	15.10 ± 2.56
Displacement
min	71.1	145	71.1	275.8
median	196.3	167.6	108	350.5
max	472	258	146.7	472
mean (sd)	230.72 ± 123.94	183.31 ± 41.56	105.14 ± 26.87	353.10 ± 67.77
Weight (1000 lbs)
min	1.513	2.62	1.513	3.17
max	5.424	3.46	3.19	5.424
mean (sd)	3.22 ± 0.98	3.12 ± 0.36	2.29 ± 0.57	4.00 ± 0.76
Forward Gears
Three	15 (47)	2 (29)	1 (9)	12 (86)
Four	12 (38)	4 (57)	8 (73)	0 (0)
Five	5 (16)	1 (14)	2 (18)	2 (14)

To change the column names, pass the cnames argument to qable through the print method for qwraps2_summary_table objects. Any argument that you want to send to qable can be passed this way when explicitly calling the print method.

print(both,
      rtitle = "Summary Statistics",
      cnames = c("Col 0", "Col 1", "Col 2", "Col 3"))

Summary Statistics	Col 0	Col 1	Col 2	Col 3
Miles Per Gallon
min	10.4	17.8	21.4	10.4
max	33.9	21.4	33.9	19.2
mean (sd)	20.09 ± 6.03	19.74 ± 1.45	26.66 ± 4.51	15.10 ± 2.56
Displacement
min	71.1	145	71.1	275.8
median	196.3	167.6	108	350.5
max	472	258	146.7	472
mean (sd)	230.72 ± 123.94	183.31 ± 41.56	105.14 ± 26.87	353.10 ± 67.77
Weight (1000 lbs)
min	1.513	2.62	1.513	3.17
max	5.424	3.46	3.19	5.424
mean (sd)	3.22 ± 0.98	3.12 ± 0.36	2.29 ± 0.57	4.00 ± 0.76
Forward Gears
Three	15 (47)	2 (29)	1 (9)	12 (86)
Four	12 (38)	4 (57)	8 (73)	0 (0)
Five	5 (16)	1 (14)	2 (18)	2 (14)

3.1 Easy building of the summaries

The task of building the summaries list-of-lists can be tedious. The function qsummary is designed to make it easier. qsummary will use a set of predefined functions to summarize numeric columns of a data.frame and a set of arguments to pass to n_perc for categorical (character and factor) variables.

By default, calling summary_table will use the default summary metrics defined by qsummary. The purpose of qsummary is to provide the same summary for all numeric variables within a data.frame and a single style of summary for categorical variables within the data.frame. For example, the default summary for a set of variables from the mtcars2 data set is

qsummary(mtcars2[, c("mpg", "cyl_factor", "wt")])
## $mpg
## $mpg$minimum
## ~qwraps2::frmt(min(mpg))
## 
## $mpg$`median (IQR)`
## ~qwraps2::median_iqr(mpg)
## 
## $mpg$`mean (sd)`
## ~qwraps2::mean_sd(mpg)
## 
## $mpg$maximum
## ~qwraps2::frmt(max(mpg))
## 
## 
## $cyl_factor
## $cyl_factor$`6 cylinders`
## ~qwraps2::n_perc(cyl_factor == "6 cylinders", digits = 0, show_symbol = FALSE)
## 
## $cyl_factor$`4 cylinders`
## ~qwraps2::n_perc(cyl_factor == "4 cylinders", digits = 0, show_symbol = FALSE)
## 
## $cyl_factor$`8 cylinders`
## ~qwraps2::n_perc(cyl_factor == "8 cylinders", digits = 0, show_symbol = FALSE)
## 
## 
## $wt
## $wt$minimum
## ~qwraps2::frmt(min(wt))
## 
## $wt$`median (IQR)`
## ~qwraps2::median_iqr(wt)
## 
## $wt$`mean (sd)`
## ~qwraps2::mean_sd(wt)
## 
## $wt$maximum
## ~qwraps2::frmt(max(wt))

That default summary is used for a table as follows:

summary_table(mtcars2[, c("mpg", "cyl_factor", "wt")])

	mtcars2[, c("mpg", "cyl_factor", "wt")] (N = 32)
mpg
minimum	10.40
median (IQR)	19.20 (15.43, 22.80)
mean (sd)	20.09 ± 6.03
maximum	33.90
cyl_factor
6 cylinders	7 (22)
4 cylinders	11 (34)
8 cylinders	14 (44)
wt
minimum	1.51
median (IQR)	3.33 (2.58, 3.61)
mean (sd)	3.22 ± 0.98
maximum	5.42

Now, say we want to report only the minimum and maximum for each of the numeric variables, and for the categorical variables we want to show the denominator for each category and the percentage to one digit, with the percent symbol, in the table. Note that when defining the list of numeric_summaries, the argument placeholder is the %s character.

new_summary <-
  qsummary(mtcars2[, c("mpg", "cyl_factor", "wt")],
           numeric_summaries = list("Minimum" = "~ min(%s)",
                                    "Maximum" = "~ max(%s)"),
           n_perc_args = list(digits = 1, show_symbol = TRUE, show_denom = "always"))
str(new_summary)
## List of 3
##  $ mpg       :List of 2
##   ..$ Minimum:Class 'formula'  language ~min(mpg)
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   ..$ Maximum:Class 'formula'  language ~max(mpg)
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  $ cyl_factor:List of 3
##   ..$ 6 cylinders:Class 'formula'  language ~qwraps2::n_perc(cyl_factor == "6 cylinders", digits = 1, show_symbol = TRUE,      show_denom = "always")
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   ..$ 4 cylinders:Class 'formula'  language ~qwraps2::n_perc(cyl_factor == "4 cylinders", digits = 1, show_symbol = TRUE,      show_denom = "always")
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   ..$ 8 cylinders:Class 'formula'  language ~qwraps2::n_perc(cyl_factor == "8 cylinders", digits = 1, show_symbol = TRUE,      show_denom = "always")
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  $ wt        :List of 2
##   ..$ Minimum:Class 'formula'  language ~min(wt)
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   ..$ Maximum:Class 'formula'  language ~max(wt)
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
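The %s placeholder behaves like a sprintf template: the variable name is substituted into the string, and the string is then turned into a formula. The base R sketch below illustrates that mechanism. The helper make_formulas is hypothetical, and this is an assumption about the general approach rather than a copy of the qsummary source.

```r
# Templates in the same style as numeric_summaries above.
templates <- list("Minimum" = "~ min(%s)", "Maximum" = "~ max(%s)")

# Hypothetical helper: fill in the variable name and build formulas.
make_formulas <- function(var, templates) {
  lapply(templates, function(tpl) as.formula(sprintf(tpl, var)))
}

fs <- make_formulas("mpg", templates)
fs$Minimum

# The right-hand side can then be evaluated against a data.frame.
eval(fs$Minimum[[2]], mtcars)
eval(fs$Maximum[[2]], mtcars)
```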

The resulting table is:

summary_table(mtcars2, new_summary)

	mtcars2 (N = 32)
mpg
Minimum	10.4
Maximum	33.9
cyl_factor
6 cylinders	7/32 (21.9%)
4 cylinders	11/32 (34.4%)
8 cylinders	14/32 (43.8%)
wt
Minimum	1.513
Maximum	5.424

The summary can easily be used with a by argument:

summary_table(mtcars2, new_summary, by = c("cyl_factor"))

	6 cylinders (N = 7)	4 cylinders (N = 11)	8 cylinders (N = 14)
mpg
Minimum	17.8	21.4	10.4
Maximum	21.4	33.9	19.2
cyl_factor
6 cylinders	7/7 (100.0%)	0/11 (0.0%)	0/14 (0.0%)
4 cylinders	0/7 (0.0%)	11/11 (100.0%)	0/14 (0.0%)
8 cylinders	0/7 (0.0%)	0/11 (0.0%)	14/14 (100.0%)
wt
Minimum	2.62	1.513	3.17
Maximum	3.46	3.19	5.424

3.2 Adding P-values to a Summary Table

There are many different ways to format data summary tables, and adding p-values is one task that can be done in more than one way. For example, if a row group reports the counts and percentages for each level of a categorical variable across multiple (column) groups, then I would argue that the p-value from a chi-square test or a Fisher exact test is best placed on the line of the table labeling the row group. However, say we reported the minimum, median, mean, and maximum within a row group for one variable. I would suggest that the p-value from a t-test, or other meaningful test, for the difference in means be reported on the line of the summary table for the mean, not on the row group itself.

With so many possibilities, I have left construction of a p-value column ad hoc. Perhaps an additional column would not be used at all and the p-values would be edited into the row group labels instead.

If you want to add a p-value column to a qwraps2_summary_table object you can with some degree of ease. Note that qwraps2_summary_table objects are just character matrices.

str(both)
##  'qwraps2_summary_table' chr [1:13, 1:4] "10.4" "33.9" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:13] "min" "max" "mean (sd)" "min" ...
##   ..$ : chr [1:4] "mtcars2 (N = 32)" "6 cylinders (N = 7)" "4 cylinders (N = 11)" "8 cylinders (N = 14)"
##  - attr(*, "rgroups")= Named int [1:4] 3 4 3 3
##   ..- attr(*, "names")= chr [1:4] "Miles Per Gallon" "Displacement" "Weight (1000 lbs)" "Forward Gears"

Let’s add p-values for testing the difference in the mean between the three cylinder groups.

# difference in means
mpvals <-
  sapply(
         list(lm(mpg ~ cyl_factor,  data = mtcars2),
              lm(disp ~ cyl_factor, data = mtcars2),
              lm(wt ~ cyl_factor,   data = mtcars2)),
         extract_fpvalue)

# Fisher test
fpval <- frmtp(fisher.test(table(mtcars2$gear, mtcars2$cyl_factor))$p.value)
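For readers who want to see where such a p-value comes from, the overall F-test p-value of a linear model can be computed in base R from the F statistic stored in the model summary. The sketch below does this for mpg by cylinder group, with factor(cyl) from mtcars standing in for cyl_factor. That extract_fpvalue works along these lines is an assumption. The sketch returns the raw numeric p-value and does not reproduce any formatting.

```r
# Sketch: overall F-test p-value for a linear model, in base R.
fit   <- lm(mpg ~ factor(cyl), data = mtcars)
fstat <- summary(fit)$fstatistic   # named: value, numdf, dendf

p_value <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
              lower.tail = FALSE)
p_value
```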

Adding the p-value column is done as follows:

both <- cbind(both, "P-value" = "")
both[grepl("mean \\(sd\\)", rownames(both)), "P-value"] <- mpvals
a <- capture.output(print(both))
a[grepl("Forward Gears", a)] <-
  sub("&nbsp;&nbsp;\\ \\|$", paste(fpval, "|"), a[grepl("Forward Gears", a)])

and the resulting table is:

cat(a, sep = "\n")

	mtcars2 (N = 32)	6 cylinders (N = 7)	4 cylinders (N = 11)	8 cylinders (N = 14)	P-value
Miles Per Gallon
min	10.4	17.8	21.4	10.4
max	33.9	21.4	33.9	19.2
mean (sd)	20.09 ± 6.03	19.74 ± 1.45	26.66 ± 4.51	15.10 ± 2.56	P < 0.0001
Displacement
min	71.1	145	71.1	275.8
median	196.3	167.6	108	350.5
max	472	258	146.7	472
mean (sd)	230.72 ± 123.94	183.31 ± 41.56	105.14 ± 26.87	353.10 ± 67.77	P < 0.0001
Weight (1000 lbs)
min	1.513	2.62	1.513	3.17
max	5.424	3.46	3.19	5.424
mean (sd)	3.22 ± 0.98	3.12 ± 0.36	2.29 ± 0.57	4.00 ± 0.76	P < 0.0001
Forward Gears					P < 0.0001
Three	15 (47)	2 (29)	1 (9)	12 (86)
Four	12 (38)	4 (57)	8 (73)	0 (0)
Five	5 (16)	1 (14)	2 (18)	2 (14)
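Because the summary table is a character matrix, the column-append step used above can be tried on any small character matrix. The sketch below uses a made-up two-group matrix and placeholder strings in place of real p-values. None of the names or values come from an actual analysis.

```r
# A small character matrix standing in for a summary table.
tab <- matrix(c("10.4", "20.09 (6.03)", "1.513", "3.22 (0.98)"),
              ncol = 1,
              dimnames = list(c("min", "mean (sd)", "min", "mean (sd)"),
                              "Overall"))

# Append an empty column, then fill only the "mean (sd)" rows.
tab <- cbind(tab, "P-value" = "")
tab[grepl("mean \\(sd\\)", rownames(tab)), "P-value"] <-
  c("p-value A", "p-value B")   # placeholders, not computed values
tab
```

Rows that do not match the pattern keep the empty string, which is what leaves the min and max lines blank in the rendered table.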

Another option you might consider is to have the p-value in the row group name. Consider the following construction. The p-values are added to the names of the row groups when building the summary table.

gear_summary <-
  list("Forward Gears" =
       list("Three" = ~ qwraps2::n_perc0(gear == 3),
            "Four"  = ~ qwraps2::n_perc0(gear == 4),
            "Five"  = ~ qwraps2::n_perc0(gear == 5)),
       "Transmission" =
       list("Automatic" = ~ qwraps2::n_perc0(am == 0),
            "Manual"    = ~ qwraps2::n_perc0(am == 1))
       )

gear_summary <-
setNames(gear_summary,
         c(
         paste("Forward Gears: ", frmtp(fisher.test(xtabs( ~ gear + cyl_factor, data = mtcars2))$p.value)),
         paste("Transmission: ",  frmtp(fisher.test(xtabs( ~ am + cyl_factor, data = mtcars2))$p.value)))
         )

summary_table(mtcars2, gear_summary, by = "cyl_factor")

	6 cylinders (N = 7)	4 cylinders (N = 11)	8 cylinders (N = 14)
Forward Gears: P < 0.0001
Three	2 (29)	1 (9)	12 (86)
Four	4 (57)	8 (73)	0 (0)
Five	1 (14)	2 (18)	2 (14)
Transmission: P = 0.0091
Automatic	4 (57)	3 (27)	12 (86)
Manual	3 (43)	8 (73)	2 (14)

3.3 Using Variable Labels

Some data management paradigms use attributes to keep a label associated with a variable in a data.frame. Notable examples are the *[Hmisc](https://cran.r-project.org/package=Hmisc)* and *[sjPlot](https://cran.r-project.org/package=sjPlot)* packages. If you associate a label with a variable in the data frame, then that label will be used when building a summary table. This feature was suggested in https://github.com/dewittpe/qwraps2/issues/74 and implemented as follows:

new_data_frame <-
  data.frame(age = c(18, 20, 24, 17, 43),
             edu = c(1, 3, 1, 5, 2),
             rt  = c(0.01, 0.04, 0.02, 0.10, 0.06))

# Set a label for the variables
attr(new_data_frame$age, "label") <- "Age in years"
attr(new_data_frame$rt,  "label") <- "Reaction time"

# mistakenly set the attribute to name instead of label
attr(new_data_frame$edu, "name") <- "Education"

When calling qsummary, the provided labels for the age and rt variables will be used. Since the attribute “label” does not exist for the edu variable, edu will be used in the output.

qsummary(new_data_frame)
## $`Age in years`
## $`Age in years`$minimum
## ~qwraps2::frmt(min(age))
## 
## $`Age in years`$`median (IQR)`
## ~qwraps2::median_iqr(age)
## 
## $`Age in years`$`mean (sd)`
## ~qwraps2::mean_sd(age)
## 
## $`Age in years`$maximum
## ~qwraps2::frmt(max(age))
## 
## 
## $edu
## $edu$minimum
## ~qwraps2::frmt(min(edu))
## 
## $edu$`median (IQR)`
## ~qwraps2::median_iqr(edu)
## 
## $edu$`mean (sd)`
## ~qwraps2::mean_sd(edu)
## 
## $edu$maximum
## ~qwraps2::frmt(max(edu))
## 
## 
## $`Reaction time`
## $`Reaction time`$minimum
## ~qwraps2::frmt(min(rt))
## 
## $`Reaction time`$`median (IQR)`
## ~qwraps2::median_iqr(rt)
## 
## $`Reaction time`$`mean (sd)`
## ~qwraps2::mean_sd(rt)
## 
## $`Reaction time`$maximum
## ~qwraps2::frmt(max(rt))
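The fallback rule described above (use the "label" attribute when present, otherwise the column name) can be sketched in base R as follows. The helper label_or_name is hypothetical and only illustrates the rule. It is not taken from the qwraps2 source.

```r
df <- data.frame(age = c(18, 20, 24, 17, 43),
                 edu = c(1, 3, 1, 5, 2),
                 rt  = c(0.01, 0.04, 0.02, 0.10, 0.06))
attr(df$age, "label") <- "Age in years"
attr(df$rt,  "label") <- "Reaction time"
attr(df$edu, "name")  <- "Education"   # not a "label", so it is ignored below

# Hypothetical helper: label attribute if set, column name otherwise.
label_or_name <- function(data) {
  vapply(names(data), function(nm) {
    lab <- attr(data[[nm]], "label", exact = TRUE)
    if (is.null(lab)) nm else lab
  }, character(1), USE.NAMES = FALSE)
}

label_or_name(df)
```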

This behavior is also seen with the summary_table call:

summary_table(new_data_frame)

	new_data_frame (N = 5)
Age in years
minimum	17.00
median (IQR)	20.00 (18.00, 24.00)
mean (sd)	24.40 ± 10.74
maximum	43.00
edu
minimum	1.00
median (IQR)	2.00 (1.00, 3.00)
mean (sd)	2.40 ± 1.67
maximum	5.00
Reaction time
minimum	0.01
median (IQR)	0.04 (0.02, 0.06)
mean (sd)	0.05 ± 0.04
maximum	0.10

Formatted Summary Statistics and Data Summary Tables with qwraps2

Peter DeWitt

1 Introduction

1.1 Prerequisites Example Data Set