In this vignette we would like to discuss the similarities and differences between dplyr
and rtable
.
Much of the rtables
framework focuses on tabulation/summarizing of data and then the visualization of the table. In this vignette we focus on summarizing data using dplyr
and contrast it to rtables
. We won’t pay attention to the table visualizaion/markup and just derive the cell content.
Using dplyr
to summarize data and gt
to visualize the table is a good way if the tabulation is of a certain nature or complexity. However, there are tables such as the table created in the introduction
vignette that take some effort to create with dplyr
. Part of the effort is due to fact that when using dplyr the table data is stored in data.frame
s or tibble
s which is not the most natural way to represent a table as we will show in this vignette.
If you know a more elegant way of deriving the table content with dplyr
please let us know and we will update the vignette.
library(rtables)
library(dplyr)
Here is table and data used in the introduction
vignette:
<- 400
n
set.seed(1)
<- tibble(
df arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")),
country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")),
gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")),
handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")),
age = rchisq(n, 30) + 10
%>% mutate(
) weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180)
)
basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
add_colcounts() %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
split_rows_by("handed") %>%
summarize_row_groups() %>%
analyze("age", afun = mean, format = "xx.x") %>%
build_table(df)
Arm A Arm B
Female Male Female Male
(N=96) (N=105) (N=92) (N=107)
————————————————————————————————————————————————————————————
CAN 45 (46.9%) 64 (61.0%) 46 (50.0%) 62 (57.9%)
Left 32 (33.3%) 42 (40.0%) 26 (28.3%) 37 (34.6%)
mean 38.9 40.4 40.3 37.7
Right 13 (13.5%) 22 (21.0%) 20 (21.7%) 25 (23.4%)
mean 36.6 40.2 40.2 40.6
USA 51 (53.1%) 41 (39.0%) 46 (50.0%) 45 (42.1%)
Left 34 (35.4%) 19 (18.1%) 25 (27.2%) 25 (23.4%)
mean 40.4 39.7 39.2 40.1
Right 17 (17.7%) 22 (21.0%) 21 (22.8%) 20 (18.7%)
mean 36.9 39.8 38.5 39.0
We will start by deriving the first data cell on row 3 (note, row 1 and 2 have content cells, see the introduction
vignette). Cell 3,1 contains the mean age for left handed & female Canadians in “Arm A”:
mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female" & df$handed == "Left"])
[1] 38.86979
or with dplyr
:
%>%
df filter(country == "CAN", arm == "Arm A", gender == "Female", handed == "Left") %>%
summarise(mean_age = mean(age))
# A tibble: 1 × 1
mean_age
<dbl>
1 38.9
Further, dplyr
gives us other verbs to easily get the average age of left handed Canadians for each group defined by the 4 columns:
%>%
df group_by(arm, gender) %>%
filter(country == "CAN", handed == "Left") %>%
summarise(mean_age = mean(age))
`summarise()` has grouped output by 'arm'. You can override using the `.groups`
argument.
# A tibble: 4 × 3
# Groups: arm [2]
arm gender mean_age
<fct> <fct> <dbl>
1 Arm A Female 38.9
2 Arm A Male 40.4
3 Arm B Female 40.3
4 Arm B Male 37.7
We can further get to all the average age cell values with:
<- df %>%
average_age group_by(arm, gender, country, handed) %>%
summarise(mean_age = mean(age))
`summarise()` has grouped output by 'arm', 'gender', 'country'. You can
override using the `.groups` argument.
average_age
# A tibble: 16 × 5
# Groups: arm, gender, country [8]
arm gender country handed mean_age
<fct> <fct> <fct> <fct> <dbl>
1 Arm A Female CAN Left 38.9
2 Arm A Female CAN Right 36.6
3 Arm A Female USA Left 40.4
4 Arm A Female USA Right 36.9
5 Arm A Male CAN Left 40.4
6 Arm A Male CAN Right 40.2
7 Arm A Male USA Left 39.7
8 Arm A Male USA Right 39.8
9 Arm B Female CAN Left 40.3
10 Arm B Female CAN Right 40.2
11 Arm B Female USA Left 39.2
12 Arm B Female USA Right 38.5
13 Arm B Male CAN Left 37.7
14 Arm B Male CAN Right 40.6
15 Arm B Male USA Left 40.1
16 Arm B Male USA Right 39.0
In rtable
syntax we need the following code to get to the same content:
basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
split_rows_by("country") %>%
split_rows_by("handed") %>%
analyze("age", afun = mean, format = "xx.x") %>%
build_table(df)
Arm A Arm B
Female Male Female Male
————————————————————————————————————————
CAN
Left
mean 38.9 40.4 40.3 37.7
Right
mean 36.6 40.2 40.2 40.6
USA
Left
mean 40.4 39.7 39.2 40.1
Right
mean 36.9 39.8 38.5 39.0
As mentioned in the introduction to this vignette, please ignore the difference in arranging and formatting the data: it’s possible to condense the rtable
more and it is possible to make the tibble look more like the reference table using the gt
R package.
In terms of tabulation for this example there was arguably not much added by rtables
over dplyr
.
We will now focus on the count percentage information for handedness for each country. For that we need to get the correct denominator for the percentages
%>%
df group_by(arm, gender, country, handed) %>%
summarize(mean = mean(age), c_h_count = n()) %>%
# now the `handed` grouping has been removed, therefore we can calculate percent now:
mutate(c_h_percent = c_h_count / sum(c_h_count))
`summarise()` has grouped output by 'arm', 'gender', 'country'. You can
override using the `.groups` argument.
# A tibble: 16 × 7
# Groups: arm, gender, country [8]
arm gender country handed mean c_h_count c_h_percent
<fct> <fct> <fct> <fct> <dbl> <int> <dbl>
1 Arm A Female CAN Left 38.9 32 0.711
2 Arm A Female CAN Right 36.6 13 0.289
3 Arm A Female USA Left 40.4 34 0.667
4 Arm A Female USA Right 36.9 17 0.333
5 Arm A Male CAN Left 40.4 42 0.656
6 Arm A Male CAN Right 40.2 22 0.344
7 Arm A Male USA Left 39.7 19 0.463
8 Arm A Male USA Right 39.8 22 0.537
9 Arm B Female CAN Left 40.3 26 0.565
10 Arm B Female CAN Right 40.2 20 0.435
11 Arm B Female USA Left 39.2 25 0.543
12 Arm B Female USA Right 38.5 21 0.457
13 Arm B Male CAN Left 37.7 37 0.597
14 Arm B Male CAN Right 40.6 25 0.403
15 Arm B Male USA Left 40.1 25 0.556
16 Arm B Male USA Right 39.0 20 0.444
which has 16 rows (cells) like the average_age
data frame defined above. Next, we will derive the group information for countries:
%>%
df group_by(arm, gender, country, handed) %>%
summarize(mean = mean(age), c_h_count = n()) %>%
mutate(c_h_percent = c_h_count / sum(c_h_count)) %>%
mutate(c_count = sum(c_h_count)) %>%
ungroup(country) %>%
# note that we always use the finest level of counts to avoid duplicate counting
mutate(c_percent = c_count / sum(c_h_count))
`summarise()` has grouped output by 'arm', 'gender', 'country'. You can
override using the `.groups` argument.
# A tibble: 16 × 9
# Groups: arm, gender [4]
arm gender country handed mean c_h_count c_h_percent c_count c_percent
<fct> <fct> <fct> <fct> <dbl> <int> <dbl> <int> <dbl>
1 Arm A Female CAN Left 38.9 32 0.711 45 0.469
2 Arm A Female CAN Right 36.6 13 0.289 45 0.469
3 Arm A Female USA Left 40.4 34 0.667 51 0.531
4 Arm A Female USA Right 36.9 17 0.333 51 0.531
5 Arm A Male CAN Left 40.4 42 0.656 64 0.610
6 Arm A Male CAN Right 40.2 22 0.344 64 0.610
7 Arm A Male USA Left 39.7 19 0.463 41 0.390
8 Arm A Male USA Right 39.8 22 0.537 41 0.390
9 Arm B Female CAN Left 40.3 26 0.565 46 0.5
10 Arm B Female CAN Right 40.2 20 0.435 46 0.5
11 Arm B Female USA Left 39.2 25 0.543 46 0.5
12 Arm B Female USA Right 38.5 21 0.457 46 0.5
13 Arm B Male CAN Left 37.7 37 0.597 62 0.579
14 Arm B Male CAN Right 40.6 25 0.403 62 0.579
15 Arm B Male USA Left 40.1 25 0.556 45 0.421
16 Arm B Male USA Right 39.0 20 0.444 45 0.421
Note that the information in c_count
and c_percent
must be repeated, as there are only 8 country group summaries but the rest of the data structure requires 16 rows for the information.
The rtables
call in contrast is:
basic_table() %>%
split_cols_by("arm") %>%
split_cols_by("gender") %>%
add_colcounts() %>%
split_rows_by("country") %>%
summarize_row_groups() %>%
split_rows_by("handed") %>%
summarize_row_groups() %>%
analyze("age", afun = mean, format = "xx.x") %>%
build_table(df)
Arm A Arm B
Female Male Female Male
(N=96) (N=105) (N=92) (N=107)
————————————————————————————————————————————————————————————
CAN 45 (46.9%) 64 (61.0%) 46 (50.0%) 62 (57.9%)
Left 32 (33.3%) 42 (40.0%) 26 (28.3%) 37 (34.6%)
mean 38.9 40.4 40.3 37.7
Right 13 (13.5%) 22 (21.0%) 20 (21.7%) 25 (23.4%)
mean 36.6 40.2 40.2 40.6
USA 51 (53.1%) 41 (39.0%) 46 (50.0%) 45 (42.1%)
Left 34 (35.4%) 19 (18.1%) 25 (27.2%) 25 (23.4%)
mean 40.4 39.7 39.2 40.1
Right 17 (17.7%) 22 (21.0%) 21 (22.8%) 20 (18.7%)
mean 36.9 39.8 38.5 39.0
Further, the rtable
syntax has hopefully also become a bit more straightforward to derive the cell values than with dplyr
for this particular table.
In this vignette learned that:
dplyr
and data.frame
or tibble
as data structure
dplyr
keeps simple things simplertables
streamlines the construction of complex tablesWe recommend that you continue reading the clinical_trials
vignette where we create a number of more advanced tables using layouts.