The tableone package is an R package that eases the construction of “Table 1”, i.e., the patient baseline characteristics table commonly found in biomedical research papers. The package can summarize both continuous and categorical variables mixed within one table. Categorical variables can be summarized as counts and/or percentages. Continuous variables can be summarized in the “normal” way (means and standard deviations) or the “nonnormal” way (medians and interquartile ranges).
## tableone package itself
library(tableone)
## survival package for Mayo Clinic's PBC data
library(survival)
data(pbc)
The simplest use case is summarizing the whole dataset. You can just feed in the data frame to the main workhorse function CreateTableOne(). You can see there are 418 patients in the dataset.
CreateTableOne(data = pbc)
Overall
n 418
id (mean (SD)) 209.50 (120.81)
time (mean (SD)) 1917.78 (1104.67)
status (mean (SD)) 0.83 (0.96)
trt (mean (SD)) 1.49 (0.50)
age (mean (SD)) 50.74 (10.45)
sex = f (%) 374 (89.5)
ascites (mean (SD)) 0.08 (0.27)
hepato (mean (SD)) 0.51 (0.50)
spiders (mean (SD)) 0.29 (0.45)
edema (mean (SD)) 0.10 (0.25)
bili (mean (SD)) 3.22 (4.41)
chol (mean (SD)) 369.51 (231.94)
albumin (mean (SD)) 3.50 (0.42)
copper (mean (SD)) 97.65 (85.61)
alk.phos (mean (SD)) 1982.66 (2140.39)
ast (mean (SD)) 122.56 (56.70)
trig (mean (SD)) 124.70 (65.15)
platelet (mean (SD)) 257.02 (98.33)
protime (mean (SD)) 10.73 (1.02)
stage (mean (SD)) 3.02 (0.88)
Most of the categorical variables are coded numerically, so we either have to transform them to factors in the dataset or use the factorVars argument to transform them on the fly. It is also better practice to specify which variables to summarize with the vars argument, and to exclude the ID variable(s). How do we know which ones are numerically coded categorical variables? Please check your data dictionary (in this case help(pbc)). This time I am saving the result object in a variable.
## Get variables names
dput(names(pbc))
c("id", "time", "status", "trt", "age", "sex", "ascites", "hepato",
"spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos",
"ast", "trig", "platelet", "protime", "stage")
## Vector of variables to summarize
myVars <- c("time", "status", "trt", "age", "sex", "ascites", "hepato",
            "spiders", "edema", "bili", "chol", "albumin", "copper", "alk.phos",
            "ast", "trig", "platelet", "protime", "stage")
## Vector of categorical variables that need transformation
catVars <- c("status", "trt", "ascites", "hepato",
             "spiders", "edema", "stage")
## Create a TableOne object
tab2 <- CreateTableOne(vars = myVars, data = pbc, factorVars = catVars)
OK. It’s more interpretable now. Binary categorical variables are summarized as counts and percentages of the second level. For example, if a variable is coded as 0 and 1, the “1” level is summarized. For variables with three or more categories, all levels are summarized. Please bear in mind that the percentages are calculated after excluding missing values.
tab2
Overall
n 418
time (mean (SD)) 1917.78 (1104.67)
status (%)
0 232 (55.5)
1 25 ( 6.0)
2 161 (38.5)
trt = 2 (%) 154 (49.4)
age (mean (SD)) 50.74 (10.45)
sex = f (%) 374 (89.5)
ascites = 1 (%) 24 ( 7.7)
hepato = 1 (%) 160 (51.3)
spiders = 1 (%) 90 (28.8)
edema (%)
0 354 (84.7)
0.5 44 (10.5)
1 20 ( 4.8)
bili (mean (SD)) 3.22 (4.41)
chol (mean (SD)) 369.51 (231.94)
albumin (mean (SD)) 3.50 (0.42)
copper (mean (SD)) 97.65 (85.61)
alk.phos (mean (SD)) 1982.66 (2140.39)
ast (mean (SD)) 122.56 (56.70)
trig (mean (SD)) 124.70 (65.15)
platelet (mean (SD)) 257.02 (98.33)
protime (mean (SD)) 10.73 (1.02)
stage (%)
1 21 ( 5.1)
2 92 (22.3)
3 155 (37.6)
4 144 (35.0)
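The other route mentioned earlier, converting the variables to factors in the dataset itself, can be sketched as follows. This is a minimal sketch rather than part of the original walkthrough: pbcFactored and tabFactored are names made up for illustration, and survival::pbc is used here as a stand-in for data(pbc).

```r
library(tableone)
library(survival)

## Same data as data(pbc), taken explicitly from the survival package
pbc <- survival::pbc

## Numerically coded categorical variables (same list as catVars above)
catVars <- c("status", "trt", "ascites", "hepato",
             "spiders", "edema", "stage")

## Everything except the ID variable
myVars <- setdiff(names(pbc), "id")

## Convert on a copy so the original numeric coding stays available
pbcFactored <- pbc
pbcFactored[catVars] <- lapply(pbcFactored[catVars], factor)

## factorVars is not needed because the columns are already factors
tabFactored <- CreateTableOne(vars = myVars, data = pbcFactored)
tabFactored
```

Because factor() keeps the same levels in the same order, this table should match tab2. Converting in the dataset is convenient when the same factors are reused in later modelling steps; factorVars is convenient when you want to leave the data frame untouched.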
If you want to show all levels, you can use the showAllLevels argument to the print() method. The formatOptions argument used here adds a thousands separator to large numbers.
print(tab2, showAllLevels = TRUE, formatOptions = list(big.mark = ","))
level Overall
n 418
time (mean (SD)) 1,917.78 (1,104.67)
status (%) 0 232 (55.5)
1 25 ( 6.0)
2 161 (38.5)
trt (%) 1 158 (50.6)
2 154 (49.4)
age (mean (SD)) 50.74 (10.45)
sex (%) m 44 (10.5)
f 374 (89.5)
ascites (%) 0 288 (92.3)
1 24 ( 7.7)
hepato (%) 0 152 (48.7)
1 160 (51.3)
spiders (%) 0 222 (71.2)
1 90 (28.8)
edema (%) 0 354 (84.7)
0.5 44 (10.5)
1 20 ( 4.8)
bili (mean (SD)) 3.22 (4.41)
chol (mean (SD)) 369.51 (231.94)
albumin (mean (SD)) 3.50 (0.42)
copper (mean (SD)) 97.65 (85.61)
alk.phos (mean (SD)) 1,982.66 (2,140.39)
ast (mean (SD)) 122.56 (56.70)
trig (mean (SD)) 124.70 (65.15)
platelet (mean (SD)) 257.02 (98.33)
protime (mean (SD)) 10.73 (1.02)
stage (%) 1 21 ( 5.1)
2 92 (22.3)
3 155 (37.6)
4 144 (35.0)
If you need more detailed information, including the number/proportion missing, use the summary() method on the result object. The continuous variables are shown first, and the categorical variables are shown second.
summary(tab2)
### Summary of continuous variables ###
strata: Overall
n miss p.miss mean sd median p25 p75 min max skew kurt
time 418 0 0.0 1918 1104.7 1730 1092.8 2614 41.0 4795 0.47 -0.5
age 418 0 0.0 51 10.4 51 42.8 58 26.3 78 0.09 -0.6
bili 418 0 0.0 3 4.4 1 0.8 3 0.3 28 2.72 8.1
chol 418 134 32.1 370 231.9 310 249.5 400 120.0 1775 3.41 14.3
albumin 418 0 0.0 3 0.4 4 3.2 4 2.0 5 -0.47 0.6
copper 418 108 25.8 98 85.6 73 41.2 123 4.0 588 2.30 7.6
alk.phos 418 106 25.4 1983 2140.4 1259 871.5 1980 289.0 13862 2.99 9.7
ast 418 106 25.4 123 56.7 115 80.6 152 26.4 457 1.45 4.3
trig 418 136 32.5 125 65.1 108 84.2 151 33.0 598 2.52 11.8
platelet 418 11 2.6 257 98.3 251 188.5 318 62.0 721 0.63 0.9
protime 418 2 0.5 11 1.0 11 10.0 11 9.0 18 2.22 10.0
=======================================================================================
### Summary of categorical variables ###
strata: Overall
var n miss p.miss level freq percent cum.percent
status 418 0 0.0 0 232 55.5 55.5
1 25 6.0 61.5
2 161 38.5 100.0
trt 418 106 25.4 1 158 50.6 50.6
2 154 49.4 100.0
sex 418 0 0.0 m 44 10.5 10.5
f 374 89.5 100.0
ascites 418 106 25.4 0 288 92.3 92.3
1 24 7.7 100.0
hepato 418 106 25.4 0 152 48.7 48.7
1 160 51.3 100.0
spiders 418 106 25.4 0 222 71.2 71.2
1 90 28.8 100.0
edema 418 0 0.0 0 354 84.7 84.7
0.5 44 10.5 95.2
1 20 4.8 100.0
stage 418 6 1.4 1 21 5.1 5.1
2 92 22.3 27.4
3 155 37.6 65.0
4 144 35.0 100.0
It looks like most of the continuous variables are highly skewed except time, age, albumin, and platelet (biomarkers are usually distributed with strong positive skews). Summarizing them as such may please your future peer reviewer(s). Let’s do it with the nonnormal argument to the print() method. Can you see the difference? If you just say nonnormal = TRUE, all continuous variables are summarized the “nonnormal” way.
biomarkers <- c("bili","chol","copper","alk.phos","ast","trig","protime")
print(tab2, nonnormal = biomarkers, formatOptions = list(big.mark = ","))
Overall
n 418
time (mean (SD)) 1,917.78 (1,104.67)
status (%)
0 232 (55.5)
1 25 ( 6.0)
2 161 (38.5)
trt = 2 (%) 154 (49.4)
age (mean (SD)) 50.74 (10.45)
sex = f (%) 374 (89.5)
ascites = 1 (%) 24 ( 7.7)
hepato = 1 (%) 160 (51.3)
spiders = 1 (%) 90 (28.8)
edema (%)
0 354 (84.7)
0.5 44 (10.5)
1 20 ( 4.8)
bili (median [IQR]) 1.40 [0.80, 3.40]
chol (median [IQR]) 309.50 [249.50, 400.00]
albumin (mean (SD)) 3.50 (0.42)
copper (median [IQR]) 73.00 [41.25, 123.00]
alk.phos (median [IQR]) 1,259.00 [871.50, 1,980.00]
ast (median [IQR]) 114.70 [80.60, 151.90]
trig (median [IQR]) 108.00 [84.25, 151.00]
platelet (mean (SD)) 257.02 (98.33)
protime (median [IQR]) 10.60 [10.00, 11.10]
stage (%)
1 21 ( 5.1)
2 92 (22.3)
3 155 (37.6)
4 144 (35.0)
If you want to fine tune the table further, please check out ?print.TableOne for the full list of options.
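As a quick illustration of the nonnormal = TRUE shortcut described above, the sketch below builds a small table of continuous variables only and prints every one of them as median [IQR]. The object names tabCont and matAll are made up for this example.

```r
library(tableone)
library(survival)
pbc <- survival::pbc

## A small table of continuous variables only
tabCont <- CreateTableOne(vars = c("time", "age", "bili", "albumin"),
                          data = pbc)

## nonnormal = TRUE applies the median [IQR] summary to all of them;
## the matrix returned invisibly by print() is kept for inspection
matAll <- print(tabCont, nonnormal = TRUE)

## Row labels now say "median [IQR]" instead of "mean (SD)"
rownames(matAll)
```

Passing a character vector, as done with biomarkers, is the more selective option when only some variables are skewed.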
Often you want to group patients and summarize group by group. It’s also pretty simple. Grouping by exposure categories is probably the most common way, so let’s do it by the treatment variable. According to ?pbc, it is coded as (1) D-penicillmain (it’s really “D-penicillamine”), (2) placebo, and (NA) not randomized. NA does not function as a grouping level, so those patients are dropped. If you do want to show the result for the NA group, you need to recode it to something other than NA.
tab3 <- CreateTableOne(vars = myVars, strata = "trt" , data = pbc, factorVars = catVars)
print(tab3, nonnormal = biomarkers, formatOptions = list(big.mark = ","))
Stratified by trt
1 2 p test
n 158 154
time (mean (SD)) 2,015.62 (1,094.12) 1,996.86 (1,155.93) 0.883
status (%) 0.894
0 83 (52.5) 85 ( 55.2)
1 10 ( 6.3) 9 ( 5.8)
2 65 (41.1) 60 ( 39.0)
trt = 2 (%) 0 ( 0.0) 154 (100.0) <0.001
age (mean (SD)) 51.42 (11.01) 48.58 (9.96) 0.018
sex = f (%) 137 (86.7) 139 ( 90.3) 0.421
ascites = 1 (%) 14 ( 8.9) 10 ( 6.5) 0.567
hepato = 1 (%) 73 (46.2) 87 ( 56.5) 0.088
spiders = 1 (%) 45 (28.5) 45 ( 29.2) 0.985
edema (%) 0.877
0 132 (83.5) 131 ( 85.1)
0.5 16 (10.1) 13 ( 8.4)
1 10 ( 6.3) 10 ( 6.5)
bili (median [IQR]) 1.40 [0.80, 3.20] 1.30 [0.72, 3.60] 0.842 nonnorm
chol (median [IQR]) 315.50 [247.75, 417.00] 303.50 [254.25, 377.00] 0.544 nonnorm
albumin (mean (SD)) 3.52 (0.44) 3.52 (0.40) 0.874
copper (median [IQR]) 73.00 [40.00, 121.00] 73.00 [43.00, 139.00] 0.717 nonnorm
alk.phos (median [IQR]) 1,214.50 [840.75, 2,028.00] 1,283.00 [922.50, 1,949.75] 0.812 nonnorm
ast (median [IQR]) 111.60 [76.73, 151.51] 117.40 [83.78, 151.90] 0.459 nonnorm
trig (median [IQR]) 106.00 [84.50, 146.00] 113.00 [84.50, 155.00] 0.370 nonnorm
platelet (mean (SD)) 258.75 (100.32) 265.20 (90.73) 0.555
protime (median [IQR]) 10.60 [10.03, 11.00] 10.60 [10.00, 11.40] 0.588 nonnorm
stage (%) 0.201
1 12 ( 7.6) 4 ( 2.6)
2 35 (22.2) 32 ( 20.8)
3 56 (35.4) 64 ( 41.6)
4 55 (34.8) 54 ( 35.1)
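If you do want the non-randomized patients to appear as their own column, one way to recode NA is sketched below. This is an assumption-laden example: the new column trtGroup, its labels, and the object tabAllGroups are invented for illustration, and only a few variables are summarized to keep it short.

```r
library(tableone)
library(survival)
pbc <- survival::pbc

## Recode NA into an explicit third group (hypothetical column name)
pbc$trtGroup <- ifelse(is.na(pbc$trt), "not randomized",
                       ifelse(pbc$trt == 1, "D-penicillamine", "placebo"))

## Group sizes: 158 D-penicillamine, 154 placebo, 106 not randomized
table(pbc$trtGroup)

## Stratify by the recoded variable so no patients are dropped
tabAllGroups <- CreateTableOne(vars = c("age", "sex", "bili"),
                               strata = "trtGroup", data = pbc)
print(tabAllGroups, nonnormal = "bili")
```

With three strata, the p-values become three-group comparisons, so they are not comparable to the two-group p-values in the table stratified by trt.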
As you can see in the table stratified by trt above, when there are two or more groups, group comparison p-values are printed along with the table (well, let’s not argue the appropriateness of hypothesis testing for table 1 in an RCT for now). Very small p-values are shown with the less than sign. The hypothesis test functions used by default are chisq.test() for categorical variables (with continuity correction) and oneway.test() for continuous variables (with equal variance assumption, i.e., regular ANOVA). Two-group ANOVA is equivalent to a t-test.
You may be worried about the nonnormal variables and small cell counts in the stage variable. In such a situation, you can use the nonnormal argument like before as well as the exact (test) argument in the print() method. Now kruskal.test() is used for the nonnormal continuous variables and fisher.test() is used for categorical variables specified in the exact argument. kruskal.test() is equivalent to wilcox.test() in the two-group case. The column named test indicates which p-values were calculated using the non-default tests.
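The two equivalences just mentioned can be checked directly with base R. The sketch below is only a sanity check on the pbc data, not something tableone requires; the p-value variable names are invented here. Note the assumption for the rank test: wilcox.test() matches kruskal.test() when it uses the normal approximation without continuity correction (exact = FALSE, correct = FALSE).

```r
library(survival)
pbc <- survival::pbc

## Two-group ANOVA with equal variances vs. the equal-variance t-test
pAnova <- oneway.test(age ~ trt, data = pbc, var.equal = TRUE)$p.value
pTtest <- t.test(age ~ trt, data = pbc, var.equal = TRUE)$p.value

## Kruskal-Wallis vs. Wilcoxon rank sum (normal approximation, no correction)
pKruskal <- kruskal.test(bili ~ trt, data = pbc)$p.value
pWilcox  <- wilcox.test(bili ~ trt, data = pbc,
                        exact = FALSE, correct = FALSE)$p.value

all.equal(pAnova, pTtest)     # TRUE
all.equal(pKruskal, pWilcox)  # TRUE
```

With its default settings wilcox.test() applies a continuity correction, so its p-value can differ slightly from the one printed in the table.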
To also show standardized mean differences, use the smd option.
print(tab3, nonnormal = biomarkers, exact = "stage", smd = TRUE)
Stratified by trt
1 2 p test SMD
n 158 154
time (mean (SD)) 2015.62 (1094.12) 1996.86 (1155.93) 0.883 0.017
status (%) 0.894 0.054
0 83 (52.5) 85 ( 55.2)
1 10 ( 6.3) 9 ( 5.8)
2 65 (41.1) 60 ( 39.0)
trt = 2 (%) 0 ( 0.0) 154 (100.0) <0.001 NaN
age (mean (SD)) 51.42 (11.01) 48.58 (9.96) 0.018 0.270
sex = f (%) 137 (86.7) 139 ( 90.3) 0.421 0.111
ascites = 1 (%) 14 ( 8.9) 10 ( 6.5) 0.567 0.089
hepato = 1 (%) 73 (46.2) 87 ( 56.5) 0.088 0.207
spiders = 1 (%) 45 (28.5) 45 ( 29.2) 0.985 0.016
edema (%) 0.877 0.058
0 132 (83.5) 131 ( 85.1)
0.5 16 (10.1) 13 ( 8.4)
1 10 ( 6.3) 10 ( 6.5)
bili (median [IQR]) 1.40 [0.80, 3.20] 1.30 [0.72, 3.60] 0.842 nonnorm 0.171
chol (median [IQR]) 315.50 [247.75, 417.00] 303.50 [254.25, 377.00] 0.544 nonnorm 0.038
albumin (mean (SD)) 3.52 (0.44) 3.52 (0.40) 0.874 0.018
copper (median [IQR]) 73.00 [40.00, 121.00] 73.00 [43.00, 139.00] 0.717 nonnorm <0.001
alk.phos (median [IQR]) 1214.50 [840.75, 2028.00] 1283.00 [922.50, 1949.75] 0.812 nonnorm 0.037
ast (median [IQR]) 111.60 [76.73, 151.51] 117.40 [83.78, 151.90] 0.459 nonnorm 0.084
trig (median [IQR]) 106.00 [84.50, 146.00] 113.00 [84.50, 155.00] 0.370 nonnorm 0.017
platelet (mean (SD)) 258.75 (100.32) 265.20 (90.73) 0.555 0.067
protime (median [IQR]) 10.60 [10.03, 11.00] 10.60 [10.00, 11.40] 0.588 nonnorm 0.146
stage (%) 0.205 exact 0.246
1 12 ( 7.6) 4 ( 2.6)
2 35 (22.2) 32 ( 20.8)
3 56 (35.4) 64 ( 41.6)
4 55 (34.8) 54 ( 35.1)
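For readers unfamiliar with the SMD column, the sketch below recomputes the value for age by hand. It assumes the usual definition for a continuous variable, the absolute difference in means divided by the square root of the average of the two group variances; this reproduces the 0.270 printed above, but treat it as an illustration rather than a description of the package internals. Categorical variables use a different formula that is not shown here.

```r
library(survival)
pbc <- survival::pbc

## Keep randomized patients only, as the stratified table does
randomized <- pbc[!is.na(pbc$trt), ]

## Group-wise means and variances of age
groupMeans <- tapply(randomized$age, randomized$trt, mean)
groupVars  <- tapply(randomized$age, randomized$trt, var)

## Standardized mean difference under the assumed definition
smdAge <- abs(groupMeans[["1"]] - groupMeans[["2"]]) /
  sqrt((groupVars[["1"]] + groupVars[["2"]]) / 2)

round(smdAge, 3)
```

Unlike a p-value, this quantity does not shrink merely because the sample is large, which is one reason it is often reported alongside or instead of hypothesis tests in a Table 1.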
My typical next step is to export the table to Excel for editing, and then to Word (clinical medical journals usually do not offer LaTeX submission).
The quick and dirty way that I used to use is copy and paste. Use the quote = TRUE argument to show the quotes and noSpaces = TRUE to remove the spaces used to align text in the R console (the latter is optional). Now you can just copy and paste the whole thing into an Excel spreadsheet. After pasting, click the small pasting icon to choose Use Text Import Wizard…; in the dialogue you can just click finish to fit the values in the appropriate cells. Then you can edit or re-align things as you like. I usually center-align the group summaries and right-align the p-values.
print(tab3, nonnormal = biomarkers, exact = "stage", quote = TRUE, noSpaces = TRUE)
"Stratified by trt"
"" "1" "2" "p" "test"
"n" "158" "154" "" ""
"time (mean (SD))" "2015.62 (1094.12)" "1996.86 (1155.93)" "0.883" ""
"status (%)" "" "" "0.894" ""
" 0" "83 (52.5)" "85 (55.2)" "" ""
" 1" "10 (6.3)" "9 (5.8)" "" ""
" 2" "65 (41.1)" "60 (39.0)" "" ""
"trt = 2 (%)" "0 (0.0)" "154 (100.0)" "<0.001" ""
"age (mean (SD))" "51.42 (11.01)" "48.58 (9.96)" "0.018" ""
"sex = f (%)" "137 (86.7)" "139 (90.3)" "0.421" ""
"ascites = 1 (%)" "14 (8.9)" "10 (6.5)" "0.567" ""
"hepato = 1 (%)" "73 (46.2)" "87 (56.5)" "0.088" ""
"spiders = 1 (%)" "45 (28.5)" "45 (29.2)" "0.985" ""
"edema (%)" "" "" "0.877" ""
" 0" "132 (83.5)" "131 (85.1)" "" ""
" 0.5" "16 (10.1)" "13 (8.4)" "" ""
" 1" "10 (6.3)" "10 (6.5)" "" ""
"bili (median [IQR])" "1.40 [0.80, 3.20]" "1.30 [0.72, 3.60]" "0.842" "nonnorm"
"chol (median [IQR])" "315.50 [247.75, 417.00]" "303.50 [254.25, 377.00]" "0.544" "nonnorm"
"albumin (mean (SD))" "3.52 (0.44)" "3.52 (0.40)" "0.874" ""
"copper (median [IQR])" "73.00 [40.00, 121.00]" "73.00 [43.00, 139.00]" "0.717" "nonnorm"
"alk.phos (median [IQR])" "1214.50 [840.75, 2028.00]" "1283.00 [922.50, 1949.75]" "0.812" "nonnorm"
"ast (median [IQR])" "111.60 [76.73, 151.51]" "117.40 [83.78, 151.90]" "0.459" "nonnorm"
"trig (median [IQR])" "106.00 [84.50, 146.00]" "113.00 [84.50, 155.00]" "0.370" "nonnorm"
"platelet (mean (SD))" "258.75 (100.32)" "265.20 (90.73)" "0.555" ""
"protime (median [IQR])" "10.60 [10.03, 11.00]" "10.60 [10.00, 11.40]" "0.588" "nonnorm"
"stage (%)" "" "" "0.205" "exact"
" 1" "12 (7.6)" "4 (2.6)" "" ""
" 2" "35 (22.2)" "32 (20.8)" "" ""
" 3" "56 (35.4)" "64 (41.6)" "" ""
" 4" "55 (34.8)" "54 (35.1)" "" ""
If you do not like the manual labor of copy-and-paste, you can automate the task in the following way. The print() method for a TableOne object invisibly returns a matrix identical to what you see. You can capture this by assignment to a variable (here tab3Mat). Do not use quote = TRUE in this case; the noSpaces argument is again optional. The seemingly self-contradictory printToggle = FALSE argument to the print() method suppresses the console output if you wish. Then you can save the object to a CSV file. As it is a regular matrix object, you can also save it to an Excel file using packages such as XLConnect.
tab3Mat <- print(tab3, nonnormal = biomarkers, exact = "stage", quote = FALSE, noSpaces = TRUE, printToggle = FALSE)
## Save to a CSV file
write.csv(tab3Mat, file = "myTable.csv")
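If you would rather have the row labels in a named column than in an unnamed first column, a small variation is sketched below. It uses a reduced table so that it stands alone; tabSmall, matSmall, the "Variable" header, and the temporary file path are all choices made for this example, not something the package prescribes.

```r
library(tableone)
library(survival)
pbc <- survival::pbc

## A reduced stratified table, captured as a matrix without printing
tabSmall <- CreateTableOne(vars = c("age", "sex", "stage"), strata = "trt",
                           data = pbc, factorVars = "stage")
matSmall <- print(tabSmall, exact = "stage", noSpaces = TRUE,
                  printToggle = FALSE)

## Move the row labels into an explicit first column
matOut <- cbind(Variable = rownames(matSmall), matSmall)

## Write to a temporary location and read back to confirm the layout
outFile <- file.path(tempdir(), "myTable.csv")
write.csv(matOut, file = outFile, row.names = FALSE)
roundTrip <- read.csv(outFile, check.names = FALSE, colClasses = "character")
dim(roundTrip)
```

Reading the file back with colClasses = "character" keeps entries such as "<0.001" and "12 (7.6)" as text, which is what a spreadsheet should receive as well.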
You may want to see the categorical or continuous variables only. You can do this by accessing the CatTable part and the ContTable part of the TableOne object as follows. summary() methods are defined for both, as well as print() methods with various arguments. Please see ?print.CatTable and ?print.ContTable for details.
## Categorical part only
tab3$CatTable
Stratified by trt
1 2 p test
n 158 154
status (%) 0.894
0 83 (52.5) 85 ( 55.2)
1 10 ( 6.3) 9 ( 5.8)
2 65 (41.1) 60 ( 39.0)
trt = 2 (%) 0 ( 0.0) 154 (100.0) <0.001
sex = f (%) 137 (86.7) 139 ( 90.3) 0.421
ascites = 1 (%) 14 ( 8.9) 10 ( 6.5) 0.567
hepato = 1 (%) 73 (46.2) 87 ( 56.5) 0.088
spiders = 1 (%) 45 (28.5) 45 ( 29.2) 0.985
edema (%) 0.877
0 132 (83.5) 131 ( 85.1)
0.5 16 (10.1) 13 ( 8.4)
1 10 ( 6.3) 10 ( 6.5)
stage (%) 0.201
1 12 ( 7.6) 4 ( 2.6)
2 35 (22.2) 32 ( 20.8)
3 56 (35.4) 64 ( 41.6)
4 55 (34.8) 54 ( 35.1)
## Continuous part only
print(tab3$ContTable, nonnormal = biomarkers)
Stratified by trt
1 2 p test
n 158 154
time (mean (SD)) 2015.62 (1094.12) 1996.86 (1155.93) 0.883
age (mean (SD)) 51.42 (11.01) 48.58 (9.96) 0.018
bili (median [IQR]) 1.40 [0.80, 3.20] 1.30 [0.72, 3.60] 0.842 nonnorm
chol (median [IQR]) 315.50 [247.75, 417.00] 303.50 [254.25, 377.00] 0.544 nonnorm
albumin (mean (SD)) 3.52 (0.44) 3.52 (0.40) 0.874
copper (median [IQR]) 73.00 [40.00, 121.00] 73.00 [43.00, 139.00] 0.717 nonnorm
alk.phos (median [IQR]) 1214.50 [840.75, 2028.00] 1283.00 [922.50, 1949.75] 0.812 nonnorm
ast (median [IQR]) 111.60 [76.73, 151.51] 117.40 [83.78, 151.90] 0.459 nonnorm
trig (median [IQR]) 106.00 [84.50, 146.00] 113.00 [84.50, 155.00] 0.370 nonnorm
platelet (mean (SD)) 258.75 (100.32) 265.20 (90.73) 0.555
protime (median [IQR]) 10.60 [10.03, 11.00] 10.60 [10.00, 11.40] 0.588 nonnorm
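The summary() and print() methods for the two parts can be used as in the short sketch below. It rebuilds a small stratified table so that it runs on its own; tabParts is a name made up for the example.

```r
library(tableone)
library(survival)
pbc <- survival::pbc

## Small stratified table with both categorical and continuous variables
tabParts <- CreateTableOne(vars = c("age", "bili", "sex", "stage"),
                           strata = "trt", data = pbc, factorVars = "stage")

## Detailed summaries, one part at a time
summary(tabParts$CatTable)
summary(tabParts$ContTable)

## print() arguments apply to each part separately
print(tabParts$CatTable, exact = "stage", showAllLevels = TRUE)
print(tabParts$ContTable, nonnormal = "bili")
```

Working with the parts separately is handy when a journal asks for categorical and continuous characteristics in different tables.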