This vignette is current as of furniture 1.9.12.
furniture
library(furniture)
We will first make a fictitious data set:
df <- data.frame(a = rnorm(1000, 1.5, 2),
                 b = seq(1, 1000, 1),
                 c = c(rep("control", 400), rep("Other", 70), rep("treatment", 500), rep("None", 30)),
                 d = c(sample(1:1000, 900, replace=TRUE), rep(-99, 100)))
There are four functions that we’ll demonstrate here:
washer
table1
tableC
tableF
washer() is a great function for quick data cleaning. It is useful in
situations where there are placeholders, extra levels in a factor, or
several values that need to be changed to another.
library(dplyr)
df <- df %>%
  mutate(d = washer(d, -99),  ## changes the placeholder -99 to NA
         c = washer(c, "Other", "None", value = "control")) ## changes "Other" and "None" to "control"
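To make the two washer() calls above more concrete, here is a rough base R equivalent on a small made-up example. It is only a sketch of the behaviour described in the comments (placeholder values become NA or a chosen replacement), not the package's implementation; the vectors `d_toy` and `c_toy` are hypothetical.

```r
# Toy data (hypothetical, not the vignette's df)
d_toy <- c(12, -99, 45, -99)
c_toy <- c("control", "Other", "treatment", "None")

# Similar in spirit to washer(d_toy, -99): the placeholder becomes NA
d_clean <- d_toy
d_clean[d_clean %in% -99] <- NA

# Similar in spirit to washer(c_toy, "Other", "None", value = "control")
c_clean <- c_toy
c_clean[c_clean %in% c("Other", "None")] <- "control"

d_clean
c_clean
```

The appeal of washer() is that the same replacement can be written inline inside mutate() without the intermediate copies used here.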
Now that the data is “washed” we can start exploring and reporting.
table1(df, a, b, factor(c), d)
##
## ────────────────────────────────
## Mean/Count (SD/%)
## n = 900
## a
## 1.4 (2.1)
## b
## 450.5 (260.0)
## factor(c)
## control 470 (52.2%)
## treatment 430 (47.8%)
## d
## 481.4 (290.7)
## ────────────────────────────────
The variables must be numeric or factor. Since we use a special type
of evaluation (i.e., non-standard evaluation), we can change the variables
in the function (e.g., factor(c)). This can be extended to
making a whole new variable in the function as well.
table1(df, a, b, d, ifelse(a > 1, 1, 0))
##
## ───────────────────────────────────────
## Mean/Count (SD/%)
## n = 900
## a
## 1.4 (2.1)
## b
## 450.5 (260.0)
## d
## 481.4 (290.7)
## ifelse(a > 1, 1, 0)
## 0.6 (0.5)
## ───────────────────────────────────────
This is just the beginning though. Two powerful things the function can do are shown below:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
splitby=~factor(c),
test=TRUE)
## Breusch-Pagan Test of Heteroskedasticity suggests `var.equal = FALSE` in t.test() for: b
##
## ─────────────────────────────────────────────────────────
## factor(c)
## control treatment P-Value
## n = 470 n = 430
## a 0.365
## 1.4 (2.0) 1.5 (2.1)
## b <.001
## 235.5 (135.8) 685.5 (124.3)
## d 0.785
## 483.9 (288.2) 478.6 (293.7)
## ifelse(a > 1, 1, 0) 0.438
## 0.6 (0.5) 0.6 (0.5)
## ─────────────────────────────────────────────────────────
The splitby = ~factor(c) stratifies the means and counts
by a factor variable (in this case either control or treatment). When we
use this we can also automatically compute tests of significance using
test=TRUE.
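The message printed with the table refers to `var.equal` in t.test(), so for a numeric variable split into two groups a comparable p-value can be obtained directly in base R. The sketch below uses a small simulated data frame (`toy`, hypothetical) and shows both the equal-variance and the Welch version of the test. It illustrates the kind of test involved, not the exact internals of table1().

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical two-group data, loosely mirroring the vignette's setup
toy <- data.frame(
  a = rnorm(200, 1.5, 2),
  c = factor(rep(c("control", "treatment"), each = 100))
)

# Two-sample t-test assuming equal variances in both groups
res_equal <- t.test(a ~ c, data = toy, var.equal = TRUE)

# Welch's t-test, which is what var.equal = FALSE requests
res_welch <- t.test(a ~ c, data = toy, var.equal = FALSE)

res_equal$p.value
res_welch$p.value
```

When the heteroskedasticity message appears for a variable, the Welch version is the one it points towards.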
We can also use it intuitively within the pipe (for more about this, see the “Table 1” vignette):
df %>%
  group_by(c) %>%
  table1(a, b, d, ifelse(a > 1, 1, 0),
         test=TRUE)
##
## ─────────────────────────────────────────────────────────
## c
## control treatment P-Value
## n = 470 n = 430
## a 0.365
## 1.4 (2.0) 1.5 (2.1)
## b <.001
## 235.5 (135.8) 685.5 (124.3)
## d 0.785
## 483.9 (288.2) 478.6 (293.7)
## ifelse(a > 1, 1, 0) 0.438
## 0.6 (0.5) 0.6 (0.5)
## ─────────────────────────────────────────────────────────
In this case, we used the group_by() function from dplyr (within the
tidyverse) and table1() knows to use that as the grouping variable in
place of the splitby argument.
If the parametric tests (default) are not appropriate, you can set
param = FALSE.
df %>%
  group_by(c) %>%
  table1(a, b, d, ifelse(a > 1, 1, 0),
         test=TRUE,
         param=FALSE)
##
## ─────────────────────────────────────────────────────────
## c
## control treatment P-Value
## n = 470 n = 430
## a 0.358
## 1.4 (2.0) 1.5 (2.1)
## b <.001
## 235.5 (135.8) 685.5 (124.3)
## d 0.788
## 483.9 (288.2) 478.6 (293.7)
## ifelse(a > 1, 1, 0) 0.437
## 0.6 (0.5) 0.6 (0.5)
## ─────────────────────────────────────────────────────────
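For readers unfamiliar with non-parametric group comparisons, the sketch below runs a Kruskal-Wallis rank sum test in base R on simulated data (`toy`, hypothetical). Whether param = FALSE uses exactly this test is an assumption here, not something the output above confirms; see ?table1 for the authoritative description.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical two-group data
toy <- data.frame(
  a = rnorm(200, 1.5, 2),
  c = factor(rep(c("control", "treatment"), each = 100))
)

# Rank-based comparison of `a` across the levels of `c`
kw <- kruskal.test(a ~ c, data = toy)

kw$method
kw$p.value
```

Rank-based tests make no normality assumption, which is why their p-values can differ slightly from the parametric ones, as in the two tables above.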
Finally, you can polish it quite a bit using a few other options. For example, you can do the following:
table1(df, a, b, d, ifelse(a > 1, 1, 0),
splitby=~factor(c),
test=TRUE,
var_names = c("A", "B", "D", "New Var"),
type = c("simple", "condensed"))
## Breusch-Pagan Test of Heteroskedasticity suggests `var.equal = FALSE` in t.test() for: B
##
## ─────────────────────────────────────────────
## factor(c)
## control treatment P-Value
## n = 470 n = 430
## A 1.4 (2.0) 1.5 (2.1) 0.365
## B 235.5 (135.8) 685.5 (124.3) <.001
## D 483.9 (288.2) 478.6 (293.7) 0.785
## New Var 0.6 (0.5) 0.6 (0.5) 0.438
## ─────────────────────────────────────────────
Note that var_names can be used for more complex naming
(e.g., with spaces, brackets) that otherwise cannot be used with data
frames. Alternatively, for simpler naming, we can name them
directly.
table1(df, A = a, B = b, D = d, A2 = ifelse(a > 1, 1, 0),
splitby=~factor(c),
test=TRUE,
type = c("simple", "condensed"))
## Breusch-Pagan Test of Heteroskedasticity suggests `var.equal = FALSE` in t.test() for: B
##
## ────────────────────────────────────────
## factor(c)
## control treatment P-Value
## n = 470 n = 430
## A 1.4 (2.0) 1.5 (2.1) 0.365
## B 235.5 (135.8) 685.5 (124.3) <.001
## D 483.9 (288.2) 478.6 (293.7) 0.785
## A2 0.6 (0.5) 0.6 (0.5) 0.438
## ────────────────────────────────────────
You can also format the numbers (adding a comma for big numbers such as in 20,000 instead of 20000):
table1(df, a, b, d, ifelse(a > 1, 1, 0),
splitby=~factor(c),
test=TRUE,
var_names = c("A", "B", "D", "New Var"),
format_number = TRUE)
## Breusch-Pagan Test of Heteroskedasticity suggests `var.equal = FALSE` in t.test() for: B
##
## ─────────────────────────────────────────────
## factor(c)
## control treatment P-Value
## n = 470 n = 430
## A 0.365
## 1.4 (2.0) 1.5 (2.1)
## B <.001
## 235.5 (135.8) 685.5 (124.3)
## D 0.785
## 483.9 (288.2) 478.6 (293.7)
## New Var 0.438
## 0.6 (0.5) 0.6 (0.5)
## ─────────────────────────────────────────────
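The values in the table above are all below 1,000, so the option has no visible effect there. To show the kind of formatting meant by "20,000 instead of 20000", here is a base R sketch using format() and formatC() with `big.mark`. It illustrates the formatting idea only and is not a claim about how table1() implements format_number.

```r
# Thousands separator for a whole number
big <- format(20000, big.mark = ",")
big

# Fixed number of decimals plus a thousands separator
with_decimals <- formatC(1234567.891, format = "f", digits = 1, big.mark = ",")
with_decimals
```

Both calls return character strings, which is worth remembering if the formatted values are later used in calculations.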
The table can be exported directly to a folder in the working
directory called “Table1”. Using export, we provide it with
a string that will be the name of the CSV containing the formatted
table.
table1(df, a, b, d, ifelse(a > 1, 1, 0),
splitby=~factor(c),
test=TRUE,
var_names = c("A", "B", "D", "New Var"),
format_number = TRUE,
export = "example_table1")
This can also be output as a LaTeX, markdown, or pandoc table
(matching all the output types of knitr::kable). The example below
produces a LaTeX table (not using kable, however, but
a built-in function that provides the variable name at the top of the
table):
table1(df, a, b, d, "new var" = ifelse(a > 1, 1, 0),
splitby = ~factor(c),
test = TRUE,
output = "latex2")
## Breusch-Pagan Test of Heteroskedasticity suggests `var.equal = FALSE` in t.test() for: b
## \begin{table}[ ht ]
## \centering
## \caption{}\label{}
## \begin{tabular}{ l c c c }
## \toprule
## & \multicolumn{ 2 }{c}{ factor(c) }\\
## & control & treatment & P-Value \\
## & n = 470 & n = 430 & \\
## \midrule
## a & & & 0.365\\
## \hspace{6pt} & 1.4 (2.0) & 1.5 (2.1) & \\
## b & & & <.001\\
## \hspace{6pt} & 235.5 (135.8) & 685.5 (124.3) & \\
## d & & & 0.785\\
## \hspace{6pt} & 483.9 (288.2) & 478.6 (293.7) & \\
## new var & & & 0.438\\
## \hspace{6pt} & 0.6 (0.5) & 0.6 (0.5) & \\
## \bottomrule
##
## \end{tabular}
## \end{table}
The last item to show regarding table1() is that it can
be printed in a simplified and condensed form. Instead of reporting
counts and percentages for categorical variables, it reports only
percentages, and the table has much less white space.
table1(df, a, b, d, "new var" = ifelse(a > 1, 1, 0),
splitby = ~factor(c),
test = TRUE,
type = c("simple", "condensed"))
## Breusch-Pagan Test of Heteroskedasticity suggests `var.equal = FALSE` in t.test() for: b
##
## ─────────────────────────────────────────────
## factor(c)
## control treatment P-Value
## n = 470 n = 430
## a 1.4 (2.0) 1.5 (2.1) 0.365
## b 235.5 (135.8) 685.5 (124.3) <.001
## d 483.9 (288.2) 478.6 (293.7) 0.785
## new var 0.6 (0.5) 0.6 (0.5) 0.438
## ─────────────────────────────────────────────
tableC() creates simple, beautiful correlation tables. The
syntax is just like table1() in most respects. Below we
include all the numeric variables to see their correlations. Since there
are missing values in d, we will use the natural
na.rm=TRUE.
tableC(df,
       a, b, d,
       na.rm = TRUE)
## N = 900
## Note: pearson correlation (p-value).
##
## ─────────────────────────────────────────
## [1] [2] [3]
## [1]a 1.00
## [2]b 0.04 (0.228) 1.00
## [3]d 0.052 (0.117) -0.025 (0.449) 1.00
## ─────────────────────────────────────────
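The "N = 900" line suggests that rows with a missing d are dropped before the correlations are computed. Under that assumption, a similar table can be approximated in base R by keeping complete cases and calling cor() and cor.test(). The data frame `toy` below is simulated and hypothetical.

```r
set.seed(3)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical numeric data with a few missing values in d
toy <- data.frame(a = rnorm(50), b = rnorm(50), d = rnorm(50))
toy$d[1:5] <- NA

# Keep only rows with no missing values (listwise deletion)
complete <- toy[complete.cases(toy), ]
n_used <- nrow(complete)

# Pearson correlation matrix and the p-value for one pair
cor_mat <- cor(complete, method = "pearson")
p_ab <- cor.test(complete$a, complete$b)$p.value

n_used
round(cor_mat, 2)
p_ab
```

tableC() wraps these steps and lays the coefficients and p-values out in the lower-triangle format shown above.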
All the adjustments that you can make in table1() can be
done here as well. For example,
tableC(df,
"A" = a, "B" = b, "D" = d,
na.rm = TRUE,
output = "html")
## N = 900
## Note: pearson correlation (p-value).
| | [1] | [2] | [3] |
|---|---|---|---|
| [1]A | 1.00 | | |
| [2]B | 0.04 (0.228) | 1.00 | |
| [3]D | 0.052 (0.117) | -0.025 (0.449) | 1.00 |
tableF() creates simple frequency tables. The syntax is
just like table1() and tableC() in most
respects, except that it uses only one variable instead of many.
tableF(df, a)
##
## ────────────────────────────────────────────────
## a Freq CumFreq Percent CumPerc
## -5.28390783512797 1 1 0.10% 0.10%
## -4.1275792976316 1 2 0.10% 0.20%
## -3.93636952277543 1 3 0.10% 0.30%
## -3.89537271320155 1 4 0.10% 0.40%
## -3.84396881973322 1 5 0.10% 0.50%
## -3.83082897865278 1 6 0.10% 0.60%
## -3.76296728175994 1 7 0.10% 0.70%
## -3.4981266314135 1 8 0.10% 0.80%
## -3.31546691765433 1 9 0.10% 0.90%
## -3.25319297975575 1 10 0.10% 1.00%
## ... ... ... ... ...
## 6.36397408152279 1 990 0.10% 99.00%
## 6.41120020956892 1 991 0.10% 99.10%
## 6.58234682670721 1 992 0.10% 99.20%
## 6.65413313551019 1 993 0.10% 99.30%
## 6.72502311635413 1 994 0.10% 99.40%
## 6.82292455151611 1 995 0.10% 99.50%
## 7.01845327535352 1 996 0.10% 99.60%
## 7.28263257714545 1 997 0.10% 99.70%
## 7.34009250738556 1 998 0.10% 99.80%
## 7.47136489158915 1 999 0.10% 99.90%
## 7.54442914306183 1 1000 0.10% 100.00%
## ────────────────────────────────────────────────
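The columns in this output (Freq, CumFreq, Percent, CumPerc) can be reproduced by hand with table(), cumsum(), and prop.table(). The sketch below does so for a short made-up vector `x` (hypothetical). It mirrors the column layout only and is not the package's code.

```r
# Hypothetical categorical data
x <- c("a", "b", "a", "c", "a", "b")

freq <- table(x)          # counts per distinct value
prop <- prop.table(freq)  # proportions per distinct value

out <- data.frame(
  value   = names(freq),
  Freq    = as.vector(freq),
  CumFreq = cumsum(as.vector(freq)),
  Percent = as.vector(prop) * 100,
  CumPerc = cumsum(as.vector(prop)) * 100
)

out
```

As the output above shows, a continuous variable such as a produces one row per distinct value, so frequency tables are most informative for discrete or categorical variables.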
Similarly to table1(), we can use a splitby
argument (or group_by()).
tableF(df, d, splitby = c)
## Variable:d
##
## ─────────────────────────────────────────────────────
## control Freq CumFreq Percent CumPerc Valid CumValid
## 12 3 3 0.60% 0.60% 0.64% 0.64%
## 16 1 4 0.20% 0.80% 0.21% 0.85%
## 18 1 5 0.20% 1.00% 0.21% 1.06%
## 19 1 6 0.20% 1.20% 0.21% 1.28%
## 22 1 7 0.20% 1.40% 0.21% 1.49%
## 23 2 9 0.40% 1.80% 0.43% 1.91%
## 24 1 10 0.20% 2.00% 0.21% 2.13%
## 25 1 11 0.20% 2.20% 0.21% 2.34%
## 27 1 12 0.20% 2.40% 0.21% 2.55%
## 28 1 13 0.20% 2.60% 0.21% 2.77%
## ... ... ... ... ... ... ...
## 984 1 459 0.20% 91.80% 0.21% 97.66%
## 986 1 460 0.20% 92.00% 0.21% 97.87%
## 989 1 461 0.20% 92.20% 0.21% 98.09%
## 991 1 462 0.20% 92.40% 0.21% 98.30%
## 992 1 463 0.20% 92.60% 0.21% 98.51%
## 996 2 465 0.40% 93.00% 0.43% 98.94%
## 997 1 466 0.20% 93.20% 0.21% 99.15%
## 998 1 467 0.20% 93.40% 0.21% 99.36%
## 999 1 468 0.20% 93.60% 0.21% 99.57%
## 1000 2 470 0.40% 94.00% 0.43% 100.00%
## Missing 30 500 6.00% 100.00%
## ─────────────────────────────────────────────────────
##
## ───────────────────────────────────────────────────────
## treatment Freq CumFreq Percent CumPerc Valid CumValid
## 1 1 1 0.20% 0.20% 0.23% 0.23%
## 2 2 3 0.40% 0.60% 0.47% 0.70%
## 3 1 4 0.20% 0.80% 0.23% 0.93%
## 5 2 6 0.40% 1.20% 0.47% 1.40%
## 8 2 8 0.40% 1.60% 0.47% 1.86%
## 9 1 9 0.20% 1.80% 0.23% 2.09%
## 12 1 10 0.20% 2.00% 0.23% 2.33%
## 15 1 11 0.20% 2.20% 0.23% 2.56%
## 16 2 13 0.40% 2.60% 0.47% 3.02%
## 19 1 14 0.20% 2.80% 0.23% 3.26%
## ... ... ... ... ... ... ...
## 969 1 421 0.20% 84.20% 0.23% 97.91%
## 974 1 422 0.20% 84.40% 0.23% 98.14%
## 976 1 423 0.20% 84.60% 0.23% 98.37%
## 977 1 424 0.20% 84.80% 0.23% 98.60%
## 978 1 425 0.20% 85.00% 0.23% 98.84%
## 979 1 426 0.20% 85.20% 0.23% 99.07%
## 981 1 427 0.20% 85.40% 0.23% 99.30%
## 988 1 428 0.20% 85.60% 0.23% 99.53%
## 992 1 429 0.20% 85.80% 0.23% 99.77%
## 996 1 430 0.20% 86.00% 0.23% 100.00%
## Missing 70 500 14.00% 100.00%
## ───────────────────────────────────────────────────────
df %>%
  group_by(c) %>%
  tableF(d)
## Using dplyr::group_by() groups: c
## Variable:d
##
## ─────────────────────────────────────────────────────
## control Freq CumFreq Percent CumPerc Valid CumValid
## 12 3 3 0.60% 0.60% 0.64% 0.64%
## 16 1 4 0.20% 0.80% 0.21% 0.85%
## 18 1 5 0.20% 1.00% 0.21% 1.06%
## 19 1 6 0.20% 1.20% 0.21% 1.28%
## 22 1 7 0.20% 1.40% 0.21% 1.49%
## 23 2 9 0.40% 1.80% 0.43% 1.91%
## 24 1 10 0.20% 2.00% 0.21% 2.13%
## 25 1 11 0.20% 2.20% 0.21% 2.34%
## 27 1 12 0.20% 2.40% 0.21% 2.55%
## 28 1 13 0.20% 2.60% 0.21% 2.77%
## ... ... ... ... ... ... ...
## 984 1 459 0.20% 91.80% 0.21% 97.66%
## 986 1 460 0.20% 92.00% 0.21% 97.87%
## 989 1 461 0.20% 92.20% 0.21% 98.09%
## 991 1 462 0.20% 92.40% 0.21% 98.30%
## 992 1 463 0.20% 92.60% 0.21% 98.51%
## 996 2 465 0.40% 93.00% 0.43% 98.94%
## 997 1 466 0.20% 93.20% 0.21% 99.15%
## 998 1 467 0.20% 93.40% 0.21% 99.36%
## 999 1 468 0.20% 93.60% 0.21% 99.57%
## 1000 2 470 0.40% 94.00% 0.43% 100.00%
## Missing 30 500 6.00% 100.00%
## ─────────────────────────────────────────────────────
##
## ───────────────────────────────────────────────────────
## treatment Freq CumFreq Percent CumPerc Valid CumValid
## 1 1 1 0.20% 0.20% 0.23% 0.23%
## 2 2 3 0.40% 0.60% 0.47% 0.70%
## 3 1 4 0.20% 0.80% 0.23% 0.93%
## 5 2 6 0.40% 1.20% 0.47% 1.40%
## 8 2 8 0.40% 1.60% 0.47% 1.86%
## 9 1 9 0.20% 1.80% 0.23% 2.09%
## 12 1 10 0.20% 2.00% 0.23% 2.33%
## 15 1 11 0.20% 2.20% 0.23% 2.56%
## 16 2 13 0.40% 2.60% 0.47% 3.02%
## 19 1 14 0.20% 2.80% 0.23% 3.26%
## ... ... ... ... ... ... ...
## 969 1 421 0.20% 84.20% 0.23% 97.91%
## 974 1 422 0.20% 84.40% 0.23% 98.14%
## 976 1 423 0.20% 84.60% 0.23% 98.37%
## 977 1 424 0.20% 84.80% 0.23% 98.60%
## 978 1 425 0.20% 85.00% 0.23% 98.84%
## 979 1 426 0.20% 85.20% 0.23% 99.07%
## 981 1 427 0.20% 85.40% 0.23% 99.30%
## 988 1 428 0.20% 85.60% 0.23% 99.53%
## 992 1 429 0.20% 85.80% 0.23% 99.77%
## 996 1 430 0.20% 86.00% 0.23% 100.00%
## Missing 70 500 14.00% 100.00%
## ───────────────────────────────────────────────────────
Lastly, tableX() is a pipe-able two-way version of
table() with a similar syntax to that of the rest of the
furniture functions.
df %>%
  tableX(c, ifelse(d > 500, 1, 0))
## ifelse(d > 500, 1, 0)
## c 0 1 Missing Total
## control 241 229 30 500
## treatment 229 201 70 500
## Total 470 430 100 1000
By default, it provides the total counts for the rows and columns with flexibility as to what is displayed and where.
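For comparison, a similar cross-tabulation with a missing-value column and totals can be assembled in base R with table() and addmargins(). The vectors `g` and `hi` below are hypothetical, and the sketch is only meant to show what tableX() saves you from writing.

```r
# Hypothetical grouping variable and binary indicator with a missing value
g  <- c("control", "control", "treatment", "treatment", "treatment")
hi <- c(1, 0, NA, 1, 1)

# Two-way counts, keeping NA as its own column
tab <- table(g, hi, useNA = "ifany")

# Add row and column totals
tab_totals <- addmargins(tab)

tab_totals
```

Unlike tableX(), base table() needs the vectors passed explicitly, which makes it awkward to use at the end of a pipe.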
The four functions: table1(), tableC(), tableF(), and washer() add
simplicity to cleaning up and understanding your data. Use these pieces
of furniture to make your quantitative life a bit easier.