This document mainly contains examples showing how best to use summarytools in R Markdown documents. For a more in-depth view of the package’s features, please see `vignette("introduction", "summarytools")`; the online version can be found here.
Every time we display summarytools objects with `print()`, `view()`, or `stview()`, we pick, explicitly or not, one of several display methods. Possible display methods are: pander, render, viewer, and browser.
Disambiguation
To avoid any confusion, here is a small digression on the word method. It is a broad term which is also used in the OOP (object-oriented programming) lexicon to describe a special kind of function that is linked to a specific class of objects. In R, the `print()` function is called a generic function. It is generic because it accepts and handles many types of objects as input. According to the class (or classes) of the object, it dispatches the object to the particular print method dedicated to its class. In that sense, the `print.summarytools()` function is itself a method: objects of class “summarytools” are dispatched to it by the generic `print()` function.
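As a minimal base R sketch of this dispatch mechanism (the class name `toyclass` and its `label` field are hypothetical and have nothing to do with summarytools):

```r
# A toy object carrying a made-up class attribute.
obj <- structure(list(label = "demo"), class = "toyclass")

# The method: an ordinary function named <generic>.<class>.
print.toyclass <- function(x, ...) {
  cat("<toyclass> ", x$label, "\n", sep = "")
  invisible(x)
}

# The generic print() looks at class(obj) and hands the object
# over to print.toyclass().
print(obj)
```

summarytools relies on the same mechanism: calling `print()` on one of its objects ends up in `print.summarytools()`.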
In this document, the term method refers to the display method, not the OOP concept. It is one of the parameters for `print.summarytools()`, `view()`, and `stview()`. Since methods viewer and browser are mostly meant for interactive work and rely on the same underlying code as render, we will assume for the purpose of this document that there are really only two methods: pander and render.
The pander method is used by default when results are automatically printed to the console, or when we use `print()` without an explicit method argument. The style parameter is passed on to pander (see `?pander::pander` or visit its GitHub page to learn more about this very useful package).
When we use any of the viewer, browser, or render methods, the package uses htmltools instead to generate results; any specified style is thus ignored.
Available styles are the ones supported by pander, including:

- rmarkdown (recommended for all functions except `dfSummary()`)
- grid (recommended for `dfSummary()`)
- multiline (can be used with `dfSummary()` if you want ASCII graphs only)

Always set `results = 'asis'`, either explicitly on a chunk-by-chunk basis or by including `opts_chunk$set(results = 'asis')` in your setup chunk.

Also, don’t forget to specify `plain.ascii = FALSE` in all function calls using the pander method. It is advised to set this option, as well as the style option, in the setup chunk with `st_options(plain.ascii = FALSE, style = "rmarkdown")`, or at least to include `st_options(plain.ascii = FALSE)` there.

If you get repeated, unhelpful warnings, use chunk options `message = FALSE` and/or `warning = FALSE`.
The following table indicates which method / style is better suited for each summarytools function in the context of R Markdown documents:
Function | render method | pander method | pander style |
---|---|---|---|
freq() | ✓ | ✓ | rmarkdown |
ctable() | ✓ | Sub-optimal | rmarkdown |
descr() | ✓ | ✓ | rmarkdown |
dfSummary() | ✓ | ✓ | grid |
Recommended Style When Using pander method
For `freq()`, `descr()`, and `ctable()`, rmarkdown style is recommended. For `dfSummary()`, grid is recommended. Note that ‘multiline’ can also be used, but only ASCII graphs will be displayed.
Starting with `freq()`, we’ll now review the recommended methods and styles to get satisfying results in R Markdown documents.
`freq()` is best used with `style = 'rmarkdown'`; HTML rendering is also possible.
With `method = "pander"`, “rmarkdown” is the easy winner.
freq(tobacco$gender, plain.ascii = FALSE, style = 'rmarkdown')
explicit NA's detected - temporarily setting 'report.nas' to FALSE
tobacco$gender
Type: Factor
 | N | % | % Cum. |
---|---|---|---|
F | 489 | 48.90 | 48.90 |
M | 489 | 48.90 | 97.80 |
(Missing) | 22 | 2.20 | 100.00 |
Total | 1000 | 100.00 | 100.00 |
There are rarely any problems when using the render method to display `freq()` results.
print(freq(tobacco$gender), method = 'render')
explicit NA's detected - temporarily setting 'report.nas' to FALSE
gender | N | % | % Cum. |
---|---|---|---|
F | 489 | 48.90 | 48.90 |
M | 489 | 48.90 | 97.80 |
(Missing) | 22 | 2.20 | 100.00 |
Total | 1000 | 100.00 | 100.00 |
If you find the table too large, you can use `table.classes = 'st-small'`:
print(descr(tobacco), method = 'render', table.classes = 'st-small')
 | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Mean | 25.73 | 49.60 | 6.78 | 1.00 |
Std.Dev | 4.49 | 18.29 | 11.88 | 0.08 |
Min | 8.83 | 18.00 | 0.00 | 0.86 |
Q1 | 22.93 | 34.00 | 0.00 | 0.86 |
Median | 25.62 | 50.00 | 0.00 | 1.04 |
Q3 | 28.65 | 66.00 | 11.00 | 1.05 |
Max | 39.44 | 80.00 | 40.00 | 1.06 |
MAD | 4.18 | 23.72 | 0.00 | 0.01 |
IQR | 5.72 | 32.00 | 11.00 | 0.19 |
CV | 0.17 | 0.37 | 1.75 | 0.08 |
Skewness | 0.02 | -0.04 | 1.54 | -1.04 |
SE.Skewness | 0.08 | 0.08 | 0.08 | 0.08 |
Kurtosis | 0.26 | -1.26 | 0.90 | -0.90 |
N.Valid | 974 | 975 | 965 | 1000 |
Pct.Valid | 97.40 | 97.50 | 96.50 | 100.00 |
Tables with headings spanning two rows are not fully supported in markdown (yet), but the result is getting close to acceptable. This, however, is not true for all themes. That is why the render method is preferred.
ctable(tobacco$gender,
tobacco$smoker,
plain.ascii = FALSE,
style = 'rmarkdown')
gender * smoker
Data frame: tobacco
gender | smoker: Yes | smoker: No | Total |
---|---|---|---|
F | 147 (30.1%) | 342 (69.9%) | 489 (100.0%) |
M | 143 (29.2%) | 346 (70.8%) | 489 (100.0%) |
(Missing) | 8 (36.4%) | 14 (63.6%) | 22 (100.0%) |
Total | 298 (29.8%) | 702 (70.2%) | 1000 (100.0%) |
For best results with `ctable()`, use the render method.
print(ctable(tobacco$gender, tobacco$smoker), method = 'render')
gender | smoker: Yes | smoker: No | Total |
---|---|---|---|
F | 147 (30.1%) | 342 (69.9%) | 489 (100.0%) |
M | 143 (29.2%) | 346 (70.8%) | 489 (100.0%) |
(Missing) | 8 (36.4%) | 14 (63.6%) | 22 (100.0%) |
Total | 298 (29.8%) | 702 (70.2%) | 1000 (100.0%) |
`descr()` gives good results with both `style = 'rmarkdown'` and HTML rendering.
descr(tobacco, plain.ascii = FALSE, style = 'rmarkdown')
tobacco
N: 1000
 | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Mean | 25.73 | 49.60 | 6.78 | 1.00 |
Std.Dev | 4.49 | 18.29 | 11.88 | 0.08 |
Min | 8.83 | 18.00 | 0.00 | 0.86 |
Q1 | 22.93 | 34.00 | 0.00 | 0.86 |
Median | 25.62 | 50.00 | 0.00 | 1.04 |
Q3 | 28.65 | 66.00 | 11.00 | 1.05 |
Max | 39.44 | 80.00 | 40.00 | 1.06 |
MAD | 4.18 | 23.72 | 0.00 | 0.01 |
IQR | 5.72 | 32.00 | 11.00 | 0.19 |
CV | 0.17 | 0.37 | 1.75 | 0.08 |
Skewness | 0.02 | -0.04 | 1.54 | -1.04 |
SE.Skewness | 0.08 | 0.08 | 0.08 | 0.08 |
Kurtosis | 0.26 | -1.26 | 0.90 | -0.90 |
N.Valid | 974.00 | 975.00 | 965.00 | 1000.00 |
Pct.Valid | 97.40 | 97.50 | 96.50 | 100.00 |
We’ll use `table.classes = 'st-small'` to show how it affects the table’s size, compared to the `freq()` table rendered earlier.
We’ll also use `message = FALSE` as a chunk option to avoid the message saying that non-numerical variables have been ignored.
print(descr(tobacco), method = 'render', table.classes = 'st-small')
 | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Mean | 25.73 | 49.60 | 6.78 | 1.00 |
Std.Dev | 4.49 | 18.29 | 11.88 | 0.08 |
Min | 8.83 | 18.00 | 0.00 | 0.86 |
Q1 | 22.93 | 34.00 | 0.00 | 0.86 |
Median | 25.62 | 50.00 | 0.00 | 1.04 |
Q3 | 28.65 | 66.00 | 11.00 | 1.05 |
Max | 39.44 | 80.00 | 40.00 | 1.06 |
MAD | 4.18 | 23.72 | 0.00 | 0.01 |
IQR | 5.72 | 32.00 | 11.00 | 0.19 |
CV | 0.17 | 0.37 | 1.75 | 0.08 |
Skewness | 0.02 | -0.04 | 1.54 | -1.04 |
SE.Skewness | 0.08 | 0.08 | 0.08 | 0.08 |
Kurtosis | 0.26 | -1.26 | 0.90 | -0.90 |
N.Valid | 974 | 975 | 965 | 1000 |
Pct.Valid | 97.40 | 97.50 | 96.50 | 100.00 |
To get optimal results, whichever method you choose, it is best to omit at least one, and if possible two, columns from the output. Also, choose the value of the `graph.magnif` parameter carefully.
Don’t forget to specify `plain.ascii = FALSE` (or set it as a global option with `st_options(plain.ascii = FALSE)`), or you won’t get good results. (Note that, to avoid problems when uploading the package, the following is an image, not the actual rendering from this piece of code. This is because CRAN, for good reasons, doesn’t allow writing to /tmp or any other directory except R’s temp directory.)
dfSummary(tobacco,
plain.ascii = FALSE,
style = 'grid',
graph.magnif = 0.85,
varnumbers = FALSE,
valid.col = FALSE,
tmp.img.dir = "/tmp")
This method works really well, and not having to specify the `tmp.img.dir` parameter is a plus.
print(dfSummary(tobacco,
varnumbers = FALSE,
valid.col = FALSE,
graph.magnif = 0.76),
method = 'render')
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
---|---|---|---|---|
gender [factor] | | | | 0 (0.0%) |
age [numeric] | | 63 distinct values | | 25 (2.5%) |
age.gr [factor] | | | | 25 (2.5%) |
BMI [numeric] | | 974 distinct values | | 26 (2.6%) |
smoker [factor] | | | | 0 (0.0%) |
cigs.per.day [numeric] | | 37 distinct values | | 35 (3.5%) |
diseased [factor] | | | | 0 (0.0%) |
disease [character] | | | | 778 (77.8%) |
samp.wgts [numeric] | | | | 0 (0.0%) |
For data frames containing numerous variables, we can use the
max.tbl.height
argument to wrap the results in a scrollable
window having the specified height, in pixels.
print(dfSummary(tobacco,
varnumbers = FALSE,
valid.col = FALSE,
graph.magnif = 0.76),
max.tbl.height = 300,
method = "render")
Variable | Stats / Values | Freqs (% of Valid) | Graph | Missing |
---|---|---|---|---|
gender [factor] | | | | 0 (0.0%) |
age [numeric] | | 63 distinct values | | 25 (2.5%) |
age.gr [factor] | | | | 25 (2.5%) |
BMI [numeric] | | 974 distinct values | | 26 (2.6%) |
smoker [factor] | | | | 0 (0.0%) |
cigs.per.day [numeric] | | 37 distinct values | | 35 (3.5%) |
diseased [factor] | | | | 0 (0.0%) |
disease [character] | | | | 778 (77.8%) |
samp.wgts [numeric] | | | | 0 (0.0%) |
Some users reported getting repeated X11 warnings; those can easily be avoided by using the following chunk expression: `{r, results="asis", warning=FALSE}`.
As explained in the introductory vignette, `tb()` can be used to convert summarytools objects created with `freq()` and `descr()` to simple tibbles that packages specialized in table formatting will be able to process. This is particularly helpful with `stby` objects:
library(kableExtra)
library(magrittr)
stby(iris, iris$Species, descr, stats = "fivenum") %>%
tb() %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
Species | variable | min | q1 | med | q3 | max |
---|---|---|---|---|---|---|
setosa | Petal.Length | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
setosa | Petal.Width | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
setosa | Sepal.Length | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
setosa | Sepal.Width | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
versicolor | Petal.Length | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
versicolor | Petal.Width | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
versicolor | Sepal.Length | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
versicolor | Sepal.Width | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
virginica | Petal.Length | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
virginica | Petal.Width | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
virginica | Sepal.Length | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
virginica | Sepal.Width | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |
Using tb(order = 3)
flips the order of the grouping
variable(s) and the reported variable(s):
stby(iris, iris$Species, descr, stats = "fivenum") %>%
tb(order = 3) %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
variable | Species | min | q1 | med | q3 | max |
---|---|---|---|---|---|---|
Petal.Length | setosa | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
Petal.Length | versicolor | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
Petal.Length | virginica | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
Petal.Width | setosa | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
Petal.Width | versicolor | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
Petal.Width | virginica | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
Sepal.Length | setosa | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
Sepal.Length | versicolor | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
Sepal.Length | virginica | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
Sepal.Width | setosa | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
Sepal.Width | versicolor | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
Sepal.Width | virginica | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |
Here is a recipe for including fully formatted data frame summaries in pdf documents. There is some work involved, but following the instructions given here should give the expected results.
There are basically two parts to this: first, you must create a preamble tex file. Second, you must indicate in the YAML section of your document where to find this file.
This is the content that needs to be included as preamble. You can either copy this into your own tex file, or use the file that is now included in summarytools (as of version 1.0), following the instructions provided below.
\usepackage{graphicx}
\usepackage[export]{adjustbox}
\usepackage{letltxmacro}
\LetLtxMacro{\OldIncludegraphics}{\includegraphics}
\renewcommand{\includegraphics}[2][]{\raisebox{0.5\height}%
{\OldIncludegraphics[valign=t,#1]{#2}}}
If you choose to create a tex file from the above content, the name of the file is arbitrary – you can use whatever name you want. Its location is also up to you. I suggest you put it in the same location as your Rmd file.
Along with the `graph.magnif` parameter for `dfSummary()`, you might need to adjust the 0.5 value used as `raisebox` parameter in the preamble.
To use the preamble file included in summarytools, start your document with a YAML header like this one:
---
title: "My PDF With Data Frame Summaries"
output:
  pdf_document:
    latex_engine: xelatex
    includes:
      in_header:
        - !expr system.file("includes/fig-valign.tex", package = "summarytools")
---
If you need to customize the content of the preamble, save your modified tex file in the same location as your Rmd document; your header will then look something like this:
---
title: "My PDF With Data Frame Summaries"
output:
  pdf_document:
    latex_engine: xelatex
    includes:
      in_header: fig-valign-modified.tex
---
The xelatex engine option is not mandatory, but there are several advantages to it. I use it systematically and recommend you do the same.
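A minimal sketch of one way to obtain a local copy of the packaged preamble for customization, assuming summarytools 1.0 or later is installed. The destination name `fig-valign-modified.tex` simply matches the header above, and the variable names are illustrative:

```r
# Locate the preamble file shipped with summarytools.
src <- system.file("includes/fig-valign.tex", package = "summarytools")

# system.file() returns "" when the package or the file cannot be found.
if (nzchar(src)) {
  # Copy next to the Rmd file; an existing copy is left untouched.
  copied <- file.copy(src, "fig-valign-modified.tex", overwrite = FALSE)
} else {
  copied <- FALSE
  message("fig-valign.tex not found; is summarytools >= 1.0 installed?")
}
```

The copy can then be edited freely, for instance to change the 0.5 `raisebox` value, without touching the file installed with the package.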
Here is an example setup chunk:
```{r, message=FALSE}
library(summarytools)
st_options(
plain.ascii = FALSE,
style = "rmarkdown",
dfSummary.style = "grid",
dfSummary.valid.col = FALSE,
dfSummary.graph.magnif = .52,
subtitle.emphasis = FALSE,
tmp.img.dir = "/tmp"
)
```
And here is a chunk actually creating the summary:
```{r, results='asis', message=FALSE}
define_keywords(title.dfSummary = "Data Frame Summary in PDF Format")
dfSummary(tobacco)
```
Since we redefined the \(\LaTeX\) command `includegraphics`, all images included using `![](some-image.png)` will be impacted. In some cases this will likely be problematic. Eventually we will find a more robust solution without such undesired side effects. If you are well versed in \(\LaTeX\) and think you can solve this problem, please get in touch.
This vignette uses theme `rmarkdown::html_vignette`. Its YAML section looks like this:
---
title: "Summarytools in R Markdown Documents"
author: "Dominic Comtois"
date: "2022-05-19"
output:
  html_document:
    fig_caption: false
    toc: true
    toc_depth: 1
    css: assets/vignette.css
vignette: >
  %\VignetteIndexEntry{Summarytools in R Markdown Documents}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{magrittr}
  %\VignetteDepends{kableExtra}
---
The vignette.css file is copied from the installed rmarkdown package’s ‘templates/html_vignette/resources’ directory.
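A hedged sketch of how such a copy could be made. The full path inside the rmarkdown package and the `assets` destination folder are assumptions, based on the directory named above and on the `css:` entry of the YAML header; adjust them if your installation differs:

```r
# Assumed location of the stylesheet inside the installed rmarkdown package.
css_src <- system.file(
  "rmarkdown", "templates", "html_vignette", "resources", "vignette.css",
  package = "rmarkdown"
)

# system.file() returns "" when nothing is found at that path.
if (nzchar(css_src)) {
  dir.create("assets", showWarnings = FALSE)
  css_copied <- file.copy(css_src, file.path("assets", "vignette.css"))
} else {
  css_copied <- FALSE
  message("vignette.css not found; check the rmarkdown installation.")
}
```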
The following global options for knitr and summarytools have been set. Other options might also be useful to optimize content, but this is a good place to start from.
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(comment=NA,
prompt=FALSE,
cache=FALSE,
echo=TRUE,
results='asis')
st_options(bootstrap.css = FALSE, # Already part of the theme
plain.ascii = FALSE, # Essential setting for Rmd
style = "rmarkdown", # Essential setting for Rmd
dfSummary.silent = TRUE, # Hides redundant messages
footnote = NA, # Keeping the results minimal
subtitle.emphasis = FALSE) # For the vignette theme,
# this gives better results.
# For other themes, using
# TRUE might be preferable.
```
Finally, summarytools CSS has been included in the following manner:
```{r, echo=FALSE}
st_css(main = TRUE, global = TRUE)
```
This is by no means a definitive guide; depending on the themes you use, you could find that other settings yield better results. If you are looking to create a Word or a PDF document, you might want to try different combinations of options. If you find problems with the recommended settings or if you find better combinations, you are welcome to open an issue on GitHub to suggest modifications or make a pull request with your own improvements to this vignette.