Extending apa_print()

Frederik Aust

library("papaja")
#> Lade nötiges Paket: tinylabels

Here, we provide a brief overview of issues to consider when implementing a new method for apa_print(), a convenience function to facilitate reporting of results in accordance with APA reporting guidelines. If you consider adding a new method, please review our brief contributing guidelines and code of conduct.

Deciding on a new method

If you are reporting the results of a statistical analysis that is not yet supported by apa_print() you probably have a good motivation and possibly prior work to build on. If you are just looking for a way to contribute to, take a look at the open issues for inspiration.

apa_print() is a generic, meaning it can, in principle, work on any output object with a class that is specific enough to purposefully extract the results of the analysis. For example, objects of class htest, as returned by t.test(), cor.test(), prop.test(), etc., are named lists that follow a loose convention about the named objects they contain.

t_test_example <- t.test(extra ~ group, data = sleep)

class(t_test_example)
#> [1] "htest"

str(t_test_example)
#> List of 10
#>  $ statistic  : Named num -1.86
#>   ..- attr(*, "names")= chr "t"
#>  $ parameter  : Named num 17.8
#>   ..- attr(*, "names")= chr "df"
#>  $ p.value    : num 0.0794
#>  $ conf.int   : num [1:2] -3.365 0.205
#>   ..- attr(*, "conf.level")= num 0.95
#>  $ estimate   : Named num [1:2] 0.75 2.33
#>   ..- attr(*, "names")= chr [1:2] "mean in group 1" "mean in group 2"
#>  $ null.value : Named num 0
#>   ..- attr(*, "names")= chr "difference in means between group 1 and group 2"
#>  $ stderr     : num 0.849
#>  $ alternative: chr "two.sided"
#>  $ method     : chr "Welch Two Sample t-test"
#>  $ data.name  : chr "extra by group"
#>  - attr(*, "class")= chr "htest"

Hence, if we pass an htest object to apa_print() the function expects there to be named elements in the list, such as statistic, estimate, or p.value. These expectations are reflected in the workings of the apa_print.htest() method. Objects of less specific classes, such as list or data.frame cannot be supported, because we cannot make any useful assumptions about their structure.

Default structure of output

Objects returned by apa_print() are of class apa_results, a named list with four elements:

#> $estimate
#> NULL
#> 
#> $statistic
#> NULL
#> 
#> $full_result
#> NULL
#> 
#> $table
#> NULL
#> 
#> attr(,"class")
#> [1] "apa_results" "list"

To illustrate how apa_results objects are populated, let’s look at the output of apa_print.lm().

# Data from Dobson (1990), p. 9.
ctl <- c(4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14)
trt <- c(4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69)
group <- gl(2, 10, 20, labels = c("Ctl", "Trt"))
weight <- c(ctl, trt)
lm_fit <- lm(weight ~ group)

lm_fit_apa <- apa_print(lm_fit)

The estimate element of the returned apa_results-list itself contains a named list with estimated parameters—in this case regression coefficients—and corresponding confidence intervals for the model. The names of the list correspond to the names of the predictors.

lm_fit_apa$estimate
#> $Intercept
#> [1] "$b = 5.03$, 95\\% CI $[4.57, 5.49]$"
#> 
#> $groupTrt
#> [1] "$b = -0.37$, 95\\% CI $[-1.03, 0.28]$"
#> 
#> $modelfit
#> $modelfit$r2
#> [1] "$R^2 = .07$, 90\\% CI $[0.00, 0.33]$"
#> 
#> $modelfit$r2_adj
#> [1] "$R^2_{adj} = .02$"
#> 
#> $modelfit$aic
#> [1] "$\\mathrm{AIC} = 46.18$"
#> 
#> $modelfit$bic
#> [1] "$\\mathrm{BIC} = 49.16$"

The estimate list may contain additional elements, such as the list modelfit, that contains quantitative estimates of the model fit.

The statistic element of the returned apa_results list contains a named list with the same structure as estimate. Instead of parameter estimates, statistic contains the corresponding inferential test statistics, such as significance tests or Bayesian model comparisons.

lm_fit_apa$statistic
#> $Intercept
#> [1] "$t(18) = 22.85$, $p < .001$"
#> 
#> $groupTrt
#> [1] "$t(18) = -1.19$, $p = .249$"
#> 
#> $modelfit
#> $modelfit$r2
#> [1] "$F(1, 18) = 1.42$, $p = .249$"

Note that the statistics list misses elements for the information criteria AIC and BIC. Because no inferential test statistics on the information criteria are available, it is fine to simply drop those elements.

The full_results element is a named list that simply combines the results of estimate and statistic for convenience in reporting.

lm_fit_apa$full_result
#> $Intercept
#> [1] "$b = 5.03$, 95\\% CI $[4.57, 5.49]$, $t(18) = 22.85$, $p < .001$"
#> 
#> $groupTrt
#> [1] "$b = -0.37$, 95\\% CI $[-1.03, 0.28]$, $t(18) = -1.19$, $p = .249$"
#> 
#> $modelfit
#> $modelfit$r2
#> [1] "$R^2 = .07$, 90\\% CI $[0.00, 0.33]$, $F(1, 18) = 1.42$, $p = .249$"

Finally, the table element contains a data.frame of class apa_results_table that summarizes the results. In essence this is simply a regular data.frame that follows the column-naming conventions used in broom but allows for prettier printing of variable labels.

lm_fit_apa$table
#> A data.frame with 6 labelled columns:
#> 
#>        term estimate      conf.int statistic df p.value
#> 1 Intercept     5.03  [4.57, 5.49]     22.85 18  < .001
#> 2  GroupTrt    -0.37 [-1.03, 0.28]     -1.19 18    .249
#> 
#> term     : Predictor 
#> estimate : $b$ 
#> conf.int : 95\\% CI 
#> statistic: $t$ 
#> df       : $\\mathit{df}$ 
#> p.value  : $p$

For more complex analyses table may contain a named list of apa_result_tables. We use tinylabels to set variable labels. These variable labels are attributes attached to each column and contain a typeset label for the respective column.

# library("tinylabels")

letters
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"

variable_label(letters) <- "Letters of the alphabet"
variable_label(letters)
#> [1] "Letters of the alphabet"

letters
#> Variable label     : Letters of the alphabet
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"

str(letters)
#>  'tiny_labelled' chr [1:26] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" ...
#>  - attr(*, "label")= chr "Letters of the alphabet"
lm_fit_apa$table$statistic
#> Variable label     : $t$
#> [1] "22.85" "-1.19"

Variable labels are automatically used by apa_table() and plotting functions from the apa_factorial_plot()-family to create sensible default labels. If a label is enveloped in $ it may contain LaTeX math syntax, which is automatically converted to R expressions using latex2exp for plotting.

Any new apa_print() method should output an object of this basic structure.

Typesetting numeric information

apa_results do not contain numeric information. Rather the numeric information has been processed for printing in accordance with APA guidelines. There are several papaja-functions to facilitate the typesetting. apa_num() is a flexible general purpose function that wraps formatC() and can be used to round, set decimal as well as thousands separators, or remove leading zeros.

x <- rnorm(3) * 1e4
apa_num(x)
#> [1] "-17,402.11" "-3,444.68"  "-14,935.10"

apa_num(x, digits = 3, big.mark = ".", decimal.mark = ",")
#> [1] "-17.402,114" "-3.444,676"  "-14.935,103"

apa_num(Inf)
#> [1] "$\\infty$"

apa_p() is a wrapper for apa_num() that sets appropriate defaults to report p values in accordance with APA guidelines.

apa_p(c(0.0001, 0.05, 0.99999))
#> [1] "< .001" ".050"   "> .999"

The internal function apa_df() is geared towards typesetting degrees of freedom.

apa_df(c(12, 12.485))
#> [1] "12"    "12.48"

apa_df(12L)
#> [1] "12"

Finally, apa_interval() can be used to typeset interval estimates.

apa_interval(rnorm(2), conf.int = 0.95, interval_type = "CI")
#> [1] "95\\% CI [-0.24, -0.38]"

Again, there are two wrappers that set appropriate defaults to typeset frequentist confidence intervals and Bayesian highest-density intervals.

apa_confint(rnorm(2), conf.int = 0.95)
#> [1] "95\\% CI [-0.63, 0.47]"

apa_hdint(rnorm(2), conf.int = 0.95)
#> [1] "95\\% HDI [0.96, -1.50]"

Typesetting model terms

When creating named lists from terms, these terms names should use _ as separator, and be valid R names. Adhering to these conventions ensures that apa_results can conveniently be indexed using the $ operator.

To facilitate the generation of list names, papaja provides the internal function sanitize_terms().

mod_terms <- c("(Intercept)", "Factor A", "Factor B",
               "Factor A:Factor B", "scale(Factor A)")
sanitize_terms(mod_terms, standardized = TRUE)
#> [1] "Intercept"         "Factor_A"          "Factor_B"         
#> [4] "Factor_A_Factor_B" "z_Factor_A"

While these sanitized terms are well suited to name R objects, they are not ideal for reporting. To facilitate typesetting term names for reporting, there is another internal function beautify_terms().

beautify_terms(mod_terms, standardized = TRUE)
#> [1] "Intercept"                   "Factor A"                   
#> [3] "Factor B"                    "Factor A $\\times$ Factor B"
#> [5] "Factor A"

Internal workflow

Method dispatch

As with lm objects, it is often the case that the objects, as returned by the analysis function, may not contain all information necessary to populate the lists described above. For example, to obtain inferential statistics it may be necessary to call summary().

npk_aov <- aov(yield ~ block + N * P * K, npk)
npk_aov
#> Call:
#>    aov(formula = yield ~ block + N * P * K, data = npk)
#> 
#> Terms:
#>                    block        N        P        K      N:P      N:K      P:K
#> Sum of Squares  343.2950 189.2817   8.4017  95.2017  21.2817  33.1350   0.4817
#> Deg. of Freedom        5        1        1        1        1        1        1
#>                 Residuals
#> Sum of Squares   185.2867
#> Deg. of Freedom        12
#> 
#> Residual standard error: 3.929447
#> 1 out of 13 effects not estimable
#> Estimated effects may be unbalanced

summary(npk_aov)
#>             Df Sum Sq Mean Sq F value  Pr(>F)   
#> block        5  343.3   68.66   4.447 0.01594 * 
#> N            1  189.3  189.28  12.259 0.00437 **
#> P            1    8.4    8.40   0.544 0.47490   
#> K            1   95.2   95.20   6.166 0.02880 * 
#> N:P          1   21.3   21.28   1.378 0.26317   
#> N:K          1   33.1   33.14   2.146 0.16865   
#> P:K          1    0.5    0.48   0.031 0.86275   
#> Residuals   12  185.3   15.44                   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is why there are usually multiple apa_print()-methods that are called subsequently to make the function both flexible and convenient. For convenience, apa_print.aov() calls summary() with its default arguments and passes the result onto apa_print.summary.aov().

papaja:::apa_print.aov
#> function (x, estimate = getOption("papaja.estimate_anova", "ges"), 
#>     observed = NULL, intercept = FALSE, mse = TRUE, in_paren = FALSE, 
#>     ...) 
#> {
#>     apa_print(summary(x), .x = x, intercept = intercept, estimate = estimate, 
#>         mse = mse, observed = observed, in_paren = in_paren, 
#>         ...)
#> }
#> <bytecode: 0x7fc10c899ac0>
#> <environment: namespace:papaja>

This approach also ensures that a variety of object types are supported while minimizing code redundancy.

Restructuring results

The internals of apa_print() heavily rely on broom, a package to conveniently restructure the output of analysis functions into tidy data.frames. The objects are often processed using broom::tidy(), and broom::glance() if necessary, before being modified further to create the contents of the table element.

Once the results table has been assembled, numeric values have been typeset, and variable labels have been assigned glue_apa_results() can be used to create an apa_results object according to the above specifications. Consider the following example of an lm-object. First we tidy() and glance() the object to obtain tidy results. We than typeset all “special” numerical results, that is, all results that would not be typeset appropriately by applying apa_num() with its default settings. Moreover, we combine the separate columns for lower and upper confidence interval bounds into one column conf.int which contains the complete confidence interval.

lm_fit <- lm(mpg ~ cyl + wt, mtcars)

# Tidy and typeset output
library("broom")
tidy_lm_fit <- tidy(lm_fit, conf.int = TRUE)
tidy_lm_fit$p.value <- apa_p(tidy_lm_fit$p.value)
tidy_lm_fit$conf.int <- unlist(apa_confint(tidy_lm_fit[, c("conf.low", "conf.high")]))

str(tidy_lm_fit)
#> tibble [3 × 8] (S3: tbl_df/tbl/data.frame)
#>  $ term     : chr [1:3] "(Intercept)" "cyl" "wt"
#>  $ estimate : num [1:3] 39.69 -1.51 -3.19
#>  $ std.error: num [1:3] 1.715 0.415 0.757
#>  $ statistic: num [1:3] 23.14 -3.64 -4.22
#>  $ p.value  : chr [1:3] "< .001" ".001" "< .001"
#>  $ conf.low : num [1:3] 36.18 -2.36 -4.74
#>  $ conf.high: num [1:3] 43.19 -0.66 -1.64
#>  $ conf.int : chr [1:3] "[36.18, 43.19]" "[-2.36, -0.66]" "[-4.74, -1.64]"

glance_lm_fit <- glance(lm_fit)
glance_lm_fit$r.squared <- apa_num(glance_lm_fit$r.squared, gt1 = FALSE)
glance_lm_fit$p.value <- apa_p(glance_lm_fit$p.value)
glance_lm_fit$df <- apa_df(glance_lm_fit$df)
glance_lm_fit$df.residual <- apa_df(glance_lm_fit$df.residual)

str(glance_lm_fit)
#> tibble [1 × 12] (S3: tbl_df/tbl/data.frame)
#>  $ r.squared    : chr ".83"
#>  $ adj.r.squared: num 0.819
#>  $ sigma        : num 2.57
#>  $ statistic    : Named num 70.9
#>   ..- attr(*, "names")= chr "value"
#>  $ p.value      : chr "< .001"
#>  $ df           : chr "2"
#>  $ logLik       : num -74
#>  $ AIC          : num 156
#>  $ BIC          : num 162
#>  $ deviance     : num 191
#>  $ df.residual  : chr "29"
#>  $ nobs         : int 32

Next, we typeset the remaining numeric columns and assign informative variable labels:

tidy_lm_fit <- apa_num(tidy_lm_fit)

variable_labels(tidy_lm_fit) <- c(
  term = "Term"
  , estimate = "$b$"
  , statistic = paste0("$t(", glance_lm_fit$df.residual, ")")
  , p.value = "$p$"
  , conf.int = "95% CI"
)

glance_lm_fit <- apa_num(glance_lm_fit)

variable_labels(glance_lm_fit) <- c(
  r.squared = "$R^2$"
  , statistic = "$F$"
  , p.value = "$p$"
  , AIC = "$\\mathrm{AIC}$"
)

Now we can use glue_apa_results() to create the output object. In doing so, we use the internal function construct_glue() to automatically determine the correct “glue” of the reporting string. Let’s first examine the glue.

papaja:::construct_glue(tidy_lm_fit, "estimate")
#> [1] "$<<svl(estimate)>> = <<estimate>>$, <<svl(conf.int, use_math = TRUE)>> $<<conf.int>>$"

The character string contains a combination of text and to-be-evaluated R code enveloped in << and >>. All variable names (e.g. estimate) are assumed to be columns of x (here tidy_lm_fit) or any additional object passed to glue_apa_results() via .... svl() is a function that returns a column variable label but, by default, remove the math-environment tags ($) as these are not needed here.

lm_results <- glue_apa_results(
    x = tidy_lm_fit
    , est_glue = papaja:::construct_glue(tidy_lm_fit, "estimate")
    , stat_glue = papaja:::construct_glue(tidy_lm_fit, "statistic")
    , term_names = sanitize_terms(tidy_lm_fit$term)
)

lm_results
#> $estimate
#> $estimate$Intercept
#> [1] "$b = 39.69$, 95% CI $[36.18, 43.19]$"
#> 
#> $estimate$cyl
#> [1] "$b = -1.51$, 95% CI $[-2.36, -0.66]$"
#> 
#> $estimate$wt
#> [1] "$b = -3.19$, 95% CI $[-4.74, -1.64]$"
#> 
#> 
#> $statistic
#> $statistic$Intercept
#> [1] "$t(29) = 23.14$, $p < .001$"
#> 
#> $statistic$cyl
#> [1] "$t(29) = -3.64$, $p = .001$"
#> 
#> $statistic$wt
#> [1] "$t(29) = -4.22$, $p < .001$"
#> 
#> 
#> $full_result
#> $full_result$Intercept
#> [1] "$b = 39.69$, 95% CI $[36.18, 43.19]$, $t(29) = 23.14$, $p < .001$"
#> 
#> $full_result$cyl
#> [1] "$b = -1.51$, 95% CI $[-2.36, -0.66]$, $t(29) = -3.64$, $p = .001$"
#> 
#> $full_result$wt
#> [1] "$b = -3.19$, 95% CI $[-4.74, -1.64]$, $t(29) = -4.22$, $p < .001$"
#> 
#> 
#> $table
#>        term estimate std.error statistic p.value conf.low conf.high
#> 1 Intercept    39.69      1.71     23.14  < .001    36.18     43.19
#> 2       Cyl    -1.51      0.41     -3.64    .001    -2.36     -0.66
#> 3        Wt    -3.19      0.76     -4.22  < .001    -4.74     -1.64
#>         conf.int
#> 1 [36.18, 43.19]
#> 2 [-2.36, -0.66]
#> 3 [-4.74, -1.64]
#> 
#> attr(,"class")
#> [1] "apa_results" "list"

If we need to add additional information to this output, we can use add_glue_to_apa_results(). This function takes an existing output and adds new strings to a specific sublist. So, let’s add some model fit information to the output.

add_glue_to_apa_results(
  .x = glance_lm_fit
  , container = lm_results
  , sublist = "modelfit"
  , est_glue = c(
      r2 = "$<<svl(r.squared)>> = <<r.squared>>$"
      , aic = ""
  )
  , stat_glue = c(
      r2 = papaja:::construct_glue(glance_lm_fit, "statistic")
      , aic = "$<<svl(AIC)>> = <<AIC>>$"
  )
)
#> $estimate
#> $estimate$Intercept
#> [1] "$b = 39.69$, 95% CI $[36.18, 43.19]$"
#> 
#> $estimate$cyl
#> [1] "$b = -1.51$, 95% CI $[-2.36, -0.66]$"
#> 
#> $estimate$wt
#> [1] "$b = -3.19$, 95% CI $[-4.74, -1.64]$"
#> 
#> $estimate$modelfit
#>            r2 
#> "$R^2 = .83$" 
#> 
#> 
#> $statistic
#> $statistic$Intercept
#> [1] "$t(29) = 23.14$, $p < .001$"
#> 
#> $statistic$cyl
#> [1] "$t(29) = -3.64$, $p = .001$"
#> 
#> $statistic$wt
#> [1] "$t(29) = -4.22$, $p < .001$"
#> 
#> $statistic$modelfit
#> $statistic$modelfit$r2
#> [1] "$F(2, 29) = 70.91$, $p < .001$"
#> 
#> $statistic$modelfit$aic
#> [1] "$\\mathrm{AIC} = 156.01$"
#> 
#> 
#> 
#> $full_result
#> $full_result$Intercept
#> [1] "$b = 39.69$, 95% CI $[36.18, 43.19]$, $t(29) = 23.14$, $p < .001$"
#> 
#> $full_result$cyl
#> [1] "$b = -1.51$, 95% CI $[-2.36, -0.66]$, $t(29) = -3.64$, $p = .001$"
#> 
#> $full_result$wt
#> [1] "$b = -3.19$, 95% CI $[-4.74, -1.64]$, $t(29) = -4.22$, $p < .001$"
#> 
#> $full_result$modelfit
#> $full_result$modelfit$r2
#> [1] "$R^2 = .83$, $F(2, 29) = 70.91$, $p < .001$"
#> 
#> $full_result$modelfit$aic
#> [1] "$\\mathrm{AIC} = 156.01$"
#> 
#> 
#> 
#> $table
#>        term estimate std.error statistic p.value conf.low conf.high
#> 1 Intercept    39.69      1.71     23.14  < .001    36.18     43.19
#> 2       Cyl    -1.51      0.41     -3.64    .001    -2.36     -0.66
#> 3        Wt    -3.19      0.76     -4.22  < .001    -4.74     -1.64
#>         conf.int
#> 1 [36.18, 43.19]
#> 2 [-2.36, -0.66]
#> 3 [-4.74, -1.64]
#> 
#> attr(,"class")
#> [1] "apa_results" "list"

User interface

A final issue to consider is that users may pass inappropriate input to apa_print(). To ensure that we return correct output or informative error messages, we need input validation. Currently, papaja relies on the internal function validate() for this.

in_paren <- TRUE
papaja:::validate(in_paren, check_class = "logical", check_length = 1)
#> [1] TRUE

Please use either validate() or perform input validation using the assertthat package.