Confidence intervals and tests in emmeans

emmeans package, Version 1.8.1.1

Contents

This vignette describes various ways of summarizing emmGrid objects.

  1. summary(), confint(), and test()
  2. Back-transforming to response scale (See also the “transformations” vignette)
  3. Multiplicity adjustments
  4. Using “by” variables
  5. Joint (omnibus) tests
  6. Testing equivalence, noninferiority, nonsuperiority
  7. Graphics (in “basics” vignette)

Index of all vignette topics

summary(), confint(), and test()

The most important method for emmGrid objects is summary(). For one thing, it is called by default when you display an emmeans() result. The summary() function has a lot of options, and the detailed documentation via help("summary.emmGrid") is worth a look.

For ongoing illustrations, let’s re-create some of the objects in the “basics” vignette for the pigs example:

pigs.lm1 <- lm(log(conc) ~ source + factor(percent), data = pigs)
pigs.rg <- ref_grid(pigs.lm1)
pigs.emm.s <- emmeans(pigs.rg, "source")

Just summary(<object>) by itself will produce a summary that varies somewhat according to context. It does this by setting different defaults for the infer argument, which consists of two logical values, specifying confidence intervals and tests, respectively. [The exception is models fitted using MCMC methods, where summary() is diverted to the hpd.summary() function, a preferable summary for many Bayesians.]

The summary of a newly made reference grid will show just estimates and standard errors, but not confidence intervals or tests (that is, infer = c(FALSE, FALSE)). The summary of an emmeans() result, as we see above, will have intervals, but no tests (i.e., infer = c(TRUE, FALSE)); and the result of a contrast() call (see comparisons and contrasts) will show test statistics and P values, but not intervals (i.e., infer = c(FALSE, TRUE)). There are courtesy methods confint() and test() that just call summary() with the appropriate infer setting; for example,

test(pigs.emm.s)
##  source emmean     SE df t.ratio p.value
##  fish     3.39 0.0367 23  92.540  <.0001
##  soy      3.67 0.0374 23  97.929  <.0001
##  skim     3.80 0.0394 23  96.407  <.0001
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale.

It is not particularly useful, though, to test these EMMs against the default of zero – which is why tests are not usually shown. It makes a lot more sense to test them against some target concentration, say 40. And suppose we want to do a one-sided test to see if the concentration is greater than 40. Remembering that the response is log-transformed in this model,

test(pigs.emm.s, null = log(40), side = ">")
##  source emmean     SE df null t.ratio p.value
##  fish     3.39 0.0367 23 3.69  -8.026  1.0000
##  soy      3.67 0.0374 23 3.69  -0.577  0.7153
##  skim     3.80 0.0394 23 3.69   2.740  0.0058
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale. 
## P values are right-tailed

It is also possible to add calculated columns to the summary, via the calc argument. The calculations can include any columns up through df in the summary, as well as any variable in the object’s grid slot. Among the latter are usually weights in a column named .wgt., and we can use that to include sample size in the summary:

confint(pigs.emm.s, calc = c(n = ~.wgt.))
##  source emmean     SE df  n lower.CL upper.CL
##  fish     3.39 0.0367 23 10     3.32     3.47
##  soy      3.67 0.0374 23 10     3.59     3.74
##  skim     3.80 0.0394 23  9     3.72     3.88
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

Back to Contents

Back-transforming

Transformations and link functions are supported an several ways in emmeans, making this a complex topic worthy of its own vignette. Here, we show just the most basic approach. Namely, specifying the argument type = "response" will cause the displayed results to be back-transformed to the response scale, when a transformation or link function is incorporated in the model. For example, let’s try the preceding test() call again:

test(pigs.emm.s, null = log(40), side = ">", type = "response")
##  source response   SE df null t.ratio p.value
##  fish       29.8 1.09 23   40  -8.026  1.0000
##  soy        39.1 1.47 23   40  -0.577  0.7153
##  skim       44.6 1.75 23   40   2.740  0.0058
## 
## Results are averaged over the levels of: percent 
## P values are right-tailed 
## Tests are performed on the log scale

Note what changes and what doesn’t change. In the test() call, we still use the log of 40 as the null value; null must always be specified on the linear-prediction scale, in this case the log. In the output, the displayed estimates, as well as the null value, are shown back-transformed. As well, the standard errors are altered (using the delta method). However, the t ratios and P values are identical to the preceding results. That is, the tests themselves are still conducted on the linear-predictor scale (as is noted in the output).

Similar statements apply to confidence intervals on the response scale:

confint(pigs.emm.s, side = ">", level = .90, type = "response")
##  source response   SE df lower.CL upper.CL
##  fish       29.8 1.09 23     28.4      Inf
##  soy        39.1 1.47 23     37.3      Inf
##  skim       44.6 1.75 23     42.3      Inf
## 
## Results are averaged over the levels of: percent 
## Confidence level used: 0.9 
## Intervals are back-transformed from the log scale

With side = ">", a lower confidence limit is computed on the log scale, then that limit is back-transformed to the response scale. (We have also illustrated how to change the confidence level.)

Back to Contents

Multiplicity adjustments

Both tests and confidence intervals may be adjusted for simultaneous inference. Such adjustments ensure that the confidence coefficient for a whole set of intervals is at least the specified level, or to control for multiplicity in a whole family of tests. This is done via the adjust argument. For ref_grid() and emmeans() results, the default is adjust = "none". For most contrast() results, adjust is often something else, depending on what type of contrasts are created. For example, pairwise comparisons default to adjust = "tukey", i.e., the Tukey HSD method. The summary() function sometimes changes adjust if it is inappropriate. For example, with

confint(pigs.emm.s, adjust = "tukey")
## Note: adjust = "tukey" was changed to "sidak"
## because "tukey" is only appropriate for one set of pairwise comparisons
##  source emmean     SE df lower.CL upper.CL
##  fish     3.39 0.0367 23     3.30     3.49
##  soy      3.67 0.0374 23     3.57     3.76
##  skim     3.80 0.0394 23     3.70     3.90
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95 
## Conf-level adjustment: sidak method for 3 estimates

the adjustment is changed to the Sidak method because the Tukey adjustment is inappropriate unless you are doing pairwise comparisons.

An adjustment method that is usually appropriate is Bonferroni; however, it can be quite conservative. Using adjust = "mvt" is the closest to being the “exact” all-around method “single-step” method, as it uses the multivariate t distribution (and the mvtnorm package) with the same covariance structure as the estimates to determine the adjustment. However, this comes at high computational expense as the computations are done using simulation techniques. For a large set of tests (and especially confidence intervals), the computational lag becomes noticeable if not intolerable.

For tests, adjust increases the P values over those otherwise obtained with adjust = "none". Compare the following adjusted tests with the unadjusted ones previously computed.

test(pigs.emm.s, null = log(40), side = ">", adjust = "bonferroni")
##  source emmean     SE df null t.ratio p.value
##  fish     3.39 0.0367 23 3.69  -8.026  1.0000
##  soy      3.67 0.0374 23 3.69  -0.577  1.0000
##  skim     3.80 0.0394 23 3.69   2.740  0.0175
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale. 
## P value adjustment: bonferroni method for 3 tests 
## P values are right-tailed

Back to Contents

“By” variables

Sometimes you want to break a summary down into smaller pieces; for this purpose, the by argument in summary() is useful. For example,

confint(pigs.rg, by = "source")
## source = fish:
##  percent prediction     SE df lower.CL upper.CL
##        9       3.22 0.0536 23     3.11     3.33
##       12       3.40 0.0493 23     3.30     3.50
##       15       3.44 0.0548 23     3.32     3.55
##       18       3.52 0.0547 23     3.41     3.63
## 
## source = soy:
##  percent prediction     SE df lower.CL upper.CL
##        9       3.49 0.0498 23     3.39     3.60
##       12       3.67 0.0489 23     3.57     3.77
##       15       3.71 0.0507 23     3.61     3.82
##       18       3.79 0.0640 23     3.66     3.93
## 
## source = skim:
##  percent prediction     SE df lower.CL upper.CL
##        9       3.62 0.0501 23     3.52     3.73
##       12       3.80 0.0494 23     3.70     3.90
##       15       3.84 0.0549 23     3.73     3.95
##       18       3.92 0.0646 23     3.79     4.06
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

If there is also an adjust in force when by variables are used, by default, the adjustment is made separately on each by group; e.g., in the above, we would be adjusting for sets of 4 intervals, not all 12 together (but see “cross-adjustments” below.)

There can be a by specification in emmeans() (or equivalently, a | in the formula); and if so, it is passed on to summary() and used unless overridden by another by. Here are examples, not run:

emmeans(pigs.lm, ~ percent | source)     ### same results as above
summary(.Last.value, by = percent)       ### grouped the other way

Specifying by = NULL will remove all grouping.

Adjustments across by groups

As was mentioned, each by group is regarded as a separate family with regards to the adjust procedure. For example, consider a model with interaction for the warpbreaks data, and construct pairwise comparisons of tension by wool:

warp.lm <- lm(breaks ~ wool * tension, data = warpbreaks)
warp.pw <- pairs(emmeans(warp.lm, ~ tension | wool))
warp.pw
## wool = A:
##  contrast estimate   SE df t.ratio p.value
##  L - M      20.556 5.16 48   3.986  0.0007
##  L - H      20.000 5.16 48   3.878  0.0009
##  M - H      -0.556 5.16 48  -0.108  0.9936
## 
## wool = B:
##  contrast estimate   SE df t.ratio p.value
##  L - M      -0.556 5.16 48  -0.108  0.9936
##  L - H       9.444 5.16 48   1.831  0.1704
##  M - H      10.000 5.16 48   1.939  0.1389
## 
## P value adjustment: tukey method for comparing a family of 3 estimates

We have two sets of 3 comparisons, and the (default) Tukey adjustment is made separately in each group.

However, sometimes we want the multiplicity adjustment to be broader. This broadening can be done in two ways. One is to remove the by variable, which then treats all results as one family. In our example:

test(warp.pw, by = NULL, adjust = "bonferroni")
##  contrast wool estimate   SE df t.ratio p.value
##  L - M    A      20.556 5.16 48   3.986  0.0014
##  L - H    A      20.000 5.16 48   3.878  0.0019
##  M - H    A      -0.556 5.16 48  -0.108  1.0000
##  L - M    B      -0.556 5.16 48  -0.108  1.0000
##  L - H    B       9.444 5.16 48   1.831  0.4396
##  M - H    B      10.000 5.16 48   1.939  0.3504
## 
## P value adjustment: bonferroni method for 6 tests

This accomplishes the goal of putting all the comparisons in one family of 6 comparisons. Note that the Tukey adjustment may not be used here because we no longer have one set of pairwise comparisons.

An alternative is to specify cross.adjust, which specifies an additional adjustment method to apply to corresponding sets of within-group adjusted P values:

test(warp.pw, adjust = "tukey", cross.adjust = "bonferroni")
## wool = A:
##  contrast estimate   SE df t.ratio p.value
##  L - M      20.556 5.16 48   3.986  0.0013
##  L - H      20.000 5.16 48   3.878  0.0018
##  M - H      -0.556 5.16 48  -0.108  1.0000
## 
## wool = B:
##  contrast estimate   SE df t.ratio p.value
##  L - M      -0.556 5.16 48  -0.108  1.0000
##  L - H       9.444 5.16 48   1.831  0.3407
##  M - H      10.000 5.16 48   1.939  0.2777
## 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## Cross-group P-value adjustment: bonferroni

These adjustments are less conservative than the previous result, but it is still a conservative adjustment to the set of 6 tests. Had we also specified adjust = "bonferroni", we would have obtained the same adjusted P values as we obtained with by = NULL.

Simple comparisons

There is also a simple argument for contrast() that is in essence the inverse of by; the contrasts are run using everything except the specified variables as by variables. To illustrate, let’s consider the model for pigs that includes the interaction (so that the levels of one factor compare differently at levels of the other factor).

pigsint.lm <- lm(log(conc) ~ source * factor(percent), data = pigs)
pigsint.rg <- ref_grid(pigsint.lm)
contrast(pigsint.rg, "consec", simple = "percent")
## source = fish:
##  contrast              estimate     SE df t.ratio p.value
##  percent12 - percent9    0.1849 0.1061 17   1.742  0.2357
##  percent15 - percent12   0.0045 0.1061 17   0.042  0.9999
##  percent18 - percent15   0.0407 0.1061 17   0.383  0.9626
## 
## source = soy:
##  contrast              estimate     SE df t.ratio p.value
##  percent12 - percent9    0.1412 0.0949 17   1.487  0.3593
##  percent15 - percent12  -0.0102 0.0949 17  -0.108  0.9992
##  percent18 - percent15   0.0895 0.1342 17   0.666  0.8572
## 
## source = skim:
##  contrast              estimate     SE df t.ratio p.value
##  percent12 - percent9    0.2043 0.0949 17   2.152  0.1179
##  percent15 - percent12   0.1398 0.1061 17   1.317  0.4521
##  percent18 - percent15   0.1864 0.1424 17   1.309  0.4568
## 
## Results are given on the log (not the response) scale. 
## P value adjustment: mvt method for 3 tests

In fact, we may do all one-factor comparisons by specifying simple = "each". This typically produces a lot of output, so use it with care.

Back to Contents

Joint tests

From the above, we already know how to test individual results. For pairwise comparisons (details in the “comparisons” vignette), we might do

pigs.prs.s <- pairs(pigs.emm.s)
pigs.prs.s
##  contrast    estimate     SE df t.ratio p.value
##  fish - soy    -0.273 0.0529 23  -5.153  0.0001
##  fish - skim   -0.402 0.0542 23  -7.428  <.0001
##  soy - skim    -0.130 0.0530 23  -2.442  0.0570
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale. 
## P value adjustment: tukey method for comparing a family of 3 estimates

But suppose we want an omnibus test that all these comparisons are zero. Easy enough, using the joint argument in test (note: the joint argument is not available in summary(); only in test()):

test(pigs.prs.s, joint = TRUE)
##  df1 df2 F.ratio p.value note
##    2  23  28.849  <.0001  d  
## 
## d: df1 reduced due to linear dependence

Notice that there are three comparisons, but only 2 d.f. for the test, as cautioned in the message.

The test produced with joint = TRUE is a “type III” test (assuming the default equal weights are used to obtain the EMMs). See more on these types of tests for higher-order effects in the “interactions” vignette section on contrasts.

For convenience, there is also a joint_tests() function that performs joint tests of contrasts among each term in a model or emmGrid object.

joint_tests(pigsint.rg)
##  model term     df1 df2 F.ratio p.value
##  source           2  17  30.256  <.0001
##  percent          3  17   8.214  0.0013
##  source:percent   6  17   0.926  0.5011

The tests of main effects are of families of contrasts; those for interaction effects are for interaction contrasts. These results are essentially the same as a “Type-III ANOVA”, but may differ in situations where there are empty cells or other non-estimability issues, or if generalizations are present such as unequal weighting. (Another distinction is that sums of squares and mean squares are not shown; that is because these really are tests of contrasts among predictions, and they may or may not correspond to model sums of squares.)

One may use by variables with joint_tests. For example:

joint_tests(pigsint.rg, by = "source")
## source = fish:
##  model term df1 df2 F.ratio p.value
##  percent      3  17   1.712  0.2023
## 
## source = soy:
##  model term df1 df2 F.ratio p.value
##  percent      3  17   1.290  0.3097
## 
## source = skim:
##  model term df1 df2 F.ratio p.value
##  percent      3  17   6.676  0.0035

In some models, it is possible to specify submodel = "type2", thereby obtaining something akin to a Type II analysis of variance. See the messy-data vignette for an example.

Back to Contents

Testing equivalence, noninferiority, and nonsuperiority

The delta argument in summary() or test() allows the user to specify a threshold value to use in a test of equivalence, noninferiority, or nonsuperiority. An equivalence test is kind of a backwards significance test, where small P values are associated with small differences relative to a specified threshold value delta. The help page for summary.emmGrid gives the details of these tests. Suppose in the present example, we consider two sources to be equivalent if they are within 25% of each other. We can test this as follows:

test(pigs.prs.s, delta = log(1.25), adjust = "none")
##  contrast    estimate     SE df t.ratio p.value
##  fish - soy    -0.273 0.0529 23   0.937  0.8209
##  fish - skim   -0.402 0.0542 23   3.308  0.9985
##  soy - skim    -0.130 0.0530 23  -1.765  0.0454
## 
## Results are averaged over the levels of: percent 
## Results are given on the log (not the response) scale. 
## Statistics are tests of equivalence with a threshold of 0.22314 
## P values are left-tailed

By our 25% standard, the P value is quite small for comparing soy and skim, providing statistical evidence that their difference is enough smaller than the threshold to consider them equivalent.

Back to Contents

Graphics

Graphical displays of emmGrid objects are described in the “basics” vignette

Index of all vignette topics