In Bayesian statistics, the gamma distribution is the conjugate prior distribution for a Poisson likelihood. ‘Conjugate’ means that the posterior distribution will follow the same general form as the prior distribution. DuMouchel (1999) used a model with a Poisson(\(\mu_{ij}\)) likelihood for the counts (for row i and column j of the contingency table). We are interested in the ratio \(\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}\), where \(E_{ij}\) are the expected counts. The \(\lambda_{ij}\)s are considered random draws from a mixture of two gamma distributions (our prior) with hyperparameter \(\theta=(\alpha_1,\beta_1,\alpha_2,\beta_2,P)\), where \(P\) is the prior probability that \(\lambda\) came from the first component of the prior mixture (i.e., the mixture fraction). The prior is a single distribution that models all the cells in the table; however, there is a separate posterior distribution for each cell in the table. The posterior distribution of \(\lambda\), given count \(N=n\), is a mixture of two gamma distributions with parameters \(\theta=(\alpha_1+n,\beta_1+E,\alpha_2+n,\beta_2+E,Q_n)\) (subscripts suppressed for clarity), where \(Q_n\) is the probability that \(\lambda\) came from the first component of the posterior, given \(N=n\) (i.e., the mixture fraction).
The posterior distribution, in a sense, is a Bayesian representation of the relative reporting ratio, \(RR\) (note the similarity in the equations \(RR_{ij}=\frac{N_{ij}}{E_{ij}}\) and \(\lambda_{ij}=\frac{\mu_{ij}}{E_{ij}}\)). The Empirical Bayes (EB) metrics are taken from the posterior distribution. The Empirical Bayes Geometric Mean \((EBGM)\) is the antilog of the mean of the log2-transformed posterior distribution. The \(EBGM\) is therefore a measure of central tendency of the posterior distribution. The 5% and 95% quantiles of the posterior distributions can be used to create two-sided 90% credibility intervals for \(\lambda_{ij}\), given \(N_{ij}\) (i.e, our “sort of” RR). Alternatively, since we are most interested in the lower bound, we could ignore the upper bound and create a one-sided 95% credibility interval.
Due to Bayesian shrinkage (please see the Background section of the Introduction to openEBGM vignette), the EB scores are much more stable than \(RR\) for small counts.
Once the product/event combinations have been counted and the hyperparameters have been estimated, we can calculate the EB scores:
library(openEBGM)
data(caers) #subset of publicly available CAERS data
processed <- processRaw(caers, stratify = FALSE, zeroes = FALSE)
squashed <- squashData(processed)
squashed2 <- squashData(squashed, count = 2, bin_size = 10, keep_pts = 50)
theta_init <- data.frame(alpha1 = c(0.2, 0.1, 0.3, 0.5, 0.2),
beta1 = c(0.1, 0.1, 0.5, 0.3, 0.2),
alpha2 = c(2, 10, 6, 12, 5),
beta2 = c(4, 10, 6, 12, 5),
p = c(1/3, 0.2, 0.5, 0.8, 0.4)
)
hyper_estimates <- autoHyper(squashed2, theta_init = theta_init)
(theta_hat <- hyper_estimates$estimates)
#> alpha1 beta1 alpha2 beta2 P
#> 3.25380903 0.39989105 2.02613680 1.90808515 0.06534565
###Qn()
The Qn()
function calculates the mixture fractions for the posterior distributions. The values returned by Qn()
correspond to the probability that \(\lambda\) came from the first component of the posterior mixture distribution, given \(N=n\) (recall there is a \(\lambda|N=n\) for each cell in the table, but that each \(\lambda\) comes from a common distribution). Thus, the output from Qn()
returns a numeric vector of length equal to the total number of product-symptom combinations, which is also the number of rows in the data frame returned by processRaw()
. When calculating the \(Q_n\)s, be sure to use the full data set from processRaw()
– not the squashed data set or the raw data.
qn <- Qn(theta_hat, N = processed$N, E = processed$E)
head(qn)
#> [1] 0.2819735 0.3409649 0.3482312 0.2819735 0.2226355 0.2556668
identical(length(qn), nrow(processed))
#> [1] TRUE
summary(qn)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0001205 0.2340247 0.3089645 0.2846018 0.3397680 0.9999997
###ebgm()
The ebgm()
function calculates the Empirical Bayes Geometric Mean \((EBGM)\) scores. \(EBGM\) is a measure of central tendency of the posterior distributions, \(\lambda_{ij}|N=n\). Scores much larger than one indicate product/adverse event pairs that are reported at an unusually high rate.
processed$ebgm <- ebgm(theta_hat, N = processed$N, E = processed$E, qn = qn)
head(processed)
#> var1 var2 N E RR PRR
#> 1 1-PHENYLALANINE HEART RATE INCREASED 1 0.0360548272 27.74 27.96
#> 2 11 UNSPECIFIED VITAMINS ASTHMA 1 0.0038736591 258.15 279.58
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738 3356.00 Inf
#> 4 11 UNSPECIFIED VITAMINS CHEST PAIN 1 0.0360548272 27.74 27.96
#> 5 11 UNSPECIFIED VITAMINS DYSPNOEA 1 0.0765792610 13.06 13.11
#> 6 11 UNSPECIFIED VITAMINS HYPERSENSITIVITY 1 0.0527413588 18.96 19.06
#> ebgm
#> 1 2.23
#> 2 2.58
#> 3 2.63
#> 4 2.23
#> 5 1.92
#> 6 2.09
###quantBisect()
The quantBisect()
function calculates quantiles of the posterior distribution using the bisection method. quantBisect()
can calculate any quantile of the posterior distribution between 1 and 99%, and these quantiles can be used as limits for credibility intervals. Below, QUANT_05 is the 5th percentile; QUANT_95 is the 95th percentile. These form the lower and upper bounds of 90% credibility intervals for the Empirical Bayes (EB) scores.
processed$QUANT_05 <- quantBisect(5, theta_hat = theta_hat,
N = processed$N, E = processed$E, qn = qn)
processed$QUANT_95 <- quantBisect(95, theta_hat = theta_hat,
N = processed$N, E = processed$E, qn = qn)
head(processed)
#> var1 var2 N E RR PRR
#> 1 1-PHENYLALANINE HEART RATE INCREASED 1 0.0360548272 27.74 27.96
#> 2 11 UNSPECIFIED VITAMINS ASTHMA 1 0.0038736591 258.15 279.58
#> 3 11 UNSPECIFIED VITAMINS CARDIAC FUNCTION TEST 1 0.0002979738 3356.00 Inf
#> 4 11 UNSPECIFIED VITAMINS CHEST PAIN 1 0.0360548272 27.74 27.96
#> 5 11 UNSPECIFIED VITAMINS DYSPNOEA 1 0.0765792610 13.06 13.11
#> 6 11 UNSPECIFIED VITAMINS HYPERSENSITIVITY 1 0.0527413588 18.96 19.06
#> ebgm QUANT_05 QUANT_95
#> 1 2.23 0.49 13.85
#> 2 2.58 0.52 15.78
#> 3 2.63 0.52 16.02
#> 4 2.23 0.49 13.85
#> 5 1.92 0.47 11.77
#> 6 2.09 0.48 12.95
The EB-scores (\(EBGM\) and quantile scores) can be used to look for “signals” in the data. As stated in the Background section of the Introduction to openEBGM vignette, Bayesian shrinkage causes the EB-scores to be far more stable than their \(RR\) counterparts, which allows for better separation between signal and noise. One could, for example, look at all product-symptom combinations where QUANT_05 (the lower part of the 90% two-sided credibility interval) is 2 or greater. This is often used as a conservative alternative to \(EBGM\) since QUANT_05 scores are naturally smaller than \(EBGM\) scores. We can say with high confidence that the “true relative reporting ratios” of product/adverse event combinations above this threshold are much greater than 1, so those combinations are truly reported more than expected. The value of 2 is arbitrarily chosen, and depends on the context. Below is an example of how one may identify product-symptom combinations that require further investigation based on the EB-scores.
suspicious <- processed[processed$QUANT_05 >= 2, ]
nrow(suspicious); nrow(processed); nrow(suspicious)/nrow(processed)
#> [1] 131
#> [1] 17189
#> [1] 0.007621153
From above we see that less than 1% of product-symptom pairs are suspect based on the QUANT_05 score. One may look more closely at these product-symptom combinations to ascertain which products may need further investigation. Subject matter knowledge is required to determine which signals might identify a possible causal relationship. The EB-scores find statistical associations – not necessarily causal relationships.
suspicious <- suspicious[order(suspicious$QUANT_05, decreasing = TRUE),
c("var1", "var2", "N", "E", "QUANT_05", "ebgm",
"QUANT_95")]
head(suspicious, 5)
#> var1 var2 N
#> 13924 REUMOFAN PLUS WEIGHT INCREASED 16
#> 8187 HYDROXYCUT REGULAR RAPID RELEASE CAPLETS EMOTIONAL DISTRESS 19
#> 13886 REUMOFAN PLUS IMMOBILE 6
#> 7793 HYDROXYCUT HARDCORE CAPSULES CARDIO-RESPIRATORY DISTRESS 8
#> 8220 HYDROXYCUT REGULAR RAPID RELEASE CAPLETS INJURY 11
#> E QUANT_05 ebgm QUANT_95
#> 13924 0.40643623 15.68 23.26 33.48
#> 8187 0.89690107 11.65 16.78 23.55
#> 13886 0.07866508 10.16 18.28 30.83
#> 7793 0.30482718 9.00 15.25 24.52
#> 8220 0.56317044 8.98 14.28 21.78
tabbed <- table(suspicious$var1)
head(tabbed[order(tabbed, decreasing = TRUE)])
#>
#> HYDROXYCUT REGULAR RAPID RELEASE CAPLETS
#> 26
#> HYDROXYCUT HARDCORE CAPSULES
#> 13
#> REUMOFAN PLUS
#> 8
#> HYDROXYCUT CAPSULES
#> 5
#> HYDROXYCUT MAX LIQUID CAPS
#> 5
#> HYDROXYCUT CAFFEINE FREE CAPLETS
#> 4
The output above suggests some products which may require further investigation.
Next, the openEBGM Objects and Class Functions vignette will demonstrate the object-oriented features of the openEBGM package.