Processing Raw Data with openEBGM

2020-03-02


Initial Calculations

## Using processRaw()

The processRaw() function calculates, for each product-symptom combination, the actual count \((N)\), the expected count \((E)\) under the assumption that rows (products) and columns (symptoms) are independent, the relative reporting ratio \((RR = N/E)\), and the proportional reporting ratio \((PRR)\). processRaw() has several parameters, some of which are shown below.

Suppose the data look like this:

dat
#>         var1    var2 id
#> 1  product_B event_1  1
#> 2  product_A event_1  2
#> 3  product_B event_2  3
#> 4  product_A event_1  4
#> 5  product_A event_1  5
#> 6  product_A event_1  6
#> 7  product_A event_2  7
#> 8  product_A event_2  8
#> 9  product_A event_1  9
#> 10 product_A event_2 10
#> 11 product_A event_2 11
#> 12 product_B event_2 12
#> 13 product_B event_1 13
#> 14 product_B event_2 14
#> 15 product_B event_1 15
#> 16 product_B event_2 16
#> 17 product_C event_1 17

We can calculate \(N\), \(E\), \(RR\), and \(PRR\) for the product-symptom pairs:

processRaw(data = dat, stratify = FALSE, zeroes = FALSE)
#>        var1    var2 N         E   RR  PRR
#> 1 product_A event_1 5 4.7647059 1.05 1.11
#> 2 product_A event_2 4 4.2352941 0.94 0.89
#> 3 product_B event_1 3 3.7058824 0.81 0.71
#> 4 product_B event_2 4 3.2941176 1.21 1.43
#> 5 product_C event_1 1 0.5294118 1.89 2.00
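The columns above can be reproduced by hand, which may help clarify what each one measures. The following is a minimal base R sketch, not the package's internal implementation. It assumes one report ID per row (as in the toy data), \(E\) equal to the product total times the event total divided by the grand total, and \(PRR\) equal to the event's share of reports for the product divided by its share among all other products. The names `res`, `n_prod`, and `n_event` are illustrative only.

```r
# Toy data copied from the example above (one report ID per row).
dat <- data.frame(
  var1 = paste0("product_", c("B", "A", "B", "A", "A", "A", "A", "A", "A",
                              "A", "A", "B", "B", "B", "B", "B", "C")),
  var2 = paste0("event_", c(1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1)),
  id   = 1:17,
  stringsAsFactors = FALSE
)

tab   <- table(dat$var1, dat$var2)  # N for every product-event cell
n_row <- rowSums(tab)               # reports per product
n_col <- colSums(tab)               # reports per event
n_tot <- sum(tab)                   # all reports

# Long format: one row per product-event pair.
res <- as.data.frame(tab, stringsAsFactors = FALSE)
names(res) <- c("var1", "var2", "N")
res$n_prod  <- unname(n_row[res$var1])
res$n_event <- unname(n_col[res$var2])

# Expected count under row/column independence (assumed formula).
res$E  <- res$n_prod * res$n_event / n_tot
res$RR <- res$N / res$E

# PRR (assumed formula): rate of the event for this product divided by
# the rate of the same event among all other products.
res$PRR <- (res$N / res$n_prod) /
  ((res$n_event - res$N) / (n_tot - res$n_prod))

# Mimic zeroes = FALSE by dropping pairs that never occur, then sort.
res <- res[res$N > 0, ]
res <- res[order(res$var1, res$var2), c("var1", "var2", "N", "E", "RR", "PRR")]
print(res, row.names = FALSE)
```

For example, product_A appears in 9 of the 17 reports and event_1 in 9 of the 17, so \(E = 9 \times 9 / 17 \approx 4.76\), which matches the first row of the processRaw() output.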

## Using stratification

Stratification can help control for confounding variables. For instance, food, cosmetics, and dietary supplements are often consumed at different rates by different genders and age groups. Similarly, adverse events associated with these products occur with varying rates. Therefore, we might wish to control for these variables when we examine the CAERS data.

Now assume the data look like this:

dat
#>         var1    var2 strat1   strat2 id
#> 1  product_B event_1      F age_cat2  1
#> 2  product_A event_1      M age_cat1  2
#> 3  product_B event_2      M age_cat1  3
#> 4  product_A event_1      M age_cat1  4
#> 5  product_A event_1      F age_cat1  5
#> 6  product_A event_1      F age_cat1  6
#> 7  product_A event_2      F age_cat1  7
#> 8  product_A event_2      F age_cat1  8
#> 9  product_A event_1      M age_cat2  9
#> 10 product_A event_2      M age_cat1 10
#> 11 product_A event_2      M age_cat1 11
#> 12 product_B event_2      M age_cat2 12
#> 13 product_B event_1      M age_cat1 13
#> 14 product_B event_2      M age_cat1 14
#> 15 product_B event_1      M age_cat1 15
#> 16 product_B event_2      F age_cat1 16
#> 17 product_C event_1      M age_cat1 17

Notice that the data now contain stratification variables (column names containing the ‘strat’ substring). We can use these variables to obtain adjusted estimates for the \(EBGM\) scores. Stratification affects \(E\) and \(RR\), but not \(PRR\). Each \(E\) is calculated by summing the expected counts from every stratum. Ideally, each stratum should contain several unique CAERS reports to ensure good estimates of \(E\).

processRaw(data = dat, stratify = TRUE, zeroes = FALSE)
#> stratification variables used: strat1, strat2
#> there were 4 strata:  F-age_cat1, F-age_cat2, M-age_cat1, M-age_cat2
#> Warning in .checkStrata_processRaw(data, max_cats): at least one stratum
#> contains less than 50 unique IDs
#>        var1    var2 N         E   RR  PRR
#> 1 product_A event_1 5 4.3222222 1.16 1.11
#> 2 product_A event_2 4 4.6777778 0.86 0.89
#> 3 product_B event_1 3 4.1222222 0.73 0.71
#> 4 product_B event_2 4 2.8777778 1.39 1.43
#> 5 product_C event_1 1 0.5555556 1.80 2.00

Notice that we use stratify = TRUE to accommodate the new stratification variables. As a result, the calculations for \(E\) and \(RR\) are adjusted, while \(PRR\) is unchanged.
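To make the stratum-by-stratum summation concrete, here is a minimal base R sketch that recomputes the stratified \(E\) and \(RR\) for the toy data. It is an illustration under stated assumptions, not the package's internal code: it assumes one report ID per row and that the expected count within a stratum is the product total times the event total divided by the stratum size. The names `stratum`, `prods`, and `events` are illustrative only.

```r
# Toy stratified data copied from the example above.
dat <- data.frame(
  var1   = paste0("product_", c("B", "A", "B", "A", "A", "A", "A", "A", "A",
                                "A", "A", "B", "B", "B", "B", "B", "C")),
  var2   = paste0("event_", c(1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 1)),
  strat1 = c("F", "M", "M", "M", "F", "F", "F", "F", "M",
             "M", "M", "M", "M", "M", "M", "F", "M"),
  strat2 = paste0("age_cat", c(2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1)),
  id     = 1:17,
  stringsAsFactors = FALSE
)

# One stratum per combination of the stratification variables.
stratum <- paste(dat$strat1, dat$strat2, sep = "-")
prods   <- sort(unique(dat$var1))
events  <- sort(unique(dat$var2))

# Accumulate the expected counts over strata.
E <- matrix(0, nrow = length(prods), ncol = length(events),
            dimnames = list(prods, events))
for (s in unique(stratum)) {
  sub   <- dat[stratum == s, ]
  n_row <- as.numeric(table(factor(sub$var1, levels = prods)))
  n_col <- as.numeric(table(factor(sub$var2, levels = events)))
  # Within-stratum expected counts under independence (assumed formula).
  E <- E + outer(n_row, n_col) / nrow(sub)
}

# Observed counts are not affected by stratification.
N  <- unclass(table(factor(dat$var1, levels = prods),
                    factor(dat$var2, levels = events)))
RR <- N / E

round(E, 7)
round(RR, 2)
```

For product_A and event_1, the four strata contribute \(1.6 + 0 + 2.2222 + 0.5 = 4.3222\), which matches the \(E\) reported by processRaw() with stratify = TRUE. The matrix also includes the product_C/event_2 cell, which the zeroes = FALSE output omits because its \(N\) is zero.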

Finally, in some cases one may wish to calculate the \(E\)s for product-symptom combinations that do not occur in the data. These can be calculated by using the zeroes = TRUE argument in the processRaw() function. Calculating zero counts is typically unnecessary, and doing so can lead to much longer execution times when estimating hyperparameters. For this reason, zero counts are recommended for hyperparameter estimation only when the optimization routines cannot otherwise reach convergence. If zero counts are used, data squashing should typically follow. Even if zero counts are used for hyperparameter estimation, \(EBGM\) scores for zero counts never add value to an analysis. Rows with zero counts should therefore be removed after estimating hyperparameters but before calculating \(EBGM\) and quantile scores.

processRaw(data = dat, stratify = FALSE, zeroes = TRUE)
#>        var1    var2 N         E   RR  PRR
#> 1 product_A event_1 5 4.7647059 1.05 1.11
#> 2 product_A event_2 4 4.2352941 0.94 0.89
#> 3 product_B event_1 3 3.7058824 0.81 0.71
#> 4 product_B event_2 4 3.2941176 1.21 1.43
#> 5 product_C event_1 1 0.5294118 1.89 2.00
#> 6 product_C event_2 0 0.4705882 0.00 0.00
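Removing the zero-count rows afterwards is a plain subsetting step. The sketch below builds a data frame by hand with the values printed above, so that it runs without the package, and then drops the rows where \(N\) is zero. The object names `proc_zeroes` and `proc_no_zeroes` are illustrative only. In a real analysis the first object would come from processRaw() and would be used for hyperparameter estimation before this step.

```r
# Stand-in for the processRaw(..., zeroes = TRUE) result shown above.
proc_zeroes <- data.frame(
  var1 = c("product_A", "product_A", "product_B",
           "product_B", "product_C", "product_C"),
  var2 = c("event_1", "event_2", "event_1", "event_2", "event_1", "event_2"),
  N    = c(5, 4, 3, 4, 1, 0),
  E    = c(4.7647059, 4.2352941, 3.7058824, 3.2941176, 0.5294118, 0.4705882),
  RR   = c(1.05, 0.94, 0.81, 1.21, 1.89, 0.00),
  PRR  = c(1.11, 0.89, 0.71, 1.43, 2.00, 0.00),
  stringsAsFactors = FALSE
)

# Keep only the observed product-symptom pairs before computing
# EBGM and quantile scores.
proc_no_zeroes <- proc_zeroes[proc_zeroes$N != 0, ]
proc_no_zeroes
```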

Next, the Hyperparameter Estimation with openEBGM vignette will demonstrate how to estimate the hyperparameters of the prior distribution.