Estimating sample size from the false discovery rate

In this vignette we describe how the phylosamp package can be used in reverse, to calculate the necessary sample size to achieve a pre-specified false discovery rate. For example, this calculation would be useful if a researcher is conducting a study for which specimens have yet to be collected, but they want to ensure at least 75% confidence in identified links. To calculate the proportion of links that represent true transmission pairs, the user needs to provide the sensitivity and specificity of the linkage criteria used, the final size of the outbreak, and the desired minimum true discovery rate (\(1-\text{False Discovery Rate}\)):

       \(\eta\) (eta): the sensitivity of the linkage criteria for identifying transmission links
       \(\chi\) (chi): the specificity of the linkage criteria for identifying transmission links
       N: the final size of the outbreak (total number of infections)
       \(R\): the average reproductive number (also denoted \(R_\text{pop}\))

\(\phi\) (phi): the desired minimum true discovery rate

Imagine we are interested in collecting enough samples from an outbreak of 100 cases to ensure that the identified links are correct at least 75% of the time. The phylogenetic criteria we are using to identify links has a sensitivity of 99% and a specificity of 99.5%:

library(phylosamp)

samplesize(eta=0.99, chi=0.995, N=100, R=1, phi=0.75)

Although 10 samples will ensure a false discovery rate of no more than 25%, 10 samples may not provide enough data for analysis. In this case, we can use the optional min_pairs parameter to require that the expected number of links (calculated using the exp_links function) is at least some minimum value:

samplesize(eta=0.99, chi=0.995, N=100, R=1, phi=0.75, min_pairs=30)

In another example, it may be crucial that links are identified with high certainty. So we increase the minimum true discovery rate to 95%:

samplesize(eta=0.99, chi=0.995, N=100, R=1, phi=0.95)

This result suggests it is not possible to obtain such high certainty with this linkage criteria. We can confirm this with the truediscoveryrate function, which shows that even with 100% sampling of this outbreak, we will correctly identify transmission links no more than 80% of the time.

truediscoveryrate(eta=0.99, chi=0.995, rho=1, M=100, R=1)

Further exploration of the samplesize function reveals that there are limited combinations of sensitivity and specificity that produce reasonable sample sizes, and that specificity in particular affects the minimum false discovery rate that can be obtained. Therefore, correctly estimating sensitivity and specificity are of key importance when using these functions to understand transmission.