this vignette, we apply the functions provided to estimate sensitivity and specificity and apply them to an Ebola-like pathogen. We compare the results to those of a pathogen with a slower mutation rate and higher reproductive number. Then, we will use these results to calculate the false discovery rate of our linkage criteria (genetic distance) given a sample size of 200 and a relevant population of 1000 infected individuals.
In the first part of this vignette, we use the function sens_spec_roc()
, which formats the results of sens_spec_calc()
for plotting ROC curves.
library(phylosamp)
library(ggplot2)
library(reshape2)
library(RColorBrewer)
library(cowplot)
# ebola-like pathogen
<- 1.5
R <- 1
mut_rate
# use simulated generation distributions
data("gen_dist_sim")
<- as.numeric(gen_dist_sim[gen_dist_sim$R == R, -(1:2)])
mean_gens_pdf
# get theoretical genetic distance dist based on mutation rate and generation parameters
<- as.data.frame(gen_dists(mut_rate = mut_rate,
dists mean_gens_pdf = mean_gens_pdf,
max_link_gens = 1))
# reshape dataframe for plotting
<- reshape2::melt(dists,
dists id.vars = "dist",
variable.name = "status",
value.name = "prob")
# get sensitivity and specificity using the same paramters
<- sens_spec_roc(cutoff = 1:(max(dists$dist)-1),
roc_calc mut_rate = mut_rate,
mean_gens_pdf = mean_gens_pdf)
# get the optimal value for the ROC plot
<- get_optim_roc(roc_calc)
roc_optim <- roc_optim$cutoff
threshold <- c(roc_optim$specificity, roc_optim$sensitivity)
optim_point
# plot the distributions of genetic distance
<- RColorBrewer::brewer.pal(5, "PuOr")
pal
<- ggplot(dists, aes(x=dist, y=prob, fill=status)) +
p1 geom_bar(alpha=0.5, stat="identity", position="identity") +
scale_fill_manual(name="", values=c(pal[5], pal[2])) +
scale_x_continuous(limits = c(-1,35), expand = c(0,0)) +
scale_y_continuous(limits = c(0,0.4), expand = c(0,0)) +
xlab('Genetic distance') + ylab('Density') +
theme_classic() + theme(legend.position='none') +
geom_vline(xintercept=threshold, linetype=2, size=0.5, color = "black")
# add ROC curve on top
<- ggplot(data=roc_calc, aes(x=specificity, y=sensitivity)) +
r1 geom_line(size=1.5) + xlim(0,1) + ylim(0,1) +
geom_point(x=optim_point[1], y=optim_point[2],
size = 3, stroke=1, shape=21, fill = "chartreuse") +
geom_abline(slope=1, intercept = 0, linetype=2, alpha=0.5) +
xlab("1-specificity") +
theme_bw()
::ggdraw(p1) + cowplot::draw_plot(r1, hjust=-0.16, vjust=-0.16, scale=0.6) cowplot
Using the closest-to-corner method, the optimal genetic distance threshold for this pathogen is:
threshold
## [1] 5
And the sensitivity and specificity at this threshold is:
$sensitivity ## sensitivity roc_optim
## [1] 0.9963402
1 - roc_optim$specificity ## specificity
## [1] 0.9801562
We repeat the above plot for a pathogen with a higher reproductive number and lower mutation rate:
# faster-spreading pathogen
<- 3
R <- 0.2
mut_rate
<- gen_dist_sim[gen_dist_sim["R"]==R][3:dim(gen_dist_sim)[2]] mean_gens_pdf
For this pathogen, the optimal genetic distance threshold is:
threshold
## [1] 1
And the sensitivity and specificity at this threshold is:
$cutoff==threshold,]$sensitivity ## sensitivity roc_optim[roc_optim
## [1] 0.8187308
1 - roc_optim[roc_optim$cutoff==threshold,]$specificity ## specificity
## [1] 0.8539157
It is well understood that genetic distance is a much more accurate predictor of transmission linkage for pathogens with higher mutation rates and slower spread, as shown clearly in this example (lower sensitivity and specificity in the second example). Genetic distance is clearly not an appropriate metric for all pathogens, but more sophisticated linkage criteria – espeically those that incorporate epidemiological information as well – may produce better results.
This is reflected in our confidence in identifying linked infection pairs. Assuming a sample size of 200, a relevant population of 1000, and the genetic distance threshold and associated sensitivity and specificity for each pathogen we get:
truediscoveryrate(eta=0.9963402, chi=0.9801562, rho=0.20, M=200, R=1) # ebola-like pathogen
## Calculating true discovery rate assuming multiple-transmission and multiple-linkage
## [1] 0.09183886
truediscoveryrate(eta=0.8187308, chi=0.8539157, rho=0.20, M=200, R=1) # slower-mutating pathogen
## Calculating true discovery rate assuming multiple-transmission and multiple-linkage
## [1] 0.01116204
As a final note, we remind the reader that the specificity of the linkage criteria has a significant bearing on the true/false discovery rate, and that high true discovery rates require a specificity upwards of 0.999. Linkage criteria more complex than genetic distance may be needed to achieve this value.