epifitter
provides functions for the fit of (non-flexible) two-parameter population dynamics models to disease progress curve (DPC) data: exponential, monomolecular, logistic and Gompertz.
The goal of fitting these models to DPCs is to estimate numerically meaningful parameters: y0
that represents primary inoculum and r
, representing the apparent infection rate. Hence, the importance of choosing the model that best describe the epidemics to better understand its proterties and compare epidemics.
Two approaches can be used to obtain the parameters:
Both approaches are available in epifitter. The simplest way to fit these models to single epidemic data is using the fit_lin()
and fit_nlin()
functions. For the latter, the alternative fit_nlin2()
allows estimating a third parameter, the upper asymptote, when maximum disease intensity is not close to 100%. The fit_multi()
function is the most flexible and allows to fit all models to multiple epidemic datasets.
First, we need to load the packages we’ll need for this tutorial.
library(epifitter)
library(ggplot2)
library(dplyr)
library(magrittr)
library(cowplot)
To use epifitter data at least two variables are needed, one representing the time of each assessment of disease intensity during the course of the epidemics and the other representing a disease intensity variable as proportion (e.g. incidence, severity, prevalence). For the case of designed experiments with replicates, a third variable is needed.
Let’s simulate a DPC dataset for one epidemics measured at replicated plots. The simulated data resemble a polyciclic epidemics of sigmoid shape. We can do that using the epifitter sim_logistic()
function of epifitter (more about ?sim_logistic
here).
<- sim_logistic(
dpcL N = 100, # duration of the epidemics in days
y0 = 0.01, # disease intensity at time zero
dt = 10, # interval between assessments
r = 0.1, # apparent infection rate
alpha = 0.2, # level of noise
n = 7 # number of replicates
)
Let’s give a look at the simulated dataset.
head(dpcL)
## replicates time y random_y
## 1 1 0 0.01000000 0.01055540
## 2 1 10 0.02672668 0.02336951
## 3 1 20 0.06946279 0.06498207
## 4 1 30 0.16868385 0.21762860
## 5 1 40 0.35549349 0.35448883
## 6 1 50 0.59989486 0.51236621
The dpc_L
object generated using sim_logistic()
is a dataframe with four columns. The y
variable is a vector for the disease intensity as proportion (0 < y < 1). To facilitate visualization, let’s make a plot using the ggplot
function of the ggplot2 package.
ggplot(
dpcL,aes(time, y,
group = replicates
)+
) geom_point(aes(time, random_y), shape = 1) + # plot the replicate values
geom_point(color = "steelblue", size = 2) +
geom_line(color = "steelblue") +
labs(
title = "Simulated 'complete' epidemics of sigmoid shape",
subtitle = "Produced using sim_logistic()"
+
)theme_minimal_hgrid()
fit_lin()
The fit_lin()
requires at least the time
and y
arguments. In the example, we will call the random_y
which represents the replicates. A quick way to call these variables attached to the dataframe is shown below.
<- fit_lin(
f_lin time = dpcL$time,
y = dpcL$random_y
)
fit_lin()
outputs a list object which contains several elements. Three elements of the list are shown by default: stats of model fit, Infection rate and Initial Inoculum
f_lin
## Results of fitting population models
##
## Stats:
## CCC r_squared RSE
## Logistic 0.9982 0.9964 0.1905
## Gompertz 0.9781 0.9571 0.4786
## Monomolecular 0.9349 0.8777 0.6504
## Exponential 0.9095 0.8340 0.6413
##
## Infection rate:
## Estimate Std.error Lower Upper
## Logistic 0.09924462 0.0006863749 0.09787729 0.09787729
## Gompertz 0.07053209 0.0017248354 0.06709605 0.06709605
## Monomolecular 0.05438023 0.0023440391 0.04971066 0.04971066
## Exponential 0.04486439 0.0023110106 0.04026062 0.04026062
##
## Initial inoculum:
## Estimate Linearized lin.SE Lower Upper
## Logistic 1.020060e-02 -4.575056 0.04060648 9.415405e-03 0.0110505437
## Gompertz 3.796373e-05 -2.320315 0.10204264 3.827309e-06 0.0002469124
## Monomolecular -1.738403e+00 -1.007375 0.13867523 -2.609720e+00 -1.0774053691
## Exponential 2.822122e-02 -3.567681 0.13672123 2.149266e-02 0.0370562487
The Stats
element of the list shows how each of the four models predicted the observations based on three measures:
CCC
(Lin 2000), a measure of agreement that takes both bias and precision into accountr_squared
(R2), a measure of precisionRSE
for each model.The four models are sorted from the high to the low CCC
. As expected because the sim_logistic
function was used to create the synthetic epidemic data, the the Logistic model was superior to the others.
The estimates, and respective standard error and upper and lower 95% confidence interval, for the two coefficients of interest are shown in the Infection rate
and Initial inoculum
elements. For the latter, both the back-transformed (estimate) and the linearized estimate are shown.
The element f_lin$stats_all
provides a wide format dataframe with all the stats for each model.
$stats_all f_lin
## # A tibble: 4 x 14
## best_model model r r_se r_ci_lwr r_ci_upr v0 v0_se r_squared RSE
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Logi~ 0.0992 6.86e-4 0.0979 0.101 -4.58 0.0406 0.996 0.190
## 2 2 Gomp~ 0.0705 1.72e-3 0.0671 0.0740 -2.32 0.102 0.957 0.479
## 3 3 Mono~ 0.0544 2.34e-3 0.0497 0.0590 -1.01 0.139 0.878 0.650
## 4 4 Expo~ 0.0449 2.31e-3 0.0403 0.0495 -3.57 0.137 0.834 0.641
## # ... with 4 more variables: CCC <dbl>, y0 <dbl>, y0_ci_lwr <dbl>,
## # y0_ci_upr <dbl>
The predicted values are stored as a dataframe in the data
element called using the same $
operator as above. Both the observed and (y
) and the back-transformed predictions (predicted
) are shown for each model. The linearized value and the residual are also shown.
head(f_lin$data)
## time y model linearized predicted residual
## 1 0 0.01055540 Exponential -4.55111775 2.822122e-02 -0.0176658249
## 2 0 0.01055540 Monomolecular 0.01061150 -1.738403e+00 1.7489584241
## 3 0 0.01055540 Logistic -4.54050625 1.020060e-02 0.0003548005
## 4 0 0.01055540 Gompertz -1.51537286 3.796373e-05 0.0105174358
## 5 10 0.02336951 Exponential -3.75632317 4.419971e-02 -0.0208302009
## 6 10 0.02336951 Monomolecular 0.02364691 -5.897434e-01 0.6131129216
The plot_fit()
produces, by default, a panel of plots depicting the observed and predicted values by all fitted models. The arguments pont_size
and line_size
that control for the size of the dots for the observation and the size of the fitted line, respectively.
<- plot_fit(f_lin,
plot_lin point_size = 2,
line_size = 1
)
# Default plots
plot_lin
The plots are ggplot2 objects which can be easily customized by adding new layers that override plot paramaters as shown below. The argument models
allows to select the models(s) to be shown on the plot. The next plot was customized for the logistic model.
# Customized plots
plot_fit(f_lin,
point_size = 2,
line_size = 1,
models = "Logistic")+
theme_minimal_hgrid(font_size =18) +
scale_x_continuous(limits = c(0,100))+
scale_color_grey()+
theme(legend.position = "none")+
labs(
x = "Time",
y = "Proportion disease "
)
The fit_nlin()
function uses the Levenberg-Marquardt algorithm for least-squares estimation of nonlinear parameters. In addition to time and disease intensity, starting values for y0
and r
should be given in the starting_par
argument. The output format and interpretation is analogous to the fit_lin()
.
NOTE: If you encounter error messages saying “matrix at initial parameter estimates”, try to modify the starting values for the parameters to solve the problem.
<- fit_nlin(
f_nlin time = dpcL$time,
y = dpcL$random_y,
starting_par = list(y0 = 0.01, r = 0.03)
)
f_nlin
## Results of fitting population models
##
## Stats:
## CCC r_squared RSE
## Logistic 0.9978 0.9956 0.0267
## Gompertz 0.9961 0.9932 0.0355
## Monomolecular 0.9173 0.8706 0.1512
## Exponential 0.8892 0.8278 0.1726
##
## Infection rate:
## Estimate Std.error Lower Upper
## Logistic 0.09720347 0.002091825 0.09303634 0.09303634
## Gompertz 0.06796614 0.001882452 0.06421611 0.06421611
## Monomolecular 0.02272553 0.001468476 0.01980018 0.01980018
## Exponential 0.01895276 0.001371656 0.01622028 0.01622028
##
## Initial inoculum:
## Estimate Std.error Lower Upper
## Logistic 1.143936e-02 1.119834e-03 9.208533e-03 1.367018e-02
## Gompertz 6.739623e-07 8.038978e-07 -9.274841e-07 2.275409e-06
## Monomolecular -1.901195e-01 4.523036e-02 -2.802230e-01 -1.000160e-01
## Exponential 1.819158e-01 2.072663e-02 1.406262e-01 2.232054e-01
We can check the results using plot_fit
.
plot_fit(f_nlin) +
theme_minimal_hgrid()#changing plot theme
K
(maximum disease)In many epidemics the last measure (final time) of a DPC does not reach the maximum intensity and, for this reason, estimation of maximum asymptote (carrying capacity K
) may be necessary. The fin_lin2()
provides an estimation of K
in addition to the estimates provided by fit_lin()
.
Before demonstrating the function, we can transform our simulated data by creating another variable with y_random2
with maximum about 0.8 (80%). Simplest way is to multiply the y_random
by 0.8.
= dpcL %>%
dpcL2 mutate(random_y = random_y * 0.8)
Then we run the fit_nlin2()
for the new dataset.
<- fit_nlin2(
f_nlin2 time = dpcL2$time,
y = dpcL2$random_y,
starting_par = list(y0 = 0.01, r = 0.2, K = 0.6)
) f_nlin2
## Results of fitting population models
##
## Stats:
## CCC r_squared RSE
## Logistic 0.9978 0.9956 0.0215
## Gompertz 0.9723 0.9721 0.0735
## Monomolecular 0.9491 0.9095 0.0987
##
## Infection rate:
## Estimate Std.error Lower Upper
## Logistic 0.09746631 0.002578143 0.09232925 0.09232925
## Gompertz 0.10291939 0.012676727 0.07766046 0.07766046
## Monomolecular 0.01619790 0.003010469 0.01019941 0.01019941
##
## Initial inoculum:
## Estimate Std.error Lower Upper
## Logistic 9.064582e-03 1.020089e-03 7.032011e-03 1.109715e-02
## Gompertz 2.436658e-20 5.278079e-19 -1.027314e-18 1.076047e-18
## Monomolecular -1.435290e-01 3.143843e-02 -2.061715e-01 -8.088661e-02
##
## Maximum disease intensity:
## Estimate Std.error Lower Upper
## Logistic 0.7990373 0.005392229 0.7882930 0.8097815
## Gompertz 0.6783335 0.015629770 0.6471905 0.7094765
## Monomolecular 1.0000000 0.105065932 0.7906516 1.2093484
plot_fit(f_nlin2)
NOTE: The exponential model is not included because it doesn’t have a maximum asymptote. The estimated value of
K
is the expected 0.8.
Most commonly, there are more than one epidemics to analyse either from observational or experimental studies. When the goal is to fit a common model to all curves, the fit_multi()
function is in hand. Each DPC needs an unique identified to further combined in a single data frame.
Let’s use the sim_
family of functions to create three epidemics and store the data in a single data.frame
. The Gompertz model was used to simulate these data. Note that we allowed to the y0
and r
parameter to differ the DPCs. We should combine the three DPCs using the bind_rows()
function and name the identifier (.id
), automatically created as a character vector, for each epidemics as ‘DPC’.
<- sim_gompertz(N = 60, y0 = 0.001, dt = 5, r = 0.1, alpha = 0.4, n = 4)
epi1 <- sim_gompertz(N = 60, y0 = 0.001, dt = 5, r = 0.12, alpha = 0.4, n = 4)
epi2 <- sim_gompertz(N = 60, y0 = 0.003, dt = 5, r = 0.14, alpha = 0.4, n = 4)
epi3
<- bind_rows(epi1,
multi_epidemic
epi2,
epi3,.id = "DPC"
)head(multi_epidemic)
## DPC replicates time y random_y
## 1 1 1 0 0.00100000 0.001271050
## 2 1 1 5 0.01515505 0.006530227
## 3 1 1 10 0.07878459 0.082728939
## 4 1 1 15 0.21411521 0.146602867
## 5 1 1 20 0.39266393 0.581324792
## 6 1 1 25 0.56723412 0.680656350
We can visualize the three DPCs in a same plot
<- ggplot(multi_epidemic,
p_multi aes(time, y, shape = DPC, group = DPC))+
geom_point(size =2)+
geom_line()+
theme_minimal_grid(font_size =18) +
labs(
x = "Time",
y = "Proportion disease "
) p_multi
Or use facet_wrap()
for ploting them separately.
+
p_multi facet_wrap(~ DPC, ncol = 1)
fit_multi()
fit_multi()
requires at least four arguments: time, disease intensity (as proportion), data and the curve identifier (strata_cols
). The latter argument accepts one or more strata include as c("strata1",strata2")
. In the example below, the stratum name is DPC
, the name of the variable.
By default, the linear regression is fitted to data but adding another argument nlin = T
, the non linear regressions is fitted instead.
<- fit_multi(
multi_fit time_col = "time",
intensity_col = "random_y",
data = multi_epidemic,
strata_cols = "DPC"
)
All parameters of the list can be returned using the $ operator as below.
head(multi_fit$Parameters)
## DPC best_model model r r_se r_ci_lwr r_ci_upr
## 1 1 1 Gompertz 0.09821178 0.002195383 0.09380222 0.10262133
## 2 1 2 Monomolecular 0.06948313 0.002684049 0.06409205 0.07487420
## 3 1 3 Logistic 0.15982894 0.007966187 0.14382838 0.17582949
## 4 1 4 Exponential 0.09034581 0.009484222 0.07129619 0.10939543
## 5 2 1 Gompertz 0.11852166 0.002360533 0.11378039 0.12326293
## 6 2 2 Monomolecular 0.09167920 0.002854558 0.08594565 0.09741274
## v0 v0_se r_squared RSE CCC y0
## 1 -1.9169085 0.07761853 0.9756249 0.2961734 0.9876621 0.001113882
## 2 -0.5664232 0.09489548 0.9305709 0.3620981 0.9640370 -0.761953549
## 3 -4.6898578 0.28164724 0.8895127 1.0746974 0.9415260 0.009104342
## 4 -4.1234347 0.33531788 0.6447423 1.2794915 0.7840040 0.016188816
## 5 -1.8815799 0.08345745 0.9805524 0.3184533 0.9901807 0.001410421
## 6 -0.6581823 0.10092387 0.9537674 0.3851009 0.9763367 -0.931278644
## y0_ci_lwr y0_ci_upr
## 1 0.0003536786 0.002972673
## 2 -1.1319271539 -0.456184984
## 3 0.0051913173 0.015919660
## 4 0.0082549650 0.031747894
## 5 0.0004257066 0.003884162
## 6 -1.3652739486 -0.576915522
Similarly, all data can be returned.
head(multi_fit$Data)
## DPC time y model linearized predicted residual
## 1 1 0 0.001271050 Exponential -6.667911942 0.016188816 -0.0149177657
## 2 1 0 0.001271050 Monomolecular 0.001271858 -0.761953549 0.7632245990
## 3 1 0 0.001271050 Logistic -6.666640083 0.009104342 -0.0078332917
## 4 1 0 0.001271050 Gompertz -1.897306759 0.001113882 0.0001571676
## 5 1 5 0.006530227 Exponential -5.031313542 0.025433054 -0.0189028266
## 6 1 5 0.006530227 Monomolecular 0.006551642 -0.244840653 0.2513708799
If nonlinear regression is preferred, the nlim
argument should be set to TRUE
<- fit_multi(
multi_fit2 time_col = "time",
intensity_col = "random_y",
data = multi_epidemic,
strata_cols = "DPC",
nlin = TRUE)
## Warning in log(y0/1): NaNs produzidos
## Warning in log(y0/1): NaNs produzidos
## Warning in log(y0/1): NaNs produzidos
## Warning in log(y0/1): NaNs produzidos
head(multi_fit2$Parameters)
## DPC model y0 y0_se r r_se df
## 1 1 Gompertz 0.002280391 0.001681920 0.09626016 0.005251680 50
## 2 1 Logistic 0.039144718 0.008600833 0.13602253 0.009151714 50
## 3 1 Monomolecular -0.176213275 0.044607831 0.04294814 0.002682913 50
## 4 1 Exponential 0.233015208 0.029801544 0.02709389 0.002658825 50
## 5 2 Gompertz 0.001446817 0.001145185 0.12045152 0.006359237 50
## 6 2 Logistic 0.034188983 0.007488674 0.17166441 0.011026012 50
## CCC r_squared RSE y0_ci_lwr y0_ci_upr r_ci_lwr
## 1 0.9861538 0.9728696 0.06273872 -0.0010978443 0.005658626 0.08571185
## 2 0.9802648 0.9641285 0.07447002 0.0218694370 0.056419999 0.11764077
## 3 0.9485483 0.9137056 0.11554402 -0.2658107401 -0.086615810 0.03755935
## 4 0.8575975 0.7791407 0.18274395 0.1731570441 0.292873372 0.02175348
## 5 0.9898959 0.9801426 0.05459555 -0.0008533554 0.003746989 0.10767861
## 6 0.9861310 0.9745902 0.06363181 0.0191475399 0.049230427 0.14951801
## r_ci_upr best_model
## 1 0.10680847 1
## 2 0.15440429 2
## 3 0.04833693 3
## 4 0.03243430 4
## 5 0.13322442 1
## 6 0.19381080 2
If you want to estimate K
, set nlin = TRUE
and estimate_K = TRUE
.
NOTE: If you do not set both arguments
TRUE
,K
will not be estimated, becausenlin
defaut isFALSE
. Also remember that when estimating K, we don’t fit the Exponential model.
<- fit_multi(
multi_fit_K time_col = "time",
intensity_col = "random_y",
data = multi_epidemic,
strata_cols = "DPC",
nlin = T,
estimate_K = T
)
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
## Warning in log(y0/K): NaNs produzidos
head(multi_fit_K$Parameters)
## DPC model y0 y0_se r r_se K
## 1 1 Gompertz 0.002280415 0.002107445 0.09626007 0.008097060 1
## 2 1 Logistic 0.039144032 0.009818978 0.13602314 0.011629011 1
## 3 1 Monomolecular -0.176212702 0.049324231 0.04294809 0.006965794 1
## 4 2 Gompertz 0.001446733 0.001328401 0.12045191 0.008414188 1
## 5 2 Logistic 0.034187990 0.008132438 0.17166556 0.012741923 1
## 6 2 Monomolecular -0.172163373 0.049285646 0.05279902 0.007127714 1
## K_se df CCC r_squared RSE y0_ci_lwr y0_ci_upr
## 1 0.02357226 49 0.9861538 0.9728695 0.06337568 -0.001954653 0.006515484
## 2 0.02316135 49 0.9802649 0.9641283 0.07522608 0.019412057 0.058876008
## 3 0.06896096 49 0.9485483 0.9137058 0.11671708 -0.275333455 -0.077091949
## 4 0.01519960 49 0.9898959 0.9801426 0.05514984 -0.001222788 0.004116254
## 5 0.01540082 49 0.9861310 0.9745901 0.06427783 0.017845244 0.050530736
## 6 0.04913822 49 0.9532828 0.9225229 0.11357507 -0.271206587 -0.073120159
## r_ci_lwr r_ci_upr K_ci_lwr K_ci_upr best_model
## 1 0.07998842 0.11253172 0.9526298 1.047370 1
## 2 0.11265377 0.15939252 0.9534555 1.046544 2
## 3 0.02894980 0.05694637 0.8614178 1.138582 3
## 4 0.10354297 0.13736086 0.9694553 1.030545 1
## 5 0.14605970 0.19727141 0.9690509 1.030949 2
## 6 0.03847535 0.06712270 0.9012531 1.098747 3
Use ggplot2
to produce elegant data visualizations of models curves and the estimated parameters.
The original data and the predicted values by each model are stored in multi_fit$Data
. A nice plot can be produced as follows:
$Data %>%
multi_fitggplot(aes(time, predicted, color = DPC)) +
geom_point(aes(time, y), color = "gray") +
geom_line(size = 1) +
facet_grid(DPC ~ model, scales = "free_y") +
theme_minimal_hgrid()+
coord_cartesian(ylim = c(0, 1))
Using the dplyr function filter
only the model of interest can be chosen for plotting.
$Data %>%
multi_fitfilter(model == "Gompertz") %>%
ggplot(aes(time, predicted, color = DPC)) +
geom_point(aes(time, y),
color = "gray",
size = 2
+
) geom_line(size = 1.2) +
theme_minimal_hgrid() +
labs(
x = "Time",
y = "Disease Intensity"
)
The multi_fit$Parameters
element is where all stats and parameters as stored. Let’s plot the estimates of the apparent infection rate.
$Parameters %>%
multi_fitfilter(model == "Gompertz") %>%
ggplot(aes(DPC, r)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = r_ci_lwr, ymax = r_ci_upr),
width = 0,
size = 1
+
) labs(
x = "Time",
y = "Apparent infection rate"
+
) theme_minimal_hgrid()
Lin L (2000). A note on the concordance correlation coefficient. Biometrics 56: 324 - 325.