The Zero-Inflated Poisson model is a model for count data with excess zeros. The response distribution is a mixture of a point mass at zero and a Poisson distribution: if \(Z\) is Bernoulli with probability \(1-p_0\) and \(P\) is Poisson with mean \(\lambda\), then \(Y=Z+(1-Z)P\) is zero-inflated Poisson. The ZIP is a latent-class model; we can have \(Y=0\) either because \(Z=0\) ('structural' zeroes) or because \(P=0\). That’s natural in some ecological examples: if you didn’t see any salmon it could be because the area is salmon-free (it’s Eden Park) or because you just randomly didn’t see any. To turn this into a regression model we typically put a logistic regression structure on \(Z\) and a Poisson regression structure on \(P\).
There isn’t (as far as I know) existing software in R for design-based inference in zero-inflated Poisson models, so it’s a good example for the benefits of svyVGAM
. The pscl
package (Zeileis et al) fits zero-inflated models, and so does VGAM
, so we can compare the model fitted with svyVGAM
to both of those and to other work-arounds.
I’ll do an example with data on number of sexual partners, from NHANES 2003-2004. We will look at questions SXQ200
and SXQ100
: number of male sexual partners. Combining these gives a ‘real’ zero-inflated variable: based on sexual orientation the zeroes would divide into 'never' and 'not yet'.
Here's how I created the dataset, from two NHANES files. It's data(nhanes_sxq)
in the package
library(foreign)
setwd("~/nhanes")
demo = read.xport("demo_c.xpt")
sxq = read.xport("sxq_c.xpt")
merged = merge(demo, sxq, by='SEQN')
merged$total = with(merged, ifelse(RIAGENDR==2, SXQ100+SXQ130, SXQ170+SXQ200))
merged$total[merged$SXQ020==2] = 0
merged$total[merged$total>2000] = NA
merged$age = merged$RIDAGEYR/25
merged$malepartners=with(merged, ifelse(RIAGENDR==2,SXQ100,SXQ200))
merged$malepartners[merged$malepartners>200]=NA
nhanes_sxq<-merged[, c("SDMVPSU","SDMVSTRA","WTINT2YR","RIDAGEYR","RIDRETH1","DMDEDUC","malepartners")]
Start off by loading the packages and setting up a survey design
library(svyVGAM)
## Loading required package: VGAM
## Loading required package: stats4
## Loading required package: splines
## Loading required package: survey
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:VGAM':
##
## calibrate
## The following object is masked from 'package:graphics':
##
## dotchart
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
data(nhanes_sxq)
des = svydesign(id=~SDMVPSU,strat=~SDMVSTRA,weights=~WTINT2YR, nest=TRUE, data=nhanes_sxq)
First, we'll fit the model just ignoring the survey design, using both pscl::zeroinfl
and VGAM::vglm
. These models use the same variables in a logistic regression for \(Z\) and a Poisson regression for \(P\). In VGAM
you would make the models different by constraining coefficients to be zero in one of the models; in pscl
you would specify different models before and after the |
.
unwt = zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq)
summary(unwt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.0204 -0.9433 -0.7871 0.1503 29.2567
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.6666228 0.0506660 32.894 < 2e-16 ***
## RIDAGEYR -0.0055102 0.0008969 -6.143 8.08e-10 ***
## factor(RIDRETH1)2 -0.0394019 0.0779480 -0.505 0.613
## factor(RIDRETH1)3 0.6508821 0.0345734 18.826 < 2e-16 ***
## factor(RIDRETH1)4 0.6675311 0.0365963 18.240 < 2e-16 ***
## factor(RIDRETH1)5 0.5642954 0.0594928 9.485 < 2e-16 ***
## DMDEDUC 0.0094256 0.0135180 0.697 0.486
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.188125 0.187079 1.006 0.31461
## RIDAGEYR -0.002938 0.003629 -0.810 0.41823
## factor(RIDRETH1)2 -0.079636 0.242311 -0.329 0.74242
## factor(RIDRETH1)3 0.118369 0.116120 1.019 0.30803
## factor(RIDRETH1)4 0.143300 0.127764 1.122 0.26203
## factor(RIDRETH1)5 0.259515 0.223032 1.164 0.24460
## DMDEDUC -0.148881 0.053337 -2.791 0.00525 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -9518 on 14 Df
vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef")
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.188125349 1.666622759 -0.002937819 -0.005510160
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.079635992 -0.039401949 0.118369301 0.650882145
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## 0.143300364 0.667531080 0.259515415 0.564295398
## DMDEDUC:1 DMDEDUC:2
## -0.148881313 0.009425589
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -9517.556
A traditional work-around for regression models is to rescale the weights to sum to the sample size and then pretend they are precision weights or frequency weights.
nhanes_sxq$scaledwt<-nhanes_sxq$WTINT2YR/mean(nhanes_sxq$WTINT2YR)
wt= zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, data=nhanes_sxq, weights=scaledwt)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
summary(wt)
##
## Call:
## zeroinfl(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC |
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, data = nhanes_sxq, weights = scaledwt)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5864 -0.8418 -0.5430 0.1324 31.9106
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8340681 0.0614994 29.823 < 2e-16 ***
## RIDAGEYR -0.0073881 0.0009059 -8.155 3.49e-16 ***
## factor(RIDRETH1)2 -0.1073312 0.0853535 -1.257 0.2086
## factor(RIDRETH1)3 0.6551367 0.0481679 13.601 < 2e-16 ***
## factor(RIDRETH1)4 0.6358148 0.0529173 12.015 < 2e-16 ***
## factor(RIDRETH1)5 0.4774124 0.0666607 7.162 7.96e-13 ***
## DMDEDUC -0.0237389 0.0143070 -1.659 0.0971 .
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.660504 0.214458 3.080 0.002071 **
## RIDAGEYR -0.007833 0.003673 -2.133 0.032959 *
## factor(RIDRETH1)2 -0.116789 0.252451 -0.463 0.643636
## factor(RIDRETH1)3 0.101971 0.151531 0.673 0.500987
## factor(RIDRETH1)4 -0.160804 0.181429 -0.886 0.375444
## factor(RIDRETH1)5 0.106779 0.230339 0.464 0.642954
## DMDEDUC -0.202397 0.057624 -3.512 0.000444 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 18
## Log-likelihood: -9766 on 14 Df
wtv= vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=scaledwt)
summary(wtv)
##
## Call:
## vglm(formula = malepartners ~ RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
## family = zipoisson(), data = nhanes_sxq, weights = scaledwt,
## crit = "coef")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logitlink(pstr0) -1.828 -0.9335 -0.365675 0.8834927 1.852
## loglink(lambda) -5.851 -1.2771 -0.002727 -0.0003665 65.774
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.6605042 0.2144354 3.080 0.002069 **
## (Intercept):2 1.8340681 0.0614568 29.843 < 2e-16 ***
## RIDAGEYR:1 -0.0078333 0.0036726 -2.133 0.032934 *
## RIDAGEYR:2 -0.0073881 0.0008995 -8.214 < 2e-16 ***
## factor(RIDRETH1)2:1 -0.1167894 0.2527278 -0.462 0.643999
## factor(RIDRETH1)2:2 -0.1073312 0.0847641 -1.266 0.205430
## factor(RIDRETH1)3:1 0.1019712 0.1515002 0.673 0.500899
## factor(RIDRETH1)3:2 0.6551367 0.0481359 13.610 < 2e-16 ***
## factor(RIDRETH1)4:1 -0.1608040 0.1814098 -0.886 0.375395
## factor(RIDRETH1)4:2 0.6358147 0.0529738 12.002 < 2e-16 ***
## factor(RIDRETH1)5:1 0.1067789 0.2303235 0.464 0.642931
## factor(RIDRETH1)5:2 0.4774124 0.0663590 7.194 6.27e-13 ***
## DMDEDUC:1 -0.2023967 0.0576221 -3.512 0.000444 ***
## DMDEDUC:2 -0.0237389 0.0146964 -1.615 0.106249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(pstr0), loglink(lambda)
##
## Log-likelihood: -9765.52 on 5036 degrees of freedom
##
## Number of Fisher scoring iterations: 8
##
## No Hauck-Donner effect found in any of the estimates
## repwts
repdes = as.svrepdesign(des,type="Fay",fay.rho=0.2)
rep1 = withReplicates(repdes, quote(
coef(zeroinfl(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC|RIDAGEYR+factor(RIDRETH1)+DMDEDUC, weights=.weights))
))
rep1
## theta SE
## count_(Intercept) 1.8335175 0.1350
## count_RIDAGEYR -0.0073709 0.0028
## count_factor(RIDRETH1)2 -0.1071380 0.2471
## count_factor(RIDRETH1)3 0.6552029 0.1858
## count_factor(RIDRETH1)4 0.6361156 0.1438
## count_factor(RIDRETH1)5 0.4769791 0.2501
## count_DMDEDUC -0.0238310 0.0797
## zero_(Intercept) 0.6606269 0.2582
## zero_RIDAGEYR -0.0078221 0.0039
## zero_factor(RIDRETH1)2 -0.1156275 0.2854
## zero_factor(RIDRETH1)3 0.1015741 0.0984
## zero_factor(RIDRETH1)4 -0.1620363 0.0859
## zero_factor(RIDRETH1)5 0.1065392 0.2120
## zero_DMDEDUC -0.2025776 0.0586
repv = withReplicates(repdes, quote(
coef(vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), data = nhanes_sxq, crit = "coef",weights=.weights))
))
repv
## theta SE
## (Intercept):1 0.6605042 0.2582
## (Intercept):2 1.8340681 0.1350
## RIDAGEYR:1 -0.0078333 0.0039
## RIDAGEYR:2 -0.0073881 0.0028
## factor(RIDRETH1)2:1 -0.1167894 0.2854
## factor(RIDRETH1)2:2 -0.1073312 0.2471
## factor(RIDRETH1)3:1 0.1019712 0.0983
## factor(RIDRETH1)3:2 0.6551367 0.1857
## factor(RIDRETH1)4:1 -0.1608040 0.0859
## factor(RIDRETH1)4:2 0.6358147 0.1438
## factor(RIDRETH1)5:1 0.1067789 0.2120
## factor(RIDRETH1)5:2 0.4774124 0.2501
## DMDEDUC:1 -0.2023967 0.0586
## DMDEDUC:2 -0.0237389 0.0797
Another way to fit the model using just the survey
package is with svymle
. This takes the log-likelihood and its derivative as arguments, and adds linear predictors to some or all of those arguments. That is, we specify the log-likelihood in terms of the Bernoulli parameter \(p_0\) and the Poisson mean \(\lambda\) -- actually \(\mathrm{logit} p_0\) and \(\eta=\log\lambda\), and also give the derivative with respect to these two parameters. The software does the necessary additional work to put linear predictors on the parameters and give us the zero-inflated model. In fact, svymle
is very similar in underlying approach to vglm
; the difference is that vglm
comes with a large collection of predefined models.
In defining the loglikelihood I'm going to take advantage of the Poisson pmf being available in R. Let's call it \(\digamma(y,\lambda)\). The loglikelihood is \[\ell(y; \mu,p_0)=\log\left(p_0\{y==0\}+(1-p)\digamma(y,\mu)\right)\] only we want it in terms of \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)
loglike = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
log(p*(y==0)+(1-p)*dpois(y,mu))
}
We also need the derivatives with respect to \(\mathrm{logit} p_0\) and \(\eta=\log \lambda\)
dlogitp = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
dexpit = p/(1+p)^2
num = dexpit*(y==0)-dexpit*dpois(y,mu)
denom = p*(y==0)+(1-p)*dpois(y,mu)
num/denom
}
deta = function(y,eta,logitp){
mu = exp(eta)
p = exp(logitp)/(1+exp(logitp))
dmutoy = 0*y
dmutoy[y>0] = exp(-mu[y>0])*mu[y>0]^(y[y>0]-1)/factorial(y[y>0]-1)
num = (1-p)*(-dpois(y,mu)+dmutoy)
denom = p*(y==0)+(1-p)*dpois(y,mu)
num/denom
}
score = function(y,eta,logitp) cbind(deta(y,eta,logitp), dlogitp(y,eta,logitp))
And now we call svymle
giving the linear predictors for both parameters. One of the formulas needs to include the response variable \(Y\).
nlmfit = svymle(loglike=loglike, grad=score, design=des,
formulas=list(eta=malepartners~RIDAGEYR + factor(RIDRETH1) + DMDEDUC,
logitp=~RIDAGEYR + factor(RIDRETH1) + DMDEDUC),
start=coef(unwt), na.action="na.omit",method="BFGS")
summary(nlmfit)
## Survey-sampled mle:
## svymle(loglike = loglike, gradient = score, design = des, formulas = list(eta = malepartners ~
## RIDAGEYR + factor(RIDRETH1) + DMDEDUC, logitp = ~RIDAGEYR +
## factor(RIDRETH1) + DMDEDUC), start = coef(unwt), na.action = "na.omit",
## method = "BFGS")
## Coef SE p.value
## eta.(Intercept) 1.826825789 0.154214277 < 0.001
## eta.RIDAGEYR -0.007800690 0.003014997 0.00967
## eta.factor(RIDRETH1)2 -0.119694280 0.235192596 0.61081
## eta.factor(RIDRETH1)3 0.639831600 0.165176912 < 0.001
## eta.factor(RIDRETH1)4 0.615167292 0.117750580 < 0.001
## eta.factor(RIDRETH1)5 0.465555942 0.213462405 0.02919
## eta.DMDEDUC -0.008130865 0.072679440 0.91092
## logitp.(Intercept) 0.578310169 0.246782567 0.01911
## logitp.RIDAGEYR -0.006077533 0.004017016 0.13029
## logitp.factor(RIDRETH1)2 -0.033440316 0.280701007 0.90517
## logitp.factor(RIDRETH1)3 0.124435365 0.095140203 0.19090
## logitp.factor(RIDRETH1)4 -0.151762524 0.086322705 0.07873
## logitp.factor(RIDRETH1)5 0.119530077 0.209380275 0.56808
## logitp.DMDEDUC -0.209112828 0.053553191 < 0.001
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
Finally, we use svy_vglm
, with variances by linearisation
library(svyVGAM)
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=des, crit = "coef")
## Stratified 1 - level Cluster Sampling design (with replacement)
## With (30) clusters.
## svydesign(id = ~SDMVPSU, strat = ~SDMVSTRA, weights = ~WTINT2YR,
## nest = TRUE, data = nhanes_sxq)
##
## Call:
## vglm(formula = formula, family = family, data = surveydata, weights = .survey.prob.weights,
## crit = "coef")
##
##
## Coefficients:
## (Intercept):1 (Intercept):2 RIDAGEYR:1 RIDAGEYR:2
## 0.660504243 1.834068104 -0.007833317 -0.007388071
## factor(RIDRETH1)2:1 factor(RIDRETH1)2:2 factor(RIDRETH1)3:1 factor(RIDRETH1)3:2
## -0.116789371 -0.107331190 0.101971159 0.655136725
## factor(RIDRETH1)4:1 factor(RIDRETH1)4:2 factor(RIDRETH1)5:1 factor(RIDRETH1)5:2
## -0.160804047 0.635814748 0.106778915 0.477412443
## DMDEDUC:1 DMDEDUC:2
## -0.202396659 -0.023738881
##
## Degrees of Freedom: 5050 Total; 5036 Residual
## Log-likelihood: -493703966
and by replicate weights
svy_vglm(malepartners~RIDAGEYR+factor(RIDRETH1)+DMDEDUC, zipoisson(), design=repdes, crit = "coef")