The smerc package is focused on Statistical Methods for Regional Counts. It particularly focuses on the spatial scan method (Kulldorff, 1997) and many of its extensions.
In what follows, we demonstrate some of the basic functionality of the smerc package using 281 observations related to leukeumia cases in an 8 county area of the state of New York. The data were made available in Waller and Gotway (2005) and details are provided there.
The leukemia data are available in the nydf
dataframe. Each row of the dataframe contains information regarding a different region. Each row of the dataframe includes longitude
and latitude
information for the centroid of each region, transformed x
and y
coordinates available in the original dataset provided by Waller and Gotway (2005), as well as the population
of each region and the number of leukemia cases
.
library(smerc) # load package
## # This research was partially supported under NSF grant 1463642
data(nydf) # load data
str(nydf) # look at structure
## 'data.frame': 281 obs. of 6 variables:
## $ longitude : num -75.9 -75.9 -75.9 -75.9 -75.9 ...
## $ latitude : num 42.1 42.1 42.1 42.1 42.1 ...
## $ population: num 3540 3560 3739 2784 2571 ...
## $ cases : num 3.08 4.08 1.09 1.07 3.06 ...
## $ x : num 4.07 4.64 5.71 7.61 7.32 ...
## $ y : num -67.4 -66.9 -67 -66 -67.3 ...
A SpatialPolygonsDataFrame
associated with the regions is included in the nypoly
data set, which we plot below.
data(nypoly) # load nypoly data
library(sp) # load sp package for plotting
plot(nypoly) # plot SpatialPolygonsDataFrame
## Warning in wkt(obj): CRS object has no comment
The spatial scan test can be perfomed using the scan.test
function. The user must supply the coordinates of the region centroids, the number of cases in each region, and the population (i.e., the number persons at risk). There are many additional optional arguments described in the documentation. In the following example, we reduce the number of Monte Carlo simulations used to estimate the p-value.
= nydf[,c("x", "y")] # extract coordinates
coords = nydf$cases # extract cases
cases = nydf$population # extract population
pop = scan.test(coords, cases, pop, nsim = 99) # perform scan test scan_out
## computing statistics for simulated data:
The scan.test
function (and most of the other *.test
functions in the smerc package) will return an object of class smerc_cluster
.
class(scan_out)
## [1] "smerc_cluster"
A smerc_cluster
object has a default print option that summarizes relevant details about the object produced by the function.
In this case, we receive the following:
# print scan.test results
scan_out ## method: circular scan
## statistic: poisson
## simulation: multinomial
## realizations: 99
## population upperbound: 50%
## minimum cases: 2
In this particular case, we see that the scan_out
object summarizes the results of the circular scan
method using a poisson
statistic (the other choice is binomial
). The Monte Carlo p-value was estimated using 99
realizations from a multinomial
distribution (poisson
and binomial
are the other possibilities). The candidate zones were limited to having no more than 50%
of the total population and at least 2
cases had to be in a candidate zone to be considered a cluster. Other methods will provide some of the same information, but some of the information will be different based on the relevant parameters of the models.
A smerc_cluster
also has a summary
function that summarizes the significant (or most likely) clusters returned by the method.
summary(scan_out) # summarize scan.test results
## nregions max_dist cases ex rr stat p
## 1 24 12.3 95.33108 55.8 1.8 13.1 0.01
## 2 11 26.5 49.71990 27.1 1.9 8.0 0.04
The summary
of scan_out
reveals that there were two significant clusters. The first cluster was comprised of 24 regions with a maximum distance of 12.3 between centroids. The number of cases observed in the cluster is 95.33, while 55.8 cases were expected. The relative risk of the cluster is 1.8, with an associated test statistic of 13.1, and a Monte Carlo p-value of 0.01. A second cluster is also summarized with similar information.
A very basic plot
mechanism is provided for smerc_cluster
objects. The plot will show the locations of every centroid, with significant clusters shown with different colors and symbols. An attempt is also made to show the connectivity of the regions. In the plot below, we see the two significant clusters shown in yellow (middle) and purple (bottom).
plot(scan_out) # basic plot of scan.test results
If you want to extract the detected clusters from a smerc_cluster
object, then the clusters
function can be used. The function returns a list: each element of the list is a vector with the indices of the regions included in the cluster. The first element of the list contains the regions comprising the most likely cluster, the second element contains the regions for the second most likely cluster, etc.
clusters(scan_out)
## [[1]]
## [1] 52 50 49 48 15 16 1 51 37 38 47 2 14 39 53 13 34 40 3 44 17 43 12 46
##
## [[2]]
## [1] 88 86 87 89 92 91 85 93 84 90 259
If a polygon-like structure is associated with each region, then the color.clusters
can be used to easily produce nicer plots. E.g., we can use the nypoly
object to create a nicer map of our results.
plot(nypoly, col = color.clusters(scan_out)) #nicer plot of scan.test results
## Warning in wkt(obj): CRS object has no comment
Other scan methods available include:
elliptic.test
}.flex.test
}.rflex.test
}.uls.test
}.dmst.test
}.edmst.test
}.dc.test
}.mlink.test
}.fast.test
}.mlf.test
}.Other non-scan method for the detection of clusters and/or clustering of cases for regional data are provided.
bn.test
performs the Besag-Newell test (Besag and Newell, 1991) and returns a smerc_cluster
. The print
, summary
, and plot
methods are available, as before.
= bn.test(coords = coords, cases = cases, pop = pop, cstar = 20,
bn_out alpha = 0.01) # perform besag-newell test
# print results
bn_out ## method: Besag-Newell
## case radius: 20
## modified p-value: FALSE
summary(bn_out) # summarize results
## nregions max_dist cases ex rr stat p
## 1 5 3.0 20.42516 10.2 2.0 5 0.004132413
## 2 5 13.4 20.15705 10.5 1.9 5 0.006017272
## 3 3 2.0 20.39443 10.8 1.9 3 0.008039295
## 4 5 2.7 25.45904 11.0 2.4 5 0.009113555
## 5 5 2.4 20.29863 11.1 1.9 5 0.009962904
plot(bn_out) # plot results
The Cluster Evaluation Permutation Procedure (CEPP) proposed by Turnbull et al. (1990) can be performed using cepp.test
. The function returns a smerc_cluster
object with print
, summary
, and plot
functions.
# perform CEPP test
= cepp.test(coords = coords, cases = cases, pop = pop,
cepp_out nstar = 5000, nsim = 99, alpha = 0.1)
# print results
cepp_out ## method: CEPP
## simulation: multinomial
## realizations: 99
## population radius: 5000
summary(cepp_out) # summarize results
## nregions max_dist cases ex rr stat p
## 1 2 3.6 9.589796 2.8 3.5 9.6 0.08
plot(cepp_out) # plot results
Tango’s clustering detection test (Tango, 1995) can be performed via the tango.test
function. The dweights
function can be used to construct the weights for the test. The tango.test
function produces and object of class tango
, which has a native print
function and a plot
function that compares the goodness-of-fit and spatial autocorrelation components of Tango’s statistic for the observed data and the simulated data sets.
= dweights(coords, kappa = 1) # construct weights matrix
w = tango.test(cases, pop, w, nsim = 49) # perform tango's test
tango_out # print results
tango_out ## method: Tango's index
## index: 0.0029
## goodness-of-fit component: 0.0026
## spatial autocorrelation component: 0.00033
## chi-square statistic: 210
## chi-square df: 110
## chi-square p-value: 1.4e-08
## Monte Carlo p-value: 0.02
plot(tango_out) # plot results
Nearly all cluster detection methods have both a *.zones
function that returns all the candidate zones for the method and a *.sim
function that is used to produce results for data simulated under the null hypothesis. These are unlikely to interest most users, but may be useful for individuals wanting to better understand how these methods work or are interested in developing new methods based off existing ones. The *.sim
functions are not meant for general use, as they are written to take very specific arguments that the user could easily misuse.
As an example, the following code produces all elliptical-shaped zones considered by the elliptic scan method (Kulldorff et al., 2006) using a population upperbound of no more than 10%. The function returns a list with the 248,213 candidate zones, as well as the associated shape and angle.
# obtain zones for elliptical scan method
= elliptic.zones(coords, pop, ubpop = 0.1)
ezones # view structure of ezones
str(ezones)
## List of 3
## $ zones:List of 239228
## ..$ : int 1
## ..$ : int [1:2] 1 2
## ..$ : int [1:3] 1 2 15
## ..$ : int [1:4] 1 2 15 49
## ..$ : int [1:5] 1 2 15 49 50
## ..$ : int [1:6] 1 2 15 49 50 13
## ..$ : int [1:7] 1 2 15 49 50 13 3
## ..$ : int [1:8] 1 2 15 49 50 13 3 14
## ..$ : int [1:9] 1 2 15 49 50 13 3 14 48
## ..$ : int [1:10] 1 2 15 49 50 13 3 14 48 47
## ..$ : int [1:11] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:12] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:13] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:14] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:15] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:16] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:17] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:18] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:19] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:20] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:21] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:22] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:23] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:24] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:25] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:26] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:27] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:28] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int [1:29] 1 2 15 49 50 13 3 14 48 47 ...
## ..$ : int 2
## ..$ : int [1:3] 2 1 3
## ..$ : int [1:4] 2 1 3 13
## ..$ : int [1:5] 2 1 3 13 15
## ..$ : int [1:6] 2 1 3 13 15 14
## ..$ : int [1:7] 2 1 3 13 15 14 49
## ..$ : int [1:8] 2 1 3 13 15 14 49 47
## ..$ : int [1:9] 2 1 3 13 15 14 49 47 12
## ..$ : int [1:10] 2 1 3 13 15 14 49 47 12 50
## ..$ : int [1:11] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:12] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:13] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:14] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:15] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:16] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:17] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:18] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:20] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:21] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:22] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:23] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:24] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:25] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:27] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int [1:28] 2 1 3 13 15 14 49 47 12 50 ...
## ..$ : int 3
## ..$ : int [1:2] 3 13
## ..$ : int [1:3] 3 13 2
## ..$ : int [1:4] 3 13 2 12
## ..$ : int [1:5] 3 13 2 12 14
## ..$ : int [1:6] 3 13 2 12 14 5
## ..$ : int [1:7] 3 13 2 12 14 5 1
## ..$ : int [1:8] 3 13 2 12 14 5 1 10
## ..$ : int [1:9] 3 13 2 12 14 5 1 10 11
## ..$ : int [1:10] 3 13 2 12 14 5 1 10 11 4
## ..$ : int [1:11] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:12] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:13] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:14] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:15] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:16] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:17] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:18] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:19] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:20] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:21] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:22] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:23] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:24] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:25] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:26] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:27] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int [1:29] 3 13 2 12 14 5 1 10 11 4 ...
## ..$ : int 4
## ..$ : int [1:2] 4 6
## ..$ : int [1:3] 4 6 5
## ..$ : int [1:4] 4 6 5 12
## ..$ : int [1:5] 4 6 5 12 35
## ..$ : int [1:6] 4 6 5 12 35 7
## ..$ : int [1:7] 4 6 5 12 35 7 10
## ..$ : int [1:8] 4 6 5 12 35 7 10 3
## ..$ : int [1:9] 4 6 5 12 35 7 10 3 11
## ..$ : int [1:10] 4 6 5 12 35 7 10 3 11 9
## ..$ : int [1:11] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:12] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:13] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:14] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:15] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:16] 4 6 5 12 35 7 10 3 11 9 ...
## ..$ : int [1:17] 4 6 5 12 35 7 10 3 11 9 ...
## .. [list output truncated]
## $ shape: num [1:239228] 1 1 1 1 1 1 1 1 1 1 ...
## $ angle: num [1:239228] 90 90 90 90 90 90 90 90 90 90 ...
Assuncao, R.M., Costa, M.A., Tavares, A. and Neto, S.J.F. (2006). Fast detection of arbitrarily shaped disease clusters, Statistics in Medicine, 25, 723-742.
Besag, J. and Newell, J. (1991). The detection of clusters in rare diseases, Journal of the Royal Statistical Society, Series A, 154, 327-333.
Costa, M.A. and Assuncao, R.M. and Kulldorff, M. (2012) Constrained spanning tree algorithms for irregularly-shaped spatial clustering, Computational Statistics & Data Analysis, 56(6), 1771-1783.
Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics - Theory and Methods, 26(6): 1481-1496,
Kulldorff, M., Huang, L., Pickle, L. and Duczmal, L. (2006) An elliptic spatial scan statistic. Statististics in Medicine, 25:3929-3943.
Neill, D. B. (2012), Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74: 337-360.
Patil, G.P. & Taillie, C. Upper level set scan statistic for detecting arbitrarily shaped hotspots. Environmental and Ecological Statistics (2004) 11(2):183-197.
Tango, T. (1995) A class of tests for detecting “general” and “focused” clustering of rare diseases. Statistics in Medicine. 14, 2323-2334.
Tango, T., & Takahashi, K. (2005). A flexibly shaped spatial scan statistic for detecting clusters. International journal of health geographics, 4(1), 11. Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics – Theory and Methods 26, 1481-1496.
Tango, T. and Takahashi, K. (2012), A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters. Statist. Med., 31: 4207-4218.
Turnbull, Bruce W., Iwano, Eric J., Burnett, William S., Howe, Holly L., Clark, Larry C. (1990). Monitoring for Clusters of Disease: Application to Leukemia Incidence in Upstate New York, American Journal of Epidemiology, 132(supp1):136-143.
Waller, L.A. and Gotway, C.A. (2005). Applied Spatial Statistics for Public Health Data. Hoboken, NJ: Wiley.
Yao, Z., Tang, J., & Zhan, F. B. (2011). Detection of arbitrarily-shaped clusters using a neighbor-expanding approach: A case study on murine typhus in South Texas. International journal of health geographics, 10(1), 1.