Predicting the distribution of species across space and time is a fundamental piece for answering a multitude of questions surrounding the field of ecology. Consequentially, numerous statistical tools have been developed to assist ecologists in making these types of predictions; the most common of which is certainly species distribution models (SDMs).
These types of models are heavily reliant on the quality and quantity of data available across multiple time periods in order to produce meaningful results. Thankfully, the \(21^{st}\) century has been associated with a boom in digital technology and the collapse of data storage costs achieved, allowing us to see data on the environment and species occurrences grow at unprecedented rates. The problem however is that all these data are disparate in nature; they all have their own sampling protocols, assumptions and attributes, making joint analysis of them difficult.
On account of this issue, the field of integrated species distribution modelling has progressed considerably over the last few years – and now the associated methods are well-developed, and the benefits of using such models (over single dataset or simple data-pooling models) are apparent.
A major handicap to integrated modeling lies in the fact that the tools required to make inference with these models have not been established – especially with regards to general software to be used by ecologists. This in turn has stagnated the growth of applying these models to real data.
In light of this impediment to the field of integrated modelling, we decided to make PointedSDMs, an R package to help simplify the modelling process of integrated species distribution models (ISDMs) for ecologists. It does so by using the INLA framework (Rue, Martino, and Chopin 2009) to approximate the models and by constructing wrapper functions around those provided in the R package, inlabru (Bachl et al. 2019).
This R markdown file illustrates an example of modelling an ISDM with PointedSDMs, using five disparate datasets containing species Setophaga collected around Pennsylvania state (United States of America). The first part of this file contains a brief introduction to the statistical model used in ISDMs, as well as an introduction to the included functions in the package. The second part of this file uses PointedSDMs to run the ISDM for the provided data. We note that, for this particular vignette, no inference was made due to computational intensity of the model. However the R script and data are provided below so that the user may carry out inference.
##Install if need be
library(PointedSDMs)
The goal of our statistical model is to infer the “true” distribution of our species’ using the observations we have at hand, as well as environmental covariates describing the studied area, \(\boldsymbol{X}\), and other parameters describing the species, \(\boldsymbol{\theta}\). To do so, we construct a hierarchical state-space model composing of two parts: the underlying process model as well as observation models.
The process model is a stochastic process which describes how the points are distributed in space. The most commonly assumed process model is the Log-Gaussian Cox process (LGCP), which is characterized by the intensity function \(\lambda(s)\): such that a higher a higher intensity at a location implies the species is more abundant there, as well as a Gaussian random field used to account for all the environmental covariates not included in the model.
The observation models give a statistical description of the data collection process, and are chosen for each dataset based on their underlying sampling protocols (see (Isaac et al. 2020) for a detailed review of the types of observation models typically used in integrated models). As a result, we can write the likelihood of the observation model for a given dataset, \(Y_i\) as \(\mathcal{L} \left( Y_i \mid \lambda(s), \theta_i \right)\).
Therefore, given a collection of heterogeneous datasets \(\{Y_1, Y_2,...,Y_n\}\), the full-likelihood of the statistical process is given by:\[\mathcal{L} \left( \boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\theta}, \boldsymbol{\phi} \right) = p\left( \lambda(s), \boldsymbol{X}, \boldsymbol{\phi} \right) \times \prod_{i=1}^n \mathcal{L} \left(Y_i \mid \lambda(s), \theta_i\right),\]where \(\boldsymbol\phi\) are parameters assumed for the underlying process.
PointedSDMs is designed to simplify the modelling process of integrated species distribution models by using wrapper functions around those found in R-INLA and inlabru. The simplification is regarding four key steps in the statistical-modelling process:
data preparation,
model fitting,
cross-validation,
prediction.
The first function of interest with regards to these steps is
intModel
, which is designed to create objects required by
inlabru, assign the relevant information to individual objects
used in model specification, as well as structure the provided datasets
together into a homogenized framework.
args(intModel)
The output of this function is an R6 object, and so there
are several slot functions included to help specify the model at a finer
scale. Use ?dataSDM
to get a comprehensive description of
these slot functions.
The next function used is fitISDM
which takes the data
object created with intModel
and runs the integrated model.
The output of this function is in essence an inlabru model
output with additional information to be used by the sequential
functions.
args(fitISDM)
Spatial blocked cross-validation may be performed using the
blockedCV
function – which iteratively calculates and
averages the deviance information criteria (DIC) for models run without
data from the selected spatial block. Before this function is used,
.$spatialBlock()
from the object produced by
intModel
is required in order to specify how the data
should be spatially blocked (see below for an example).
args(blockedCV)
datasetOut
provides cross-validation for integrated
models by running the full model with one less dataset, and calculating
a score measuring the relative importance of the left out dataset in the
full model.
args(datasetOut)
Finally prediction of the full model may be completed using the
predict
function; the arguments here are mostly related to
specifying which components are required to be predicted, however the
function can act identically to the inlabru
predict
function if need be. After predictions are
complete, plots may be made using the generic plot
function.
This example aims to predict the distribution of three species of genus setophaga across Pennsylvania state. This example is notable in integrated modelling since it has been used by two seminal papers in the field, namely those by: Isaac et al. (2020) and Miller et al. (2019). This file extends the example by adding two additional datasets containing a further two species.
The first step in our analysis is to load in the packages required.
library(INLA)
library(inlabru)
library(USAboundaries)
library(sp)
library(sf)
library(raster)
library(rasterVis)
library(ggmap)
library(sn)
library(RColorBrewer)
library(cowplot)
library(knitr)
library(kableExtra)
library(dplyr)
library(spocc)
Finally, additional objects required by PointedSDMs and the R-INLA (Martins et al. 2013) and inlabru (Bachl et al. 2019) packages need to be assembled.
A SpatialPolygons object of Pennsylvania (obtained using the
USAboundaries (rOpenSci 2018)
package) is the first of these objects required, and it will be used to
construct an inla.mesh object as well as help with the plotting
of graphics. In order to create one of these polygons objects, we are
first required to define the projection reference system (see
?CRS
).
<- CRS("+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0")
proj
<- USAboundaries::us_states(states = "Pennsylvania")
PA <- PA$geometry[1]
PA <- as(PA, "Spatial") PA
For our analysis, we used five datasets containing the geocoded locations of our studed species’. The first three of these datasets are present only datasets obtained via eBird through the spocc package (Chamberlain 2021), which is a toolkit used to obtain species’ observation data from a selection of popular online data repositories (GBIF, Vernet, BISON, iNaturalist and eBird). These data were obtained using the following script:
<- c('caerulescens', 'fusca', 'magnolia')
species
<- list()
dataSets for (bird in species) {
<- spocc::occ(
raw_data query = paste('Setophaga', bird),
from = "gbif",
date = c("2005-01-01", "2005-12-31"),
geometry = PA@bbox)$gbif
<- grep("EBIRD", raw_data$data[[paste0('Setophaga_', bird)]]$collectionCode)
rows <- c("longitude", "latitude")
cols <- data.frame(raw_data$data[[paste0('Setophaga_', bird)]][rows, cols])
raw_coords
colnames(raw_coords) <- c("X", "Y")
<- SpatialPointsDataFrame(coords = raw_coords[, c('X', 'Y')],
data_sp data = data.frame(Species_name = rep(bird, nrow(raw_coords))),
proj4string = proj)
paste0('eBird_', bird)]] <- data_sp[!is.na(over(data_sp, PA)),]
dataSets[[
}
The PointedSDMs package also contains two additional structured datasets (BBA, BBS), which were both assumed to be presence absence datasets in the study conducted by Isaac et al. (2020), containing the variable, NPres, denoting the presence (or number of presences in the BBS dataset) at each sampling location. However, we changed the NPres variable name of the BBS dataset to Counts in order to consider it a counts dataset for illustrative purposes. No additional changes were made to the BBA dataset, and so it was considered a presence absence dataset as originally intended. These datasets may be loaded using the following script:
data('SetophagaData')
'BBA']] <- SetophagaData$BBA
dataSets[['BBS']] <- SetophagaData$BBS dataSets[[
Dataset name | Sampling protocol | Number of observations | Species name | Source |
---|---|---|---|---|
BBS | Counts | 45 | Caerulescens | North American Breeding Bird Survey |
BBA | Detection/nondetection | 5165 | Caerulescens | Pennsylvania Breeding Bird Atlas |
eBird_caerulescens | Present only | 264 | Caerulescens | eBird |
eBird_magnolia | Present only | 354 | Magnolia | eBird |
eBird_fusca | Present only | 217 | Fusca | eBird |
Table 1: Table illustrating the different datasets used in the analysis.
Species distribution models study the relationship between our in-situ species observations and the underlying environment, and so in line with the study conducted by Isaac et al. (2020), we considered two covariates: elevation, describing the height above sea level (meters) and canopy, describing the proportion of tree canopy covered in the area. These two covariates were obtained from the Hollister and Shah (2017) and Bocinsky et al. (2019) R packages respectively. The script to obtain these spatial covariates is provided in the same github repository as the species occurrence data above; the only thing we did differently was keep the covariates as a Raster object, whereas the above script converts them to SpatialPixelsDataFrames.
<- scale(stack(SetophagaData$elev_raster, SetophagaData$NLCD_canopy_raster))
covariates names(covariates) <- c('elevation', 'canopy')
The R-INLA package requires a Delaunay triangulated mesh
used to approximate our spatial random fields, which is created by
supplying the max.edge
, offset
and
cutoff
arguments as well as our SpatialPolygons
map of Pennsylvania to the inla.mesh.2d
function. With this
mesh, intModel
will create integration points required by
our model using inlabru’s ipoints
function.
<- inla.mesh.2d(boundary = inla.sp2segment(PA),
mesh cutoff = 0.2,
max.edge = c(0.1, 0.24),
offset = c(0.1, 0.4))
$crs <- proj
mesh
<- ggplot() +
mesh_plot gg(mesh) +
ggtitle('Plot of mesh') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
mesh_plot
intModel
The intModel
function is the first step in the modeling
of our species data. The aim of the function is to organize and
standardize our data to ensure that each process is correctly modeled in
the integrated framework. For this standardization to work, we specify
the response variable names of the counts data
(responseCounts
) and present absence
(responsePA
), the coordinate names
(Coordinates
) as well as the coordinate reference system
used (Projection
). In addition, the species variable name
(speciesName
) was also specified in order to model the
effects for the species found in the datasets.
<- intModel(dataSets,
spatial_data Coordinates = c('X', 'Y'),
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates, speciesName = 'Species_name')
As mentioned above, there are some slot functions now inside the
spatial_data
objects, which allow for a finer level of
customization of the components for the integrated model.
.$plot
Let’s have a look at a graphical representation of the species’
distribution, using the .$plot()
function within
spatial_data
.
$plot(Boundary = FALSE) +
spatial_datagg(PA) +
theme_bw() +
ggtitle('Plot of the datasets') +
theme(plot.title = element_text(hjust = 0.5))
This will create a plot of the location of the species based on their
dataset, however if we wanted to plot by species, we could specify this
with the Species = TRUE
argument.
$plot(Species = TRUE, Boundary = FALSE) +
spatial_datagg(PA) +
theme_bw() +
ggtitle('Plot of the species') +
theme(plot.title = element_text(hjust = 0.5))
.$specifySpatial
When conducting Bayesian analysis, the choice of prior distribution
is imperative. In our example, we chose Penalizing complexity priors
(Simpson et al. 2017), which are designed
to control the spatial range and marginal standard deviation in the
GRF’s Matérn covariance function in order to reduce over-fitting in the
model. This can be done with ease using the
.$specifySpatial
function, which will also update all the
components associated with this field automatically.
$specifySpatial(sharedSpatial = TRUE,
spatial_dataprior.sigma = c(5, 0.01),
prior.range = c(1, 0.01))
.$addBias
Presence only datasets (such as our eBird data) are often
noted to contain numerous biases, which in turn can heavily skew results
obtained from our model. A common approach to account for these biases
is by using a variable in the model such as sampling effort (some
examples include: distance traveled, number of observations per visit,
duration spent observing). However in the absence of such variables,
running a second spatial field in the model can improve performance of
the integrated model (Simmonds et al.
2020). To add a second spatial field to our model, we use the
.$addBias
function, which is part of the intModel
object created above.
$addBias('eBird_caerulescens')
spatial_data$addBias('eBird_fusca')
spatial_data$addBias('eBird_magnolia') spatial_data
.$priorsFixed
Suppose we knew a priori what the mean and precision values
of one of the fixed terms for a given species was: we could add this
knowledge to the model using the .$priorsFixed
function.
$priorsFixed(Effect = 'elevation', Species = 'fusca',
spatial_datamean.linear = 2, prec.linear = 0.05)
.$changeComponents
If we would like to see what the components required by
inlabru in our model are, we can use the
.$changeComponents
function with no arguments
specified.
$changeComponents() spatial_data
This model is thus specified with: two environmental covariates, an
intercept term per species, a spatial field per species, the three bias
fields for the presence only data as well as a shared spatial
field between the different datasets. If we wanted to add/change or
remove a component entirely, this could be done with the
addComponent
and removeComponent
argument
respectively.
.$spatialBlock
.$spatialBlock
is used to set up spatial blocked
cross-validation for the model by assigning each point in the datasets a
block based on where the point is located spatially. For this example,
we chose four blocks (k=4
) for our model, based around a
2x2 grid (rows = 2, cols = 2
). See the figure below for an
illustration of the spatial block: the amount of data within each block
appears reasonably equal.
$spatialBlock(k = 4, rows = 2, cols = 2, plot = TRUE) + theme_bw() spatial_data
blockedCV
Spatial blocked cross-validation may then be produced using the
blockedCV
function. We specify
control.inla = list(int.strategy = 'eb')
in our
options
argument in order to speed up the model running
time.
<- blockedCV(data = spatial_data, options = list(control.inla = list(int.strategy = 'eb'))) spatialBlocked
spatialBlocked
More so, we can compare the cross-validation score from this model to
one without the shared spatial effect (specified with
pointsSpatial = FALSE
in intModel
).
<- intModel(dataSets,
no_fields Coordinates = c('X', 'Y'),
pointsSpatial = NULL,
Projection = proj, Mesh = mesh,
responsePA = 'NPres', responseCounts = 'Counts',
spatialCovariates = covariates, speciesName = 'Species_name')
$spatialBlock(k = 4, rows = 2, cols = 2) no_fields
<- blockedCV(data = no_fields, options = list(control.inla = list(int.strategy = 'eb'))) spatialBlocked_no_fields
spatialBlocked_no_fields
Based on the DIC scores, we conclude that the model with the shared spatial field provides a better fit of the data.
fitISDM
The integrated model is easily run with the fitISDM
function as seen below by supplying the occurrence and R-INLA
objects (created above with intModel
).
<- fitISDM(data = spatial_data,
joint_model options = list(control.inla = list(int.strategy = 'eb')))
<- joint_model$summary.fixed %>%
results_plot mutate(species = gsub('_.*$','',
row.names(joint_model$summary.fixed))) %>%
mutate(coefficient = row.names(joint_model$summary.fixed))
<- ggplot(results_plot, aes(x = coefficient, y = mean)) +
coefficient_plot geom_hline(yintercept = 0, colour = grey(0.25), lty = 2) +
geom_point(aes(x = coefficient,
y = mean)) +
geom_linerange(aes(x = coefficient,
ymin = `0.025quant`,
ymax = `0.975quant`,
col = species),
lwd = 1) +
theme_bw() +
scale_colour_manual(values = c('#003f5c', '#bc5090','#ffa600')) +
theme(legend.position="bottom",
plot.title = element_text(hjust = 0.5)) +
ggtitle("95% credibility intervals of the fixed effects\n
for the three studied species") +
labs(x = 'Variable', y = 'Coefficient value') +
coord_flip()
coefficient_plot
datasetOut
PointedSDMs also includes the function
datasetOut
, which iteratively omits one dataset (and its
associated marks) out of the full model. For example, we can calculate
the effect on the other variables by leaving out the dataset
BBA, which contributes the most occurrences used in our
joint-likelihood model by a significant amount.
Furthermore, by setting predictions = TRUE
we are able
to calculate some cross-validation score by leaving out the selected
dataset. This score is calculated by predicting the covariate values of
the left-out using the reduced model (i.e the model with the dataset
left out), using the predicted values as an offset in a new model, and
then finding the difference between the marginal-likelihood of the full
model (i.e the model with all the datasets considered) and the
marginal-likelihood of the offset model.
<- datasetOut(model = joint_model,
dataset_out dataset = "BBA",
predictions = TRUE)
dataset_out
predict
A significant part of SDMs is creating prediction maps to understand
the species’ spread. Predictions of the integrated SDM’s from
fitISDM
are made easy using the predict
function. By supplying the relevant components, the predict function
will create the formula required by inlabru’s
predict
function to make predictive maps. In this example,
we made predictions for the spatial fields of the species, on the linear
scale using 1000 samples.
<- predict(joint_model, mesh = mesh, mask = PA,
projections spatial = TRUE,
fun = 'linear', n.samples = 1000)
plot
PointedSDMs also provides a general plotting function to
plot the maps using the predicted values obtained in the
predict
function. The plot below shows the predictions of
the three spatial fields of our species.
plot(projections, whattoplot = 'mean',
colourLow = 'orange', colourHigh = 'dark red')