In many geographical studies variables are cross-correlated, i.e. in case of positive correlation high and low trend occur simultaneously. As a result uncertainty in one variable may be statistically dependent on uncertainty in the other. For example, the spatial properties of soil, such as organic carbon (OC) and nitrogen (N) content are cross-correlated. These variables are used to derive C/N ratio, vital information to evaluate soil management and to increase the crop productivity, but maps of OC and N are approximations encumbered with errors. These errors will propagate through the calculation into the C/N prediction.
We can use the Monte Carlo (MC) method to analyse how the error propagates through spatial operations and models (briefly described in the next section). The MC method is fairly straightforward in application, but in case of spatially distributed cross-correlated variables like OC and N one should consider taking spatial cross-correlation into account. That is because the model output (e.g. C/N) may be influenced by the spatial cross-correlation in the input.
The adapted uncertainty propagation analysis approach is based on the Monte Carlo method that computes the output of the model repeatedly, with input values that are randomly sampled from their joint probability distribution function (pdf). The set of model outputs forms a random sample from the output pdf, so that the parameters of the distribution, such as the mean, variance and quantiles, can be estimated from the sample. The method thus consists of the following steps:
Note that the above ignores uncertainty in model parameters and model structure, but these can easily be included if available as pdfs. A random sample from the model inputs can be obtained using an appropriate pseudo-random number generator.
For each uncertain spatially distributed continuous variable, such as soil organic carbon content, rainfall or elevation we assume the following geostatistical model:
Z(x)= μ(x)+ σ(x)∙ε(x)
where x is geographic location, μ is the (deterministic) mean of Z, σ is its standard deviation and ε is is standard normal, second-order stationary stochastic residual, whose spatial autocorrelation is modelled with a semivariogram or correlogram. Note that ε has zero mean and unit variance. Both μ and σ may vary in space so that spatial trends and spatially variable uncertainty can be taken into account. The cross-correlations are modelled using a linear model of co-regionalization (Wackernagel, 2003). The random sample is drawn from the pdf of ε to further calculate a sample from Z.
The example data for C/N calculations are a 250m resolution mean OC and TN (total N) of a 33km x 33km area adjacent to lake Alaotra in Madagascar.
The ‘Madagascar’ dataset contains four spatial objects: a mean OC and TN of the area and their standard deviations. It also has a saved function that calculates C/N using OC and TN that will be used later.
# load packages
library(spup)
library(raster)
library(purrr)
# set seed
set.seed(12345)
# load and view the data
data(OC, OC_sd, TN, TN_sd)
par(mfrow = c(1,2))
class(OC)
[1] "RasterLayer"
attr(,"package")
[1] "raster"
[1] "RasterLayer"
attr(,"package")
[1] "raster"
OC_Madagaskar_xd1_250m
Min. 7.00000
1st Qu. 14.33333
Median 23.00000
3rd Qu. 39.66667
Max. 103.00000
NA's 1545.00000
TN_Madagaskar_xd1_250m
Min. 0.609
1st Qu. 1.000
Median 1.320
3rd Qu. 2.080
Max. 3.740
NA's 1545.000
The first step in uncertainty propagation analysis is to define an uncertainty model for the uncertain input variables, here OC and TN, that will be used in the Monte Carlo uncertainty propagation analysis.
First, the marginal uncertainty model is defined for each variable separately, and next the joint uncertainty model is defined for the variables together.
In case of OC and TN, the ε are spatially correlated and in order to include this in the analysis, we need to describe it by spatial correlograms. For each of the variables, the makeCRM()
function collates all necessary information into a list.
Let us assume that the spatial autocorrelation of the OC and TN errors is an are described by spherical correlation function with a short-distance correlation of 0.6 for OC and 0.4 for TN, and a range parameter of 1000m. It is important at this step to ensure that the correlation function types as well as the ranges are the same for each variable. It is a requirement for further analysis, becasue spup employs the model of co-regionalization (Wackernagel, H. 2003).
# define spatial correlogram models
OC_crm <- makeCRM(acf0 = 0.6, range = 5000, model = "Sph")
TN_crm <- makeCRM(acf0 = 0.4, range = 5000, model = "Sph")
We can view the correlograms by plotting them.
Spatial correlograms summarise patterns of spatial autocorrelation in data and model residuals. They show the degree of correlation between values at two locations as a function of the separation distance between the locations. In the case above the correlation declines with distance, as is usually the case. The correlation is zero for distances greater than 1000m. More about correlograms is included in the DEM vignette.
In order to complete the description of each individual uncertain variable we use the defineUM()
function that collates all information about the OC and TN uncertainty. The minimum information required is:
?defineUM
,# define uncertainty model for the OC and TN
OC_UM <- defineUM(TRUE, distribution = "norm", distr_param = c(OC, OC_sd), crm = OC_crm, id = "OC")
TN_UM <- defineUM(TRUE, distribution = "norm", distr_param = c(TN, TN_sd), crm = TN_crm, id = "TN")
class(OC_UM)
[1] "MarginalNumericSpatial"
[1] "MarginalNumericSpatial"
Both variables are of the same class “MarginalNumericSpatial”. This is one of the requirements for defining a multivariate uncertainty model next. We use the defineMUM()
function to collate information about uncertainty in each variable as above, and information about their cross-correlation. The required function arguments are:
defineUM()
) for the selected variables,The correlation matrix must satisfy a range of conditions: - square, - symmetrical (transposed must be the same as original), - diagonal must all be 1, - all values must belong to [-1, +1] range, - it has to be positive-definite (all eigenvalues must be > 0).
# define multivariate uncertainty model
mySpatialMUM <- defineMUM(UMlist = list(OC_UM, TN_UM),
cormatrix = matrix(c(1,0.7,0.7,1), nrow=2, ncol=2))
class(mySpatialMUM)
[1] "JointNumericSpatial"
Generating possible realities of the selected variables can be completed by using the genSample()
function. The required information to pass to the function includes:
?genSample()
for more details.Additional parameters may be also specified. For example, sampling of spatially correlated variable is based on the ‘gstat’ package that allows for limiting the number of nearest observations that should be used for simulation.
# create possible realizations from the joint distribution of OC and TN
OCTN_sample <- genSample(UMobject = mySpatialMUM, n = 3, samplemethod = "ugs", nmax = 20, asList = FALSE)
Linear Model of Coregionalization found. Good.
[using unconditional Gaussian cosimulation]
Note the argument ‘asList’ has been set to FALSE. This indicates that the sampling function will return an object of a class of the distribution parameters class. This is useful if you want to visualize the sample or compute summary statistics quickly.
class : RasterStack
dimensions : 134, 135, 18090, 6 (nrow, ncol, ncell, nlayers)
resolution : 250, 250 (x, y)
extent : 3024625, 3058375, -2514625, -2481125 (xmin, xmax, ymin, ymax)
crs : +proj=laea +lat_0=5 +lon_0=20 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0
names : OC.sim1, OC.sim2, OC.sim3, TN.sim1, TN.sim2, TN.sim3
min values : 6.0066775, 6.1785652, 5.8742527, 0.5427967, 0.5923655, 0.6093069
max values : 111.267412, 118.963186, 117.585139, 3.928214, 4.127825, 4.178326
Usually the sample must be large to obtain stable results. Let us run the sampling to obtain 100 realizations. This may take a minute.
# create possible realizations from the joint
# distribution of OC and TN
MC <- 500
OCTN_sample <- genSample(UMobject = mySpatialMUM, n = MC,
samplemethod = "ugs", nmax = 20, asList = FALSE)
Linear Model of Coregionalization found. Good.
[using unconditional Gaussian cosimulation]
We can view the means and standard deviations of the sampled OC and TN. If the number of samples was very large then the mean of the sample of each would equal the mean OC and TN, and the sd would equal their sds.
# compute and plot OC and TN sample statistics
# e.g. mean and standard deviation
OC_sample <- OCTN_sample[[1:MC]]
TN_sample <- OCTN_sample[[(MC+1):(2*MC)]]
OC_sample_mean <- mean(OC_sample)
TN_sample_mean <- mean(TN_sample)
OC_sample_sd <- calc(OC_sample, fun = sd)
TN_sample_sd <- calc(TN_sample, fun = sd)
par(mfrow= c(1,2))
plot(OC_sample_mean, main = "Mean of OC realizations")
plot(TN_sample_mean, main = "Mean of TN realizations")
We can also view the cross-corelations between two variables.
# R package GGally provides a nice option for plotting correlations
library(GGally)
# octn <- cbind(as.data.frame(OC_sample[[1]]), as.data.frame(TN_sample[[1]]))
octn <- cbind(as.data.frame(t(OC_sample[1,1,])), as.data.frame(t(TN_sample[1,1,])),
as.data.frame(t(OC_sample[1,2,])), as.data.frame(t(TN_sample[1,2,])),
as.data.frame(t(OC_sample[10,10,])), as.data.frame(t(TN_sample[10,10,])))
colnames(octn) <- c("OC_loc1", "TN_loc1", "OC_loc2", "TN_loc2", "OC_loc3", "TN_loc3")
ggscatmat(data = octn, alpha=0.15)
In order to perform uncertainty propagation analysis using ‘spup’, the model through which uncertainty is propagated needs to be defined as an R function. A pre-defined model that calculates C/N using OC and TN as input is provided.
function (OC, TN)
{
OC/TN
}
The propagation of uncertainty occurs when the model is run with the uncertain inputs. Running the model with a sample of realizations of uncertain input variable(s) yields an equally large sample of model outputs that can be further analyzed. To run the C/N ratio model with the OC and TN realizations we use the propagate()
function. The propagate()
function takes as arguments:
In order to run the propagation function the samples of uncertain input variables must be saved in lists and then collated into a list of these lists. We can either coerce the existing OCTN_sample object or get it automatically setting up the ‘asList’ argument of genSample()
to TRUE.
# coerce a raster stack to a list
# in our example we consider two variables, so we need a list of two lists with realizations
l <- list()
l[[1]] <- map(1:100, function(x){OCTN_sample[[x]]})
l[[2]] <- map(101:200, function(x){OCTN_sample[[x]]})
OCTN_sample <- l
# or sample from uncertain input and return it automatically in a list by setting asList argument to TRUE (default)
OCTN_sample <- genSample(UMobject = mySpatialMUM, n = MC, samplemethod = "ugs", nmax = 20, asList = TRUE)
Linear Model of Coregionalization found. Good.
[using unconditional Gaussian cosimulation]
# run uncertainty propagation
CN_sample <- propagate(realizations = OCTN_sample,
model = C_N_model_raster, n = MC)
We can now view the sample of model output realizations (i.e. C/N) and visualize uncertainty by calculating and plotting the sample mean and standard deviation. In our case we need to coerce the output of the propagate()
function saved as a list back to a RasterStack.
# coerce C/Ns list to a RasterStack
CN_sample <- stack(CN_sample)
names(CN_sample) <- paste("CN.", c(1:nlayers(CN_sample)), sep = "")
# view the sample of the model output
par(mfrow = c(1,1))
plot(CN_sample[[1:6]])
# compute and plot the slope sample statistics
# e.g. mean and standard deviation
CN_mean <- mean(CN_sample)
CN_sd <- calc(CN_sample, fun = sd)
par(mfrow = c(1,2))
plot(CN_mean, main = "C/N mean")
plot(CN_sd, main = "C/N sd")
We can also view C/N realizations at specific locations, for example where its mean is highest and lowest:
l_mean <- mean(CN_sample[which.min(OC)])
l_sd <- sd(CN_sample[which.min(OC)])
h_mean <- mean(CN_sample[which.max(OC)])
h_sd <- sd(CN_sample[which.max(OC)])
par(mfrow = c(1,2))
hist(CN_sample[which.min(OC)], main = paste("C/N at lowest OC,", "\n",
"mean = ", round(l_mean, 2), ", sd = ", round(l_sd, 2), sep = ""), xlab = "C/N")
hist(CN_sample[which.max(OC)], main = paste("C/N at highiest OC,", "\n",
"mean = ", round(h_mean, 2), ", sd = ", round(h_sd, 2), sep = ""), xlab = "C/N")
We can also look at specific quantiles of the C/N ratio sample. The function quantile_MC_sgdf
implemented in ‘spup’ allow us to do it quickly.
# calculate quantiles
CN_sample_df <- as(CN_sample, "SpatialGridDataFrame")
CN_q <- quantile_MC_sgdf(CN_sample_df, probs = c(0.1, 0.25, 0.75, 0.9), na.rm = TRUE)
spplot(CN_q[c(3,4,1,2)], main = list(label = "Quantiles of C/N realizations", cex = 1))
We may also identify all locations where the C/N ratio is smaller than 24, with 90% probability. This information might be used by farmers to identify which plots require action on improving soil quality.
CN_q$good4crops <- factor(ifelse(CN_q$prob10perc > 20, 1, 0), labels = c("<90% certain", ">90% certain"))
spplot(CN_q, "good4crops", col.regions = c("firebrick1", "darkolivegreen2"), main = "Areas with sufficient C/N for cropping")
We can also calculate OC and TN uncertainty contribution to the uncertainty in predicting C/N. We have already calculated the total uncertainty in C/N predictions assuming both OC and TN are uncertain. To identify their contributions we can run the propagation with only one of them being uncertain. For the other input we take its mean as certain information.
# calculate total variance as a measure of total uncertainty
CN_tot_var <- calc(CN_sample, fun = var)
# OC contribution
OC_sample <- OCTN_sample[[1]]
CN_sample_oc <- propagate(realizations = OC_sample, model = C_N_model_raster, TN = TN, n = MC)
CN_sample_oc <- stack(CN_sample_oc)
CN_oc_var <- calc(CN_sample_oc, fun = var)
OC_contribution <- (CN_oc_var/CN_tot_var)*100
# TN contribution
TN_sample <- OCTN_sample[[2]]
CN_sample_tn <- propagate(realizations = TN_sample, model = C_N_model_raster, OC = OC, n = MC)
CN_sample_tn <- stack(CN_sample_tn)
CN_tn_var <- calc(CN_sample_tn, fun = var)
TN_contribution <- (CN_tn_var/CN_tot_var)*100
# plot results
par(mfrow = c(2,2))
plot(CN_mean, main = "C/N mean")
plot(CN_sd, main = "C/N sd")
plot(OC_contribution, main = "OC contribution to total C/N var [%]")
plot(TN_contribution, main = "TN contribution to total C/N var [%]")
The dataset was derived from the ISRIC Soil Grid database (www.soilgrids.org) (Hengl et al., 2017).
This project has received funding from the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no 607000.
HENGL, T., MENDES DE JESUS, J., HEUVELINK, G. B. M., RUIPEREZ GONZALEZ, M., KILIBARDA, M., BLAGOTIĆ, A., SHANGGUAN, W., WRIGHT, M. N., GENG, X., BAUER-MARSCHALLINGER, B., GUEVARA, M. A., VARGAS, R., MACMILLAN, R. A., BATJES, N. H., LEENAARS, J. G. B., RIBEIRO, E., WHEELER, I., MANTEL, S. & KEMPEN, B. 2017. SoilGrids250m: Global gridded soil information based on machine learning. PLOS ONE, 12, e0169748.
WACKERNAGEL, H. 2003. Multivariate Geostatistics: An Introduction with Applications, Springer.