The AQuadtree package provides an automatic aggregation tool to anonymise point data. The framework proposed seeks the data accuracy at the smallest possible areas preventing individual information disclosure. Aggregation and local suppression of point data is performed using a methodology based on hierarchical geographic data structures. The final result is a varying size grid adapted to local area population densities described in @Lagonigro2017.
The grid is created following the guidelines for grid datasets of the GEOSTAT project [@GEOSTAT1B2014] and the INSPIRE grid coding system is adopted as defined in the INSPIRE Data specifications [@INSPIRE2010]. Geospatial specifications use the European Terrestrial Reference System 89, Lambert Azimuthal Equal Area (ETRS89-LAEA) projection [@Annoni2003], although other Coordinate Reference Systems (CRS) and projections are also be used with the package. In the definition of the grid dataset, each cell is identified by a code composed of the cell's size and the coordinates of the lower left cell corner in the ETRS89-LAEA system. The cell's size is denoted in meters (“m”) for cells' sizes up to 1000 meters, or kilometers (“km”) for cells' sizes from 1000 meters and above. To reduce the length of the string, values for northing and easting are divided by 10n (where “n” is the number of zeros in the cell size value measured in meters).
The cell code “1kmN2599E4695“ identifies the 1km grid cell with coordinates of the lower left corner: Y=2599000m, X=4695000m.
The aggregation algorithm implemented in the package builds an initial regular grid of a given cell size, identifying each cell with the corresponding cell code. Each initial cell is recursively subdivided in quadrants where each new cell is assigned a second identifier containing a sequence of numbers to indicate the position of the cell in the disaggregation scheme. For instance, the sequence identifier corresponding to the right top cell in the right image in Figure \ref{fig:Figure 1} would be 416, i.e. fourth cell in the first division, and sixteenth cell in the second division.
To ensure data privacy, a cell is only split if all the resulting subdivisions satisfy the threshold restriction on the number of points. In cases of very irregular point pattern, this restriction results in less accuracy on the cell resolution. For instance, Figure \ref{fig:Figure 2}a presents a pattern of 932 points unevenly distributed on a 1km cell and Figure \ref{fig:Figure 2}b shows the corresponding grid of 62.5m cells with no threshold restrictions (the total number of points aggregated in each cell is shown).
If we define an anonymity threshold of 17, the cell in Figure \ref{fig:Figure 2}a can not be subdivided because one of the four resulting quadrants contains only 4 points. The privacy mechanism aggregates all the points, as presented in Figure \ref{fig:Figure 3}a, and covers an irregular spatial distribution. The AQuadtree algorithm contemplates the suppression of some points before continuing the disaggregation. For instance, suppressing the 4 points in the top right quadrant of Figure \ref{fig:Figure 2}b results in the disaggregation shown in Figure \ref{fig:Figure 3}b, which clearly is much more accurate to the underlying spatial distribution. Moreover, the elimination of more data points would lead to further disaggregation (Figure \ref{fig:Figure 3}c and Figure \ref{fig:Figure 3}d).
In order to balance information loss and resolution accuracy on the process of splitting a cell, the method computes the Theil inequality measure [@theil1972statistical] for the number of points in the possible quadrants as well as the percentage of points needed to be suppressed to force the division. In those cases where the anonymity threshold value prevents disaggregation, high values on the inequality measure may suggest the need for further subdivision, while high values on the loss rate may suggest to stop this subdivision. The algorithm uses default limits for both measures: 0.25 and 0.4. respectively (both values can be defined between 0 and 1). Thus, if there exists any sub-cell with a number of points lower than the anonymity threshold and the inequality measure is higher than 0.25, then the disaggregation process continues by suppressing those points as long as the loss rate is lower than 0.4. Hence, following with example in Figure \ref{fig:Figure 2}, the default disaggregation produced by the method would be the one shown in Figure \ref{fig:Figure 3}b.
All the suppressed points during the process are aggregated in a cell with the initial dimension so their information does not disappear. This cell is marked as a residual cell. Following with the example in Figure \ref{fig:Figure 2}, if the number of suppressed points overcome the anonymity threshold, as for instance, in Figure 3d, the 29 suppressed points are aggregated in a cell of the initial given dimension, which will be marked as a residual cell (see Figure \ref{fig:Figure 4}).
An AQuadtree class object is a spatial dataset representing a varying size grid and is created performing an aggregation of a given set of points considering a minimum threshold for the number of points in each cell. The AQuadtree main function of the package creates the AQuadtree object from SpatialPoints
_ or _SpatialPointsDataFrame
objects.
example.QT <- AQuadtree(CharlestonPop)
class(example.QT)
## [1] "AQuadtree"
## attr(,"package")
## [1] "AQuadtree"
The AQuadtree class proposes a collection of methods to manage the generated objects and overrides the generic methods show
, _print
, _summary
_ and _[
(subsetting) for the AQuadtree signature. The plot
_ method overrides the generic function for plotting R objects with an extra parameter to specify if residual cells should be plotted. The _spplot
_ function overrides the lattice-based plot method from sp package [@Pebesma2005], with two extra parameters to control if residual cells should be displayed, and wether attributes should be divided by the cell areas to make different zones comparable. The _merge
_ method merges data from an input data frame to the given AQuadtree object. The _writeOGR.QT
_ method coerces the given AQuadtree object to a _SpatialPolygonsDataframe
_ and uses the writeOGR method from rgdal package [@bivand2014rgdal] to write out spatial vector data. An AQuadtree object can also be coerced to a SpatialPolygonsDataFrame using the generic method _as
from methods package.
bcn.QT <- AQuadtree(BarcelonaPop)
plot(bcn.QT)
spplot(bcn.QT, by.density = TRUE)
The characteristics of the AQuadtree object can be adjusted with various parameters. First, the dim
_ parameter defines the size in meters of the highest scale cells and the _layers
_ parameter indicates the number of disaggregation levels. Thus, specifying the parameters _dim=10000
_ and _layers=4
would create a grid with cells of sizes between 10km and 1.25km. The default values establish an initial size of 1000 meters and 3 levels of disaggregation.
charleston.QT <- AQuadtree(CharlestonPop, dim = 10000, layers = 4)
summary(charleston.QT)
## Object of class "AQuadtree"
## 180 grid cells with sizes between 10km and 1.25km
## Coordinates:
## min max
## x 2060000 2160000
## y 110000 220000
## Is projected: TRUE
## proj4string:
## +proj=lcc +lat_0=36.6666666666667 +lon_0=-98.5 +lat_1=38.5666666666667
## +lat_2=37.2666666666667 +x_0=400000 +y_0=400000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0
## +units=m +no_defs
## Initial Cell Size: 10km
## Number of valid grid Cells: 176
## Number of residual grid Cells: 4
## Data attributes:
## total
## Min. : 100.0
## 1st Qu.: 157.8
## Median : 226.5
## Mean : 292.2
## 3rd Qu.: 370.2
## Max. :2281.0
The colnames
_ parameter specifies the columns on the original dataset to summarize in the resulting grid. An extra attribute _total
, containing the number of points in each cell is automatically created and added to the dataframe. On the aggregation process, attributes specified in _colnames
parameter will be summarized using the 'sum' function. A list of alternative summarizing functions can can be provided with the _funs
_ parameter. If any attribute indicated in the _colnames
_ parameter is a factor, the function creates a new attribute for each label of the factor. For instance, an attribute sex with two labels, _man
_ and _woman
, would be deployed into the two attributes _sex.man
and _sex.woman
.
class(BarcelonaPop$sex)
## [1] "factor"
levels(BarcelonaPop$sex)
## [1] "man" "woman"
bcn.QT <- AQuadtree(BarcelonaPop, colnames = names(BarcelonaPop), funs = c("mean",
"sum"))
summary(bcn.QT)
## Object of class "AQuadtree"
## 321 grid cells with sizes between 1km and 125m
## Coordinates:
## min max
## x 3659000 3670000
## y 2062500 2074500
## Is projected: TRUE
## proj4string:
## +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## Initial Cell Size: 1km
## Number of valid grid Cells: 313
## Number of residual grid Cells: 8
## Data attributes:
## total age sex.man sex.woman
## Min. : 100 Min. :35.28 Min. : 40.0 Min. : 44.0
## 1st Qu.: 139 1st Qu.:42.37 1st Qu.: 64.0 1st Qu.: 73.0
## Median : 177 Median :44.42 Median : 83.0 Median : 95.0
## Mean : 248 Mean :44.16 Mean :117.1 Mean :130.9
## 3rd Qu.: 328 3rd Qu.:46.18 3rd Qu.:158.0 3rd Qu.:170.0
## Max. :1288 Max. :51.18 Max. :626.0 Max. :662.0
The package applies a default anonymity threshold value of 100 and it can be changed with the threshold
_ parameter. If nothing else is indicated, the threshold restriction is applied only to the total number of points aggregated in each cell (i.e. the _total
_ attribute added to the resulting dataset). When some of the attributes include confidential information, the threshold restriction can be applied to various properties with the _thresholdField
parameter, indicating the list of attributes from the resulting dataset that must satisfy that given threshold.
bcn.QT <- AQuadtree(BarcelonaPop, colnames = c("age", "sex"), funs = c("mean", "sum"),
threshold = 17, thresholdField = c("sex.man", "sex.woman"))
summary(bcn.QT)
## Object of class "AQuadtree"
## 730 grid cells with sizes between 1km and 62.5m
## Coordinates:
## min max
## x 3659000 3670000
## y 2062000 2075000
## Is projected: TRUE
## proj4string:
## +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## Initial Cell Size: 1km
## Number of valid grid Cells: 713
## Number of residual grid Cells: 17
## Data attributes:
## total age sex.man sex.woman
## Min. : 34.0 Min. :32.63 Min. : 17.00 Min. : 17.00
## 1st Qu.: 64.0 1st Qu.:41.52 1st Qu.: 30.25 1st Qu.: 33.00
## Median :103.0 Median :43.87 Median : 49.00 Median : 54.00
## Mean :110.5 Mean :43.71 Mean : 52.25 Mean : 58.21
## 3rd Qu.:140.0 3rd Qu.:46.08 3rd Qu.: 65.75 3rd Qu.: 74.00
## Max. :807.0 Max. :53.46 Max. :371.00 Max. :436.00
In order to control the disaggregation process, two more parameters set the thresholds on the inequity and loss rate. The extra parameter ineq.threshold
, a rate between 0 and 1, specifies a threshold to force disaggregation when there is high inequality between sub-cells. The Theil entropy measure as computed in the _ineq
package [@zeileis2009package] is used to measure inequality for each cell. The _ineq.threshold
_ parameter defaults to 0.25. Lower values in the _ineq.threshold
produce grids with smaller cells (see Figure \ref{fig:Figure 6}).
bcn.QT <- AQuadtree(BarcelonaPop, threshold = 5, ineq.threshold = 0.01)
plot(bcn.QT)
bcn.QT <- AQuadtree(BarcelonaPop, threshold = 5, ineq.threshold = 0.5)
plot(bcn.QT)
On the other side, the parameter loss.threshold
, also a rate between 0 and 1, indicates a rate of loss to prevent disaggregation of cells. A low value states that lower loss is preferred on the resulting grid so less disaggregation is obtained.
A call to the AQuadtree function will return an AQuadtree class object with six slots indicating the parameters used on the creation of the grid:
dim
: scale in meters of the highest level cells layers
: number of subdivision levelscolnames
: attribute names summarized in the resulting grid threshold
: the value used for anonymizationthresholdField
: attribute names to which the threshold restriction has been appliedloss
: number of points discarded during the process of disaggregation because of the thresholdbcn.QT <- AQuadtree(BarcelonaPop)
slotNames(bcn.QT)
## [1] "dim" "layers" "colnames" "threshold"
## [5] "thresholdField" "loss" "data" "polygons"
## [9] "plotOrder" "bbox" "proj4string"
The data slot contains a dataframe with the information comprised in each cell:
total
: number of points grouped in the cell.level
: scale of disaggregation of the cell.residual
: logical value indicating if the cell contains only residual points. Residual points are those that have been suppressed on the disaggregation process to get better accuracy, but can be grouped at the highest scale cell as it overcomes the given threshold. cellCode
: cell's size and the coordinates of the lower left cell corner in the ETRS89-LAEA system at the highest aggregation level.cellNum
: sequence of numbers indicating the position of the cell in the disaggregation scheme.names(bcn.QT)
## [1] "cellCode" "cellNum" "level" "residual" "total"
head(bcn.QT)
## An object of class "AQuadtree" with 6 grid cells with sizes between 1km and 125m
## cellCode cellNum level residual total
## 1 1kmN2064E3665 1 FALSE 313
## 2 1kmN2065E3666 1 FALSE 317
## 3 1kmN2066E3659 1 FALSE 109
## 4 1kmN2066E3660 1 FALSE 434
## 5 1kmN2066E3666 1 FALSE 919
## 6 1kmN2066E3667 1 FALSE 564
The package includes two SpatialPointsDataFrame
_ objects: _BarcelonaPop
_ for the city of Barcelona (Spain) and _CharlestonPop
_ for the Charleston, SC metropolitan area (USA). Both objects contain random point data with the distributions of real data acquired at census scale from different sources.
The package also provides two _SpatialPolygons
_ objects with the spatial boundaries for each region. _BarcelonaCensusTracts
_ and _CharlestonCensusTracts
contain, respectively, the census tracts spatial limits for the city of Barcelona, and the census tracts spatial limits for the Charleston, SC metropolitan area.
BarcelonaPop
comprises 81,359 sample points in the city of Barcelona, Spain. The original information was obtained from the statistics department of the Ajuntament de Barcelona, providing population data at the census tract level for the year 2018 [@AjuntamentdeBarcelona.DepartamentdEstadistica2018]. The points were generated and distributed randomly in space, maintaining unchanged the information at each census tract. To reduce the file size, only a sample of 7% of the points have been maintained.
data("BarcelonaPop", package = "AQuadtree")
summary(BarcelonaPop)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x 3655447 3669871
## y 2059179 2074546
## Is projected: TRUE
## proj4string :
## [+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs]
## Number of points: 81359
## Data attributes:
## age sex
## Min. : 0.00 man :38472
## 1st Qu.: 27.00 woman:42887
## Median : 43.00
## Mean : 43.94
## 3rd Qu.: 61.00
## Max. :100.00
In a similar way, the CharlestonPop
object, with 54,619 random sample points, was created using the information in the dataset Charleston1 from the 2000 Census Tract Data for the Charleston, SC metropolitan area (USA) [@GeodaDataandLab2019]. To reduce the file size, only a sample of 10% of the points have been maintained.
data("CharlestonPop", package = "AQuadtree")
summary(CharlestonPop)
## Object of class SpatialPointsDataFrame
## Coordinates:
## min max
## x 2047824.7 2187604.2
## y 102892.3 219129.3
## Is projected: TRUE
## proj4string :
## [+proj=lcc +lat_0=36.6666666666667 +lon_0=-98.5 +lat_1=38.5666666666667
## +lat_2=37.2666666666667 +x_0=400000 +y_0=400000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs]
## Number of points: 54619
## Data attributes:
## ID origin sex age
## 56 : 1274 asian : 761 male :26801 under16:12629
## 22 : 1248 black :16643 female:27818 16_65 :36314
## 59 : 1128 hisp : 1333 over65 : 5676
## 67 : 1076 multi_ra: 669
## 30 : 1032 white :35213
## 70 : 1019
## (Other):47842
Here is the output of session_info(“AQuadtree”) on the system on which this document was compiled:
devtools::session_info("AQuadtree")
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 3.3.2 (2016-10-31)
## os macOS 10.15.6
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate C
## ctype UTF-8
## tz Europe/Berlin
## date 2020-09-07
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## AQuadtree * 1.0.2 2020-09-07 [1] local
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 3.3.2)
## BH 1.62.0-1 2016-11-19 [2] CRAN (R 3.3.2)
## cli 2.0.1 2020-01-08 [2] CRAN (R 3.3.2)
## crayon 1.3.4 2017-09-16 [2] CRAN (R 3.3.2)
## digest 0.6.25 2020-02-23 [2] CRAN (R 3.3.2)
## dplyr * 0.8.4 2020-01-31 [2] CRAN (R 3.3.2)
## ellipsis 0.3.0 2019-09-20 [2] CRAN (R 3.3.2)
## fansi 0.4.1 2020-01-08 [2] CRAN (R 3.3.2)
## glue 1.3.1 2019-03-12 [2] CRAN (R 3.3.2)
## lattice 0.20-34 2016-09-06 [2] CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 [2] CRAN (R 3.3.0)
## pillar 1.4.3 2019-12-20 [2] CRAN (R 3.3.2)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 3.3.2)
## plogr 0.2.0 2018-03-25 [2] CRAN (R 3.3.2)
## purrr 0.3.3 2019-10-18 [2] CRAN (R 3.3.2)
## R6 2.2.0 2016-10-05 [2] CRAN (R 3.3.0)
## Rcpp 1.0.3 2019-11-08 [2] CRAN (R 3.3.2)
## rlang 0.4.4 2020-01-28 [2] CRAN (R 3.3.2)
## sp * 1.2-4 2016-12-22 [2] CRAN (R 3.3.2)
## tibble 2.1.3 2019-06-06 [2] CRAN (R 3.3.2)
## tidyselect 1.0.0 2020-01-27 [2] CRAN (R 3.3.2)
## utf8 1.1.4 2018-05-24 [2] CRAN (R 3.3.2)
## vctrs 0.2.3 2020-02-20 [2] CRAN (R 3.3.2)
##
## [1] /private/var/folders/cf/7cn764jj39x3yyjcpd863ml00000gn/T/Rtmp1EnIim/Rinsta4616f8556
## [2] /Library/Frameworks/R.framework/Versions/3.3/Resources/library