Introduction

The AQuadtree package provides an automatic aggregation tool to anonymise point data. The framework proposed seeks the data accuracy at the smallest possible areas preventing individual information disclosure. Aggregation and local suppression of point data is performed using a methodology based on hierarchical geographic data structures. The final result is a varying size grid adapted to local area population densities described in @Lagonigro2017.

The grid is created following the guidelines for grid datasets of the GEOSTAT project [@GEOSTAT1B2014] and the INSPIRE grid coding system is adopted as defined in the INSPIRE Data specifications [@INSPIRE2010]. Geospatial specifications use the European Terrestrial Reference System 89, Lambert Azimuthal Equal Area (ETRS89-LAEA) projection [@Annoni2003], although other Coordinate Reference Systems (CRS) and projections are also be used with the package. In the definition of the grid dataset, each cell is identified by a code composed of the cell's size and the coordinates of the lower left cell corner in the ETRS89-LAEA system. The cell's size is denoted in meters (“m”) for cells' sizes up to 1000 meters, or kilometers (“km”) for cells' sizes from 1000 meters and above. To reduce the length of the string, values for northing and easting are divided by 10n (where “n” is the number of zeros in the cell size value measured in meters).

The cell code “1kmN2599E4695“ identifies the 1km grid cell with coordinates of the lower left corner: Y=2599000m, X=4695000m.

The aggregation algorithm implemented in the package builds an initial regular grid of a given cell size, identifying each cell with the corresponding cell code. Each initial cell is recursively subdivided in quadrants where each new cell is assigned a second identifier containing a sequence of numbers to indicate the position of the cell in the disaggregation scheme. For instance, the sequence identifier corresponding to the right top cell in the right image in Figure \ref{fig:Figure 1} would be 416, i.e. fourth cell in the first division, and sixteenth cell in the second division.

$\label{fig:Figure 1}Three level quadtree splitting cell numbering example. Initial cell on the (left); first quadtree subdivision (center); second quadtree subdivision (right)$

To ensure data privacy, a cell is only split if all the resulting subdivisions satisfy the threshold restriction on the number of points. In cases of very irregular point pattern, this restriction results in less accuracy on the cell resolution. For instance, Figure \ref{fig:Figure 2}a presents a pattern of 932 points unevenly distributed on a 1km cell and Figure \ref{fig:Figure 2}b shows the corresponding grid of 62.5m cells with no threshold restrictions (the total number of points aggregated in each cell is shown).

$\label{fig:Figure 2}Set of spatial points (a) and the corresponding 62.5m grid with no threshold restrictions (b) (the numbers indicate the points aggregated in each cell).$ $\label{fig:Figure 2}Set of spatial points (a) and the corresponding 62.5m grid with no threshold restrictions (b) (the numbers indicate the points aggregated in each cell).$

If we define an anonymity threshold of 17, the cell in Figure \ref{fig:Figure 2}a can not be subdivided because one of the four resulting quadrants contains only 4 points. The privacy mechanism aggregates all the points, as presented in Figure \ref{fig:Figure 3}a, and covers an irregular spatial distribution. The AQuadtree algorithm contemplates the suppression of some points before continuing the disaggregation. For instance, suppressing the 4 points in the top right quadrant of Figure \ref{fig:Figure 2}b results in the disaggregation shown in Figure \ref{fig:Figure 3}b, which clearly is much more accurate to the underlying spatial distribution. Moreover, the elimination of more data points would lead to further disaggregation (Figure \ref{fig:Figure 3}c and Figure \ref{fig:Figure 3}d).

$\label{fig:Figure 3}Disaggregation examples with threshold value 17. No disaggregation and no loss (a); disaggregation with suppression of 4 points (b) ; more disaggregation with suppression of 12 points (c); maximum disaggregation with suppression of 29 points (d).$ $\label{fig:Figure 3}Disaggregation examples with threshold value 17. No disaggregation and no loss (a); disaggregation with suppression of 4 points (b) ; more disaggregation with suppression of 12 points (c); maximum disaggregation with suppression of 29 points (d).$ $\label{fig:Figure 3}Disaggregation examples with threshold value 17. No disaggregation and no loss (a); disaggregation with suppression of 4 points (b) ; more disaggregation with suppression of 12 points (c); maximum disaggregation with suppression of 29 points (d).$ $\label{fig:Figure 3}Disaggregation examples with threshold value 17. No disaggregation and no loss (a); disaggregation with suppression of 4 points (b) ; more disaggregation with suppression of 12 points (c); maximum disaggregation with suppression of 29 points (d).$

In order to balance information loss and resolution accuracy on the process of splitting a cell, the method computes the Theil inequality measure [@theil1972statistical] for the number of points in the possible quadrants as well as the percentage of points needed to be suppressed to force the division. In those cases where the anonymity threshold value prevents disaggregation, high values on the inequality measure may suggest the need for further subdivision, while high values on the loss rate may suggest to stop this subdivision. The algorithm uses default limits for both measures: 0.25 and 0.4. respectively (both values can be defined between 0 and 1). Thus, if there exists any sub-cell with a number of points lower than the anonymity threshold and the inequality measure is higher than 0.25, then the disaggregation process continues by suppressing those points as long as the loss rate is lower than 0.4. Hence, following with example in Figure \ref{fig:Figure 2}, the default disaggregation produced by the method would be the one shown in Figure \ref{fig:Figure 3}b.

All the suppressed points during the process are aggregated in a cell with the initial dimension so their information does not disappear. This cell is marked as a residual cell. Following with the example in Figure \ref{fig:Figure 2}, if the number of suppressed points overcome the anonymity threshold, as for instance, in Figure 3d, the 29 suppressed points are aggregated in a cell of the initial given dimension, which will be marked as a residual cell (see Figure \ref{fig:Figure 4}).

$\label{fig:Figure 4}Example of a residual cell.$

The AQuadtree Class

An AQuadtree class object is a spatial dataset representing a varying size grid and is created performing an aggregation of a given set of points considering a minimum threshold for the number of points in each cell. The AQuadtree main function of the package creates the AQuadtree object from SpatialPoints_ or _SpatialPointsDataFrame objects.

example.QT <- AQuadtree(CharlestonPop)
class(example.QT)
## [1] "AQuadtree"
## attr(,"package")
## [1] "AQuadtree"

The AQuadtree class proposes a collection of methods to manage the generated objects and overrides the generic methods show, _print, _summary_ and _[ (subsetting) for the AQuadtree signature. The plot_ method overrides the generic function for plotting R objects with an extra parameter to specify if residual cells should be plotted. The _spplot_ function overrides the lattice-based plot method from sp package [@Pebesma2005], with two extra parameters to control if residual cells should be displayed, and wether attributes should be divided by the cell areas to make different zones comparable. The _merge_ method merges data from an input data frame to the given AQuadtree object. The _writeOGR.QT_ method coerces the given AQuadtree object to a _SpatialPolygonsDataframe_ and uses the writeOGR method from rgdal package [@bivand2014rgdal] to write out spatial vector data. An AQuadtree object can also be coerced to a SpatialPolygonsDataFrame using the generic method _as from methods package.

bcn.QT <- AQuadtree(BarcelonaPop)
plot(bcn.QT)
spplot(bcn.QT, by.density = TRUE)

AQuadtree plot and spplot

Controlling the grid resolution

The characteristics of the AQuadtree object can be adjusted with various parameters. First, the dim_ parameter defines the size in meters of the highest scale cells and the _layers_ parameter indicates the number of disaggregation levels. Thus, specifying the parameters _dim=10000_ and _layers=4 would create a grid with cells of sizes between 10km and 1.25km. The default values establish an initial size of 1000 meters and 3 levels of disaggregation.

charleston.QT <- AQuadtree(CharlestonPop, dim = 10000, layers = 4)
summary(charleston.QT)
## Object of class "AQuadtree"
## 180 grid cells with sizes between 10km and 1.25km 
## Coordinates:
##       min     max
## x 2060000 2160000
## y  110000  220000
## Is projected: TRUE 
## proj4string:
## +proj=lcc +lat_0=36.6666666666667 +lon_0=-98.5 +lat_1=38.5666666666667
## +lat_2=37.2666666666667 +x_0=400000 +y_0=400000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0
## +units=m +no_defs
## Initial Cell Size: 10km 
## Number of valid grid Cells: 176 
## Number of residual grid Cells: 4 
## Data attributes:
##      total       
##  Min.   : 100.0  
##  1st Qu.: 157.8  
##  Median : 226.5  
##  Mean   : 292.2  
##  3rd Qu.: 370.2  
##  Max.   :2281.0

Summarizing data

The colnames_ parameter specifies the columns on the original dataset to summarize in the resulting grid. An extra attribute _total, containing the number of points in each cell is automatically created and added to the dataframe. On the aggregation process, attributes specified in _colnames parameter will be summarized using the 'sum' function. A list of alternative summarizing functions can can be provided with the _funs_ parameter. If any attribute indicated in the _colnames_ parameter is a factor, the function creates a new attribute for each label of the factor. For instance, an attribute sex with two labels, _man_ and _woman, would be deployed into the two attributes _sex.man and _sex.woman.

class(BarcelonaPop$sex)
## [1] "factor"
levels(BarcelonaPop$sex)
## [1] "man"   "woman"
bcn.QT <- AQuadtree(BarcelonaPop, colnames = names(BarcelonaPop), funs = c("mean", 
    "sum"))
summary(bcn.QT)
## Object of class "AQuadtree"
## 321 grid cells with sizes between 1km and 125m 
## Coordinates:
##       min     max
## x 3659000 3670000
## y 2062500 2074500
## Is projected: TRUE 
## proj4string:
## +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## Initial Cell Size: 1km 
## Number of valid grid Cells: 313 
## Number of residual grid Cells: 8 
## Data attributes:
##      total           age           sex.man        sex.woman    
##  Min.   : 100   Min.   :35.28   Min.   : 40.0   Min.   : 44.0  
##  1st Qu.: 139   1st Qu.:42.37   1st Qu.: 64.0   1st Qu.: 73.0  
##  Median : 177   Median :44.42   Median : 83.0   Median : 95.0  
##  Mean   : 248   Mean   :44.16   Mean   :117.1   Mean   :130.9  
##  3rd Qu.: 328   3rd Qu.:46.18   3rd Qu.:158.0   3rd Qu.:170.0  
##  Max.   :1288   Max.   :51.18   Max.   :626.0   Max.   :662.0

Specifying a threshold and threshold fields

The package applies a default anonymity threshold value of 100 and it can be changed with the threshold_ parameter. If nothing else is indicated, the threshold restriction is applied only to the total number of points aggregated in each cell (i.e. the _total_ attribute added to the resulting dataset). When some of the attributes include confidential information, the threshold restriction can be applied to various properties with the _thresholdField parameter, indicating the list of attributes from the resulting dataset that must satisfy that given threshold.

bcn.QT <- AQuadtree(BarcelonaPop, colnames = c("age", "sex"), funs = c("mean", "sum"), 
    threshold = 17, thresholdField = c("sex.man", "sex.woman"))
summary(bcn.QT)
## Object of class "AQuadtree"
## 730 grid cells with sizes between 1km and 62.5m 
## Coordinates:
##       min     max
## x 3659000 3670000
## y 2062000 2075000
## Is projected: TRUE 
## proj4string:
## +proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## Initial Cell Size: 1km 
## Number of valid grid Cells: 713 
## Number of residual grid Cells: 17 
## Data attributes:
##      total            age           sex.man         sex.woman     
##  Min.   : 34.0   Min.   :32.63   Min.   : 17.00   Min.   : 17.00  
##  1st Qu.: 64.0   1st Qu.:41.52   1st Qu.: 30.25   1st Qu.: 33.00  
##  Median :103.0   Median :43.87   Median : 49.00   Median : 54.00  
##  Mean   :110.5   Mean   :43.71   Mean   : 52.25   Mean   : 58.21  
##  3rd Qu.:140.0   3rd Qu.:46.08   3rd Qu.: 65.75   3rd Qu.: 74.00  
##  Max.   :807.0   Max.   :53.46   Max.   :371.00   Max.   :436.00

Balancing information loss and accuracy

In order to control the disaggregation process, two more parameters set the thresholds on the inequity and loss rate. The extra parameter ineq.threshold, a rate between 0 and 1, specifies a threshold to force disaggregation when there is high inequality between sub-cells. The Theil entropy measure as computed in the _ineq package [@zeileis2009package] is used to measure inequality for each cell. The _ineq.threshold_ parameter defaults to 0.25. Lower values in the _ineq.threshold produce grids with smaller cells (see Figure \ref{fig:Figure 6}).

bcn.QT <- AQuadtree(BarcelonaPop, threshold = 5, ineq.threshold = 0.01)
plot(bcn.QT)
bcn.QT <- AQuadtree(BarcelonaPop, threshold = 5, ineq.threshold = 0.5)
plot(bcn.QT)

$\label{fig:Figure 6}Examples of the effect of the ineq.threshold parameter.$ $\label{fig:Figure 6}Examples of the effect of the ineq.threshold parameter.$

On the other side, the parameter loss.threshold, also a rate between 0 and 1, indicates a rate of loss to prevent disaggregation of cells. A low value states that lower loss is preferred on the resulting grid so less disaggregation is obtained.

AQuadtree object structure

A call to the AQuadtree function will return an AQuadtree class object with six slots indicating the parameters used on the creation of the grid:

dim: scale in meters of the highest level cells
layers: number of subdivision levels
colnames: attribute names summarized in the resulting grid
threshold: the value used for anonymization
thresholdField: attribute names to which the threshold restriction has been applied
loss: number of points discarded during the process of disaggregation because of the threshold

bcn.QT <- AQuadtree(BarcelonaPop)
slotNames(bcn.QT)
##  [1] "dim"            "layers"         "colnames"       "threshold"     
##  [5] "thresholdField" "loss"           "data"           "polygons"      
##  [9] "plotOrder"      "bbox"           "proj4string"

The data slot contains a dataframe with the information comprised in each cell:

total: number of points grouped in the cell.
level: scale of disaggregation of the cell.
residual: logical value indicating if the cell contains only residual points. Residual points are those that have been suppressed on the disaggregation process to get better accuracy, but can be grouped at the highest scale cell as it overcomes the given threshold.
cellCode: cell's size and the coordinates of the lower left cell corner in the ETRS89-LAEA system at the highest aggregation level.
cellNum: sequence of numbers indicating the position of the cell in the disaggregation scheme.

names(bcn.QT)
## [1] "cellCode" "cellNum"  "level"    "residual" "total"
head(bcn.QT)
## An object of class "AQuadtree" with 6 grid cells with sizes between 1km and 125m 
##        cellCode cellNum level residual total
## 1 1kmN2064E3665             1    FALSE   313
## 2 1kmN2065E3666             1    FALSE   317
## 3 1kmN2066E3659             1    FALSE   109
## 4 1kmN2066E3660             1    FALSE   434
## 5 1kmN2066E3666             1    FALSE   919
## 6 1kmN2066E3667             1    FALSE   564

Provided data

The package includes two SpatialPointsDataFrame_ objects: _BarcelonaPop_ for the city of Barcelona (Spain) and _CharlestonPop_ for the Charleston, SC metropolitan area (USA). Both objects contain random point data with the distributions of real data acquired at census scale from different sources. The package also provides two _SpatialPolygons_ objects with the spatial boundaries for each region. _BarcelonaCensusTracts_ and _CharlestonCensusTracts contain, respectively, the census tracts spatial limits for the city of Barcelona, and the census tracts spatial limits for the Charleston, SC metropolitan area.

BarcelonaPop comprises 81,359 sample points in the city of Barcelona, Spain. The original information was obtained from the statistics department of the Ajuntament de Barcelona, providing population data at the census tract level for the year 2018 [@AjuntamentdeBarcelona.DepartamentdEstadistica2018]. The points were generated and distributed randomly in space, maintaining unchanged the information at each census tract. To reduce the file size, only a sample of 7% of the points have been maintained.

data("BarcelonaPop", package = "AQuadtree")
summary(BarcelonaPop)
## Object of class SpatialPointsDataFrame
## Coordinates:
##       min     max
## x 3655447 3669871
## y 2059179 2074546
## Is projected: TRUE 
## proj4string :
## [+proj=laea +lat_0=52 +lon_0=10 +x_0=4321000 +y_0=3210000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs]
## Number of points: 81359
## Data attributes:
##       age            sex       
##  Min.   :  0.00   man  :38472  
##  1st Qu.: 27.00   woman:42887  
##  Median : 43.00                
##  Mean   : 43.94                
##  3rd Qu.: 61.00                
##  Max.   :100.00

In a similar way, the CharlestonPop object, with 54,619 random sample points, was created using the information in the dataset Charleston1 from the 2000 Census Tract Data for the Charleston, SC metropolitan area (USA) [@GeodaDataandLab2019]. To reduce the file size, only a sample of 10% of the points have been maintained.

data("CharlestonPop", package = "AQuadtree")
summary(CharlestonPop)
## Object of class SpatialPointsDataFrame
## Coordinates:
##         min       max
## x 2047824.7 2187604.2
## y  102892.3  219129.3
## Is projected: TRUE 
## proj4string :
## [+proj=lcc +lat_0=36.6666666666667 +lon_0=-98.5 +lat_1=38.5666666666667
## +lat_2=37.2666666666667 +x_0=400000 +y_0=400000 +ellps=GRS80
## +towgs84=0,0,0,0,0,0,0 +units=m +no_defs]
## Number of points: 54619
## Data attributes:
##        ID             origin          sex             age       
##  56     : 1274   asian   :  761   male  :26801   under16:12629  
##  22     : 1248   black   :16643   female:27818   16_65  :36314  
##  59     : 1128   hisp    : 1333                  over65 : 5676  
##  67     : 1076   multi_ra:  669                                 
##  30     : 1032   white   :35213                                 
##  70     : 1019                                                  
##  (Other):47842

Session info

Here is the output of session_info(“AQuadtree”) on the system on which this document was compiled:

devtools::session_info("AQuadtree")
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  os       macOS  10.15.6              
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  C                           
##  ctype    UTF-8                       
##  tz       Europe/Berlin               
##  date     2020-09-07                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package    * version  date       lib source        
##  AQuadtree  * 1.0.2    2020-09-07 [1] local         
##  assertthat   0.2.1    2019-03-21 [2] CRAN (R 3.3.2)
##  BH           1.62.0-1 2016-11-19 [2] CRAN (R 3.3.2)
##  cli          2.0.1    2020-01-08 [2] CRAN (R 3.3.2)
##  crayon       1.3.4    2017-09-16 [2] CRAN (R 3.3.2)
##  digest       0.6.25   2020-02-23 [2] CRAN (R 3.3.2)
##  dplyr      * 0.8.4    2020-01-31 [2] CRAN (R 3.3.2)
##  ellipsis     0.3.0    2019-09-20 [2] CRAN (R 3.3.2)
##  fansi        0.4.1    2020-01-08 [2] CRAN (R 3.3.2)
##  glue         1.3.1    2019-03-12 [2] CRAN (R 3.3.2)
##  lattice      0.20-34  2016-09-06 [2] CRAN (R 3.3.0)
##  magrittr     1.5      2014-11-22 [2] CRAN (R 3.3.0)
##  pillar       1.4.3    2019-12-20 [2] CRAN (R 3.3.2)
##  pkgconfig    2.0.3    2019-09-22 [2] CRAN (R 3.3.2)
##  plogr        0.2.0    2018-03-25 [2] CRAN (R 3.3.2)
##  purrr        0.3.3    2019-10-18 [2] CRAN (R 3.3.2)
##  R6           2.2.0    2016-10-05 [2] CRAN (R 3.3.0)
##  Rcpp         1.0.3    2019-11-08 [2] CRAN (R 3.3.2)
##  rlang        0.4.4    2020-01-28 [2] CRAN (R 3.3.2)
##  sp         * 1.2-4    2016-12-22 [2] CRAN (R 3.3.2)
##  tibble       2.1.3    2019-06-06 [2] CRAN (R 3.3.2)
##  tidyselect   1.0.0    2020-01-27 [2] CRAN (R 3.3.2)
##  utf8         1.1.4    2018-05-24 [2] CRAN (R 3.3.2)
##  vctrs        0.2.3    2020-02-20 [2] CRAN (R 3.3.2)
## 
## [1] /private/var/folders/cf/7cn764jj39x3yyjcpd863ml00000gn/T/Rtmp1EnIim/Rinsta4616f8556
## [2] /Library/Frameworks/R.framework/Versions/3.3/Resources/library