Using the censusxy Package

Branson Fox and Christopher Prener, Ph.D.

2022-05-29

Overview

The censusxy package is designed to provide easy access to the U.S. Census Bureau Geocoding Tools in R.

Motivation

There do not exist many packages for free or reproducible geocoding in the R environment. However, the Census Bureau Geocoding Tools allow for both unlimited free geocoding as well as an added level of reproducibility compared to commercial geocoders. Many geospatial workflows involve a large quantity of addresses, hence our core focus is on batch geocoding.

Responsible Use

The U.S. Census Bureau makes their geocoding API available without any API key, and this package allows for virtually unlimited batch geocoding. Please use this package responsibly, as others will need use of this API for their research.

Installation

Installing censusxy

The easiest way to get censusxy is to install it from CRAN:

install.packages("censusxy")

Alternatively, the development version of censusxy can be accessed from GitHub with remotes:

# install.packages("remotes")
remotes::install_github("chris-prener/censusxy")

Installing Suggested Dependencies

Since the package does not need sf for its basic functionality, it is a suggested dependency rather than a required one. However, many users will want to map these data as sf objects, and we therefore recommend users install sf. Windows and macOS users should be able to install sf without significant issues unless they are building from source. Linux users will need to install several open source spatial libraries to get sf itself up and running.

If you want to use these sf, you can either install it individually (faster) or install all of the suggested dependencies at once (slower, will also give you a number of other packages you may or may not want):

## install sf only
install.packages("sf")

## install all suggested dependencies
install.packages("censusxy", dependencies = TRUE)

Geocoding Considerations

censusxy Performance

Server Availability

One of the biggest drawbacks of the Census Bureau’s geocoders is their availability and speed. Users of censusxy have reported issues to us where censusxy appears not to work on their computers. Generally speaking, the issues are with servers on the Census Bureau’s end. If you notice that the package does not seem to be working, take a break for a few hours and try again. You will hopefully find that your code, with no changes, executes correctly and returns geocoded data. If you notice continued issues, please do not hesitate to open an issue for us to investigate.

Processing Speed

Another performance issue with censusxy relates to speed. The issue, generally speaking, is not within the package itself but rather with external factors. These include your computer’s processing power, the quality of your internet connection, and - most importantly - the speed that the Census Bureau’s servers process your request. We have taken two steps to improve censusxy’s performance over using the Census Bureau’s geocoding web interface. The first is that we extract unique addresses in your data and only send those for geocoding. Thus, a large geocoding job may shrink considerably if it contains repetitive addresses.

The second step we take is to break your data into “chunks” if you have more than 1,000 records. This is below the maximum number of rows that the Census Bureau accepts for geocoding. censusxy sends these as separate geocoding requests before putting your data back together when geocoding is completed. One of the biggest things you can do to improve the performance of censusxy is to utilize our optional parallel processing features. When parallel processing is used, these chunks of data can be sent out simultaneously for geocoding.

Data Quality

A third performance issue with censusxy relates to the quality of input data. Street addresses that are not formatted according to USPS standards may cause an entire “chunk” of data to be returned without being geocoded. If you notice many NA values for location, check a few with the single line geocoder:

censusxy::cxy_single("3700 Lindell Blvd", "St. Louis", "MO", 63139)

If you get invalid data back, or an error like The server, while working as a gateway to get a response needed to handle the request, got an invalid response, there is something wrong with your specific address. One thing to watch for are non-ASCII characters, such as Lindèll in the example below:

censusxy::cxy_single("3700 Lindèll Blvd", "St. Louis", "MO", 63139)

This will return the error described above or an NA for cxy_status if you are using cxy_geocode().

If you get valid data back from the API with cxy_single(), it is more than likely that another address in the same “chunk” has a formatting problem that is causing errors. Pre-processing your data to eliminate problematic addresses (such as those that contain point of interest information or non-ASCII characters) can help eliminate this problem.

Another strategy is to break up your data into smaller chunks prior to sending it to be geocoded. Then, geocode these chunks individually to try and isolate the problematic addresses. For example, we can use the following base functions to create chunks of \(n = 200\) observations:

batches <- split(stl_homicides, (seq(nrow(stl_homicides))-1) %/% 200 )

These could then be passed individually to cxy_geocode() using batches[[1]] and so on.

Picking a Suitable API

The Census Geocoder contains 4 primary functions, 3 for single address geocoding, and 1 for batch geocoding. For interactive use cases, a Shiny application for example, the single line geocoder is recommended. For large quantities of addresses, the batch endpoint is favorable. If your use case is locating coordinates within census geometries, only a single coordinate function is available for this task.

Picking a Return Type

If you are interested in census geometries (composed of FIPS codes for state, county, tract and block), you should specify ‘geographies’ in the return argument. This also necessitates the use of a vintage.

Consider the Benchmark and Vintage

Vintage is only important to consider if you would like Census geographies to accompany your coordinate data. It has no impact on geocoding coordinates (location). You can obtain a data frame of valid benchmarks and vintages with their respective functions. For vintages, you must supply the name or ID of the benchmark you have chosen.

# return benchmarks
cxy_benchmarks()

# return vintages
cxy_vintages(benchmark = 'Public_AR_Current')

If you would like to return data including Census geographies, set return to "geographies".

Setting a specific benchmark can also be used to reproduce geocoding results. Changes to the data the Census Bureau supplies for geocoding may mean that the specific coordinates returned for a given address are modified. Using the latest benchmark may mean that addresses geocoded earlier are returned at a different coordinate. Using benchmarks allows you replicate earlier geocodes.

If you are not concerned about Census geographies or being able to reproduce the exact coordinates for each supplied, the functions will default to the latest benchmark.

Parallelization

This is an optional set of tools to take advantage of multiple threads if your computer has a multi-core processor. The parallel argument can be used to specify the number of cores you would like to dedicate to geocoding tasks. Once you execute the function, your batch geocoding request will be automatically distributed across your cores to return results faster. The function will not allow you to specify more cores than are available, and will instead default to the maximum number of available cores.

Since this is optional, there are additional dependencies that are not automatically installed. See the installation instructions above for details on ensuring your R environment is ready for parallel processing.

Selection a Class and Output

There are two options for class, which governs the type of object censusxy returns. By default, a "dataframe" is returned. However, users may optionally set class to "sf", which will return the results as an sf object. Doing so allows for an immediate preview of your data on a map and for these data to be exported in a geospatial format, like .geojson or .shp. However, returning an sf object will mean that censusxy can only return addresses successfully matched by the geocoder. A helpful message denoting how many rows were removed will print in the console.

Since returning an sf object is optional, there are additional dependencies that are not automatically installed. See the installation instructions above for details on ensuring your R environment is ready for parallel processing.

You may also specify output as "simple" or "full". Simple returns only coordinates (and a GEOID if return = "geographies") and this is suitable for most use cases. If you desire all of the raw output from the geocoder, please specify full instead.

Setting a Timeout Parameter

The function contains an argument for timeout, which specifies how many minutes until the API query ends as an error. In this implementation, it is per 1000 addresses, not the whole batch size. It is set to default at 30 minutes, which should be appropriate for most internet speeds.

If a batch times out, the function will terminate, and you will lose any geocoding progress.

Be cautious that batches taking a long time may allow your computer to sleep, which may cause a batch to never return. Turning off sleep behavior for your operating system while geocoding is recommended.

Usage

Importing Data

If you plan on using the single address geocoding tools, your data do not need to be in any specific class. To use the batch geocoder, your data must be in a data.frame (or equivalent class). This package provides data on homicides in St Louis City between 2008-2018 as example data in two forms - a small table with 24 addresses (stl_homicides_small) and a larger table with more records (stl_homicides)

# load package
library(censusxy)

# load sample data
df <- stl_homicides_small

Parsing Addresses

For use of the batch API, your address data needs to be structured. Meaning, your data contains separate columns for street address, city, state and zip-code. You may find the postmastr package useful for this task, though it remains early in its development life cycle. Only street address is mandatory, but omission of city, state or zip code drastically lowers the speed and accuracy of the batch geocoder.

Batch Geocoding

In this example, we will use the included stl_homicides_small data, which we loaded above and assigned to the object df, show the full process for batch geocoding with cxy_geocode():

df_sf <- cxy_geocode(df, address = "street_address", 
                     city = "city", 
                     state = "state", 
                     zip = "postal_code", 
                     class = "dataframe")

Since we have selected "dataframe" for our class, all input addressed will be returned. If you use class = "sf" (and have sf installed), you will receive only matched addresses.

df_sf <- cxy_geocode(df, address = "street_address", 
                     city = "city", 
                     state = "state", 
                     zip = "postal_code", 
                     class = "sf")

Though sf objects only returned matched addresses, one advantage of using this optional set of features is that output returned as an sf object can be previewed with a package like mapview:

mapview::mapview(df_sf)

Note that, like sf and the parallel processing dependencies, mapview is a suggested package and must be installed separately (or by using install.packages("censusxy", dependencies = TRUE)).

Single Address Tools

If you have individual addresses you would like to geocode, there are two additional functions to be aware. cxy_single() can geocode a single address that has already been parsed into separate entries for street, city, state, and zip:

cxy_single(street = "20 N Grand Blvd", city = "St. Louis", state = "MO", zip = 63103)

Alternatively, you can pass un-parsed addresses using cxy_oneline(). This example returns Census geography data along with the default location information:

cxy_oneline(address = "3700 Lindell Blvd, St. Louis, MO 63108", return = "geographies", vintage = "Current_Current")

Coordinate Pairs

Finally, if you know pairs of coordinates but would like to find out what Census geographies they fall within, you can use cxy_geography():

cxy_geography(lon = -90.23324, lat = 38.63593)

This function is useful for a small number of addresses, but returns a significant amount of data for each point location. If you have a large number of points and need a specific value like county or congressional district, using tigris to obtain those data and then sf::st_intersection() to perform a spatial join will be a more effective solution that performs this task faster and results in clearer output.

Iteration

For a handful of addresses, you may want to iterate using these functions. Two examples using base R are provided here:

addresses <- c("20 N Grand Blvd, St. Louis MO 63103", "3700 Lindell Blvd, St. Louis, MO 63108")

# with a loop
geocodes <- vector("list", length = 2)
for (i in seq_along(addresses)){
  geocodes[[i]] <- cxy_oneline(addresses[i])
}

# with apply
geocodes <- lapply(addresses, cxy_oneline)

Getting Help