ipumsr
Example - TerraThe IPUMS Terra project allows you to create datasets for your research that seamlessly combine area-level data, raster data and microdata and use whichever of these data formats is best for your analysis.
Raster data is data that describes the world on a grid - each cell in the grid gets its own value. Temperature is an example of data available in raster form from IPUMS Terra - satellite imagery allows geographers to create estimates of average temperature for approximately 1 km square grids across the globe.
Area level data is data that represents a particular area. In IPUMS Terra, the areas are political boundaries, like countries, states or provinces.
Microdata is data that describe each person separately. In IPUMS Terra, this data generally comes from census data from the IPUMS International project.
Learn more about the kinds of data in IPUMS Terra here: https://data.terrapop.org
ipumsr
helps you read this data into R so you can
continue your analysis.
library(ipumsr)
library(ggplot2)
library(dplyr)
library(sf)
This vignette focuses on the mechanics of getting spatial data from IPUMS Terra into R, but it glosses over the details about what you can do after you’ve loaded it. To learn more about these, here are some resources:
On this cold February day in Minnesota, I wonder how migration
between areas of the US is affected by the temperature and whether I see
any evidence of my preconception that older people retire to warmer
climates. As a quick aside - this analysis might be better served by
data from the IPUMS NHGIS project (see
vigntette("ipums-nhgis")
), because it often has more
granular geographic areas than IPUMS Terra, which which pulls from our
microdata projects, but I use Terra to show off the way it helps moving
between data types.
IPUMS Terra includes raster data on the average temperature 1950-2000 by month from the WorldClim dataset. It also converts area-level summaries of some variables from the US census microdata to raster grid, so we can get the total population and population by age on a grid.
For the example in this vignette, I use the sample 2010 US Census and variables:
Once we’ve made the extract and downloaded it, let’s check what’s
available in it using the function ipums_list_files()
.
ipums_list_files(raster_extract)
#> # A tibble: 7 × 2
#> type file
#> <chr> <chr>
#> 1 raster TEMPAVGFEBUS2013.tiff
#> 2 raster UnitedStatesStates-2010_rasterized_geoids.tiff
#> 3 raster UnitedStatesStates-2010-TOTPOP.tiff
#> 4 raster UnitedStatesStates-2010-POP6569.tiff
#> 5 raster UnitedStatesStates-2010-POP7074.tiff
#> 6 raster UnitedStatesStates-2010-POP7579.tiff
#> 7 raster UnitedStatesStates-2010-POP80.tiff
To load the data, ipumsr provides two functions:
read_terra_raster()
and
read_terra_raster_list()
. The first reads a single raster,
and the second loads 1 or more and puts them into a list.
Let’s first look at loading a single layer, by making a map of the
TEMPAVGFEB
layer (called TEMPAVGFEBUS2013.tiff
in the file, but we can refer to it with the matches()
function).
<- read_terra_raster(raster_extract, matches("TEMPAVGFEB"))
febtemp #> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.
The details of making good maps with raster data are beyond this
tutorial, (see below for some pointers on where to learn more), but just
to prove that we got the data, we can use the raster package’s built in
plot()
method.
# Crop map a bit
<- raster::extent(c(-180, -64, 16.5, 72.5))
zoomed_extent
::plot(febtemp / 10, ext = zoomed_extent)
rastertitle("Average temp (deg C) in Feb\n1950-2010 (US raster)")
We can also read multiple rasters and put them into a list object.
<- read_terra_raster_list(raster_extract)
all_rasters #> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.
Now we can make a map as before; this would be identical to the one above.
::plot(all_rasters[["TEMPAVGFEBUS2013"]] / 10, ext = zoomed_extent)
rastertitle("Average temp (deg C) in Feb\n1950-2010 (US raster)")
Again the details of spatial analysis of raster data in R are beyond the scope of this document (see below for some resources), but it one thing to keep in mind is that IPUMS Terra data, especially data from different sources (like rasterized area level data vs. the temperature data) may not have the same spatial extent (coverage area) or resolution, so you may need to do some conversion.
For example, to get the percent of the total population that is between ages 65-69, we can do this (because the data for total population and age-specific population have the same extent and resolution):
<- all_rasters[["UnitedStatesStates-2010-POP6569"]] /
pop_pct65 "UnitedStatesStates-2010-TOTPOP"]]
all_rasters[[
::plot(pop_pct65, ext = zoomed_extent)
rastertitle("Percent of population ages 65-69 in 2010\n(raseterized area)")
We can kind of see that the percent of population that is aged 65-69 is highest in a few places in warm places (in Arizona and Florida). To make this code run faster, I’ve only looked at 65-69 year olds, but really it’d be better to look at the percentage of the population that is 65+ by adding together all of the age categories. We leave that as an exercise for the reader
Also, comparing the population to the temperature on a grid level is
more difficult. One solution is to use the resample()
function from the raster package, which will use interpolation to put
the rasters on the same grid. This is beyond the scope of this
vignette.
IPUMS Terra includes data from both the raster datasets and microdata that have been aggregated up to a geographic boundary.
For the example in this vignette, I use the sample 2010 US Census and variables:
Also, be sure to include the boundary files in the extract.
Once we’ve made the extract and downloaded it, let’s check what’s
available in it using the function ipums_list_files()
.
ipums_list_files(area_extract)
#> # A tibble: 2 × 2
#> type file
#> <chr> <chr>
#> 1 data data_4618_IPUMS_US_SLAD_2010.csv
#> 2 shape boundaries_4618_IPUMS_US_SLAD_2010.zip
We can load the data and geography together using the functions:
read_terra_area_sf()
or
read_terra_area_sp()
(The first loads as an sf
object, while the second loads
them as the older SpatialPolygonsDataFrame
object)
Or we could load them separately using the functions:
read_terra_area()
and
read_ipums_sf()
/read_ipums_sp()
For now, we focus on reading them together as an sf
object, but the general idea of loading them is similar across all of
these functions.
<- read_terra_area_sf(area_extract)
terra_area #> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.
#> options: ENCODING=UTF-8
#> Reading layer `boundaries_4618_IPUMS_US_SLAD_2010' from data source
#> `C:\Users\derek\AppData\Local\Temp\Rtmp4Sx5Hj\file32e45c3e7806\boundaries_4618_IPUMS_US_SLAD_2010.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 2070 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -179.1526 ymin: 18.91236 xmax: 179.7758 ymax: 71.38875
#> Geodetic CRS: WGS 84
With the (currently) still in-development version of ggplot2, we can
use the function geom_sf()
to make maps of the sf
object.
Here’s a map of the average temperature in February, averaged over all rasters in a given geographic boundary:
<- terra_area %>%
terra_area mutate(feb_temp_c = TEMPAVGFEB_mean_GEO2_US2010_WORLDCLIM / 10)
ggplot(terra_area, aes(fill = feb_temp_c)) +
geom_sf(linetype = "blank") +
coord_sf(xlim = c(-180, -64), ylim = c(16.5, 72.5)) + # Crop out Alaska's East Hemisphere tail
ggtitle("Average temp (deg C) in Feb\n1950-2010 (US Geo2)")
Or we can look at a scatter comparing percentage of population older than 65 compared to average temperature in February.
<- terra_area %>%
terra_area mutate(
pct_pop_65 =
+ POP7074_GEO2_US2010_US2010A +
(POP6569_GEO2_US2010_US2010A + POP80_GEO2_US2010_US2010A) /
POP7579_GEO2_US2010_US2010A
(TOTPOP_GEO2_US2010_US2010A)
)
ggplot(terra_area, aes(x = pct_pop_65, y = feb_temp_c)) +
geom_point(alpha = 0.2) +
ggtitle("Average temperature against percent of\npopulation over 65 by US Geo 2")
#> Warning: Removed 1 rows containing missing values (geom_point).
Here we again see those retirement havens are in the warmer climates, though in general it doesn’t seem to be an incredibly strong relationship.
Finally, IPUMS Terra includes microdata with the area-level and raster data attached to it.
For the example in this vignette, I use the sample 2010 US Census and variables: - Microdata: - Household: GEO2_US2010, GEO1_US21010 - Person: AGE, MIGUS2 - Area Level: (None) - Raster: TEMPAVGFEB (mean, also year-specific and at lowest geographic level available)
Also, be sure to include the boundary files in the extract.
Once we’ve made the extract and downloaded it, let’s check what’s
available in it using the function ipums_list_files()
.
ipums_list_files(micro_extract)
#> # A tibble: 2 × 2
#> type file
#> <chr> <chr>
#> 1 data terrapop_extract_4621.csv.gz
#> 2 shape boundaries_4621_IPUMS_US_SLAD_2010.zip
For microdata, we should read the microdata using
read_terra_micro()
and the boundary files using either
read_ipums_sf()
or read_ipums_sp()
. We don’t
want to merge these two right away, because this would create a copy of
each polygon for every observation in the microdata. Instead, generally
we create area level summaries from the microdata and then merge them
together using the ipums_shape_*_join()
functions.
<- read_terra_micro(micro_extract)
terra_micro #> Use of data from IPUMS Terra is subject to conditions including that users
#> should cite the data appropriately. Use command `ipums_conditions()` for more
#> details.
<- read_ipums_sf(micro_extract)
terra_micro_shapes #> options: ENCODING=latin1
#> Reading layer `boundaries_4621_IPUMS_US_SLAD_2010' from data source
#> `C:\Users\derek\AppData\Local\Temp\Rtmp4Sx5Hj\file32e46cc376b9\boundaries_4621_IPUMS_US_SLAD_2010.shp'
#> using driver `ESRI Shapefile'
#> Simple feature collection with 2070 features and 2 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -179.1526 ymin: 18.91236 xmax: 179.7758 ymax: 71.38875
#> Geodetic CRS: WGS 84
Now we’ll create area level summaries for people older than 65 of what percent of the population has moved states in the past year.
<- terra_micro %>%
graph_data mutate(moved_states = MIGUS2 != GEO1_US2010) %>%
group_by(GEO2_US2010, age65plus = ifelse(AGE >= 65, "65 and older", "Under 65")) %>%
summarize(
moved_states = weighted.mean(moved_states, PERWT),
feb_temp_c = TEMPAVGFEB_mean_GEO2_US2010_WORLDCLIM[1] / 10
)#> `summarise()` has grouped output by 'GEO2_US2010'. You can override using the
#> `.groups` argument.
# ipums_shape_inner_join() will join the 2 datasets together, and keep
# only observations that are in both files. In this case, we only
# lose one observation that turns out to be the Great lakes.
<- ipums_shape_inner_join(
graph_data
graph_data,
terra_micro_shapes,by = c("GEO2_US2010" = "GEOID")
)#> Some observations were lost in the join (1 observations in the shape file). See
#> `join_failures(...)` for more details.
# Look at percent of population 65 and older who have moved in last year
ggplot(
%>% filter(age65plus == "65 and older"),
graph_data aes(fill = moved_states)
+
) geom_sf(linetype = "blank") +
coord_sf(xlim = c(-180, -64), ylim = c(16.5, 72.5)) + # Crop out Alaska's East Hemisphere tail
ggtitle("65 and older: percent moved from another\nstate in last year")
Looks like some parts of Florida and Arizona might be sticking out again, but it’s hard to say. Again, there’s lots more we could look at here, but I’ll leave that up to you!