The IPUMS microdata extract API is still in beta testing, and the interface for interacting with the API using ipumsr could still change in response to tester feedback. If you are interested in becoming a beta tester, email ipums+api@umn.edu.
To install the latest version of ipumsr from CRAN, use:
install.packages("ipumsr")
Since we are still actively developing the functions for interacting with the API, there may be changes in our GitHub repo that are not yet on CRAN. To install the development version of ipumsr from GitHub, use:
if (!require(remotes)) install.packages("remotes")
remotes::install_github("ipums/ipumsr/ipumsexamples")
remotes::install_github(
  "ipums/ipumsr",
  build_vignettes = TRUE,
  dependencies = TRUE
)
The IPUMS microdata extract API allows registered IPUMS USA and CPS users to define extracts, submit extract requests, and download extract files without visiting the IPUMS website. ipumsr includes functions that help R users interact with the extract API from their R session.
library(ipumsr)
library(dplyr) # not necessary to use API functions, but used in some examples
library(purrr) # not necessary to use API functions, but used in some examples
If you don’t have an IPUMS USA or IPUMS CPS account, you’ll need to register for access. Here’s where you can register for IPUMS USA, or register for IPUMS CPS. You’ll also need to request API beta access by emailing ipums+api@umn.edu, as mentioned above.
Once you’re registered, you’ll need to create an API key.
Once you’ve created an API key, you can choose to supply it as a
function argument whenever interacting with the API, or you can set the
value of the IPUMS_API_KEY
environment variable to your
key. The example code in this vignette assumes you have assigned your
key to this environment variable.
To set the value of the IPUMS_API_KEY
environment
variable for your current session, you can use:
set_ipums_api_key("paste-your-key-here")
To set your API key and save it for use in future sessions, use the same function, but with save set to TRUE:
set_ipums_api_key("paste-your-key-here", save = TRUE)
This will add your API key to a file named “.Renviron” in your user
home directory, so that the value of the IPUMS_API_KEY
environment variable is set when R starts up.
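If you want a script to fail early when no key is configured, a check along the following lines may help. This is a minimal sketch in base R; has_ipums_api_key() is a hypothetical helper written for this example, not an ipumsr function.

```r
# Hypothetical helper (not part of ipumsr): returns TRUE if the
# IPUMS_API_KEY environment variable is set to a non-empty value.
has_ipums_api_key <- function() {
  # Sys.getenv() returns "" for an unset variable
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!has_ipums_api_key()) {
  message("IPUMS_API_KEY is not set; see set_ipums_api_key().")
}
```

Keep in mind that a key saved to “.Renviron” is only read when R starts, so a session that was already running may need to be restarted before a check like this sees it.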
Each IPUMS data collection with API support has its own function for defining an extract. These functions have names of the form define_extract_<collection>(). Thus, to define an IPUMS USA extract, you use define_extract_usa(), and to define an IPUMS CPS extract, you use define_extract_cps().
All define_extract_<collection>() functions return an ipums_extract object, which can then be submitted using the submit_extract() function.
usa_extract_definition <- define_extract_usa(
  description = "USA extract for API vignette",
  samples = c("us2018a", "us2019a"),
  variables = c("AGE", "SEX", "RACE", "STATEFIP")
)
cps_extract_definition <- define_extract_cps(
  description = "CPS extract for API vignette",
  samples = c("cps1976_01s", "cps1976_02b"),
  variables = c("YEAR", "MISH", "CPSIDP", "AGE", "SEX", "RACE", "UH_SEX_B1")
)
For more details on the ipums_extract class, view the documentation page with ?`ipums_extract-class` (the backticks are needed because the topic name contains a hyphen).
Note that samples are specified using special sample ID codes, which can be browsed here on the IPUMS USA website, or here for IPUMS CPS.
To submit your extract, use:
submit_extract(usa_extract_definition)
Like the define_extract_<collection>() functions, submit_extract() returns an ipums_extract object. The returned object has been updated to include the extract number, so it is useful to save it by assigning it to a name, like this:
submitted_usa_extract <- submit_extract(usa_extract_definition)
That way, you can use the submitted_usa_extract object as input to check the extract’s status, as shown in the next section, or to reference the extract number:
submitted_usa_extract$number
To retrieve the latest status of an extract, you can use the
get_extract_info()
function.
get_extract_info()
returns an ipums_extract
object with the “status” element updated to reflect the latest status of
the extract, and the “download_links” element updated to include links
to any extract files that are available for download.
The “status” of a submitted extract is one of “queued”, “started”,
“produced”, “canceled”, “failed”, or “completed”. Only “completed”
extracts can be downloaded, but “completed” extracts older than 72 hours
may not be available for download, since extract files are removed after
that time (see discussion of the is_extract_ready()
function below).
If you assigned a name to the return value of submit_extract(), as shown above, you could get updated information on the extract, returned as an ipums_extract object, with:
submitted_usa_extract <- get_extract_info(submitted_usa_extract)
To print the latest status, you can use:
submitted_usa_extract$status
If you forget to capture the return value of submit_extract(), you can pull down an ipums_extract object containing all the information on your most recent extract for a given data collection with:
submitted_usa_extract <- get_last_extract_info("usa")
get_last_extract_info() is just a convenience wrapper around get_recent_extracts_info_list(), described below.
If you don’t have an ipums_extract
object in your
environment that describes the extract you’re interested in, and you
don’t want the most recent extract, you can also query the latest status
of an extract by supplying the name of the IPUMS data collection and
extract number of the extract, in one of two formats. Here’s how you’d
get the latest information on IPUMS CPS extract number 33:
cps_extract_33 <- get_extract_info("cps:33")
or
cps_extract_33 <- get_extract_info(c("cps", "33"))
Note that in the first format, there are no spaces before or after the colon, and that in both formats, there is no need to zero-pad the extract number – in other words, use “33”, not “00033”.
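To make the equivalence of the two formats concrete, here is a small base R sketch that normalizes either one into a collection name and an extract number. parse_extract_spec() is a hypothetical helper written for this illustration; it is not how ipumsr parses these specifications internally.

```r
# Hypothetical helper (not part of ipumsr): normalize either specification
# format into a list with a collection name and an integer extract number.
parse_extract_spec <- function(spec) {
  if (length(spec) == 1) {
    # "collection:number" format: split on the colon
    spec <- strsplit(spec, ":", fixed = TRUE)[[1]]
  }
  if (length(spec) != 2) {
    stop("Expected \"collection:number\" or c(\"collection\", \"number\")")
  }
  # as.integer() ignores leading zeros, so padding makes no difference here
  list(collection = spec[[1]], number = as.integer(spec[[2]]))
}

spec_a <- parse_extract_spec("cps:33")
spec_b <- parse_extract_spec(c("cps", "33"))
identical(spec_a, spec_b) # TRUE
```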
If you want R to periodically check the status of your extract, and only return an updated ipums_extract object once the extract is ready to download, you can use wait_for_extract(), as shown below:
downloadable_cps_extract <- wait_for_extract(cps_extract_33)
wait_for_extract() also accepts the same "collection:number" and c("collection", "number") specifications shown above:
downloadable_cps_extract <- wait_for_extract("cps:33")
or
downloadable_cps_extract <- wait_for_extract(c("cps", "33"))
wait_for_extract() will tie up your R session until the extract is ready to download, so it might not be the best option for large extracts that take a long time to produce, or for times when the IPUMS servers are busy. However, wait_for_extract() does offer a timeout_seconds argument to set the maximum number of seconds you want the function to wait. By default, that argument is set to 10,800 seconds (3 hours).
An alternative way to check whether your extract is ready to download
is using the is_extract_ready()
function. This function
accepts either an ipums_extract
object or a
"collection:number"
or
c("collection", "number")
specification, and returns a
single TRUE
or FALSE
value indicating whether
the extract is ready to be downloaded.
is_extract_ready(cps_extract_33)
is_extract_ready("cps:33")
is_extract_ready(c("cps", "33"))
As noted above, only extracts with status “completed” can be ready to
download, but not all “completed” extracts are ready to download,
because extract files are removed from IPUMS servers after 72 hours. The
is_extract_ready()
function checks whether an extract can
currently be downloaded by looking at the “download_links” element of
the extract object returned by the API.
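The distinction between “completed” and “ready to download” can be illustrated with plain lists standing in for ipums_extract objects. This is only a sketch of the idea described above: looks_ready() is a hypothetical function, the list contents are made up, and ipumsr’s actual implementation and object structure may differ.

```r
# Plain lists standing in for ipums_extract objects (made-up values)
fresh_extract <- list(
  status = "completed",
  download_links = list(data = "https://example.com/placeholder.dat.gz")
)
expired_extract <- list(status = "completed", download_links = list())

# Hypothetical stand-in for the readiness check described above:
# the extract must be completed AND still have download links.
looks_ready <- function(extract) {
  identical(extract$status, "completed") && length(extract$download_links) > 0
}

looks_ready(fresh_extract)   # TRUE
looks_ready(expired_extract) # FALSE: completed, but files no longer available
```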
Note that the API has a limit of 60 requests with the same API key
per minute, so you wouldn’t want to write a loop that repeatedly uses
is_extract_ready()
to check your extract status.
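If you do need to poll on your own rather than use wait_for_extract(), pausing between checks keeps the request rate low. The sketch below assumes a hypothetical helper, poll_until_ready(), that takes the check as a function argument; a stub is used in place of a real API call so the example runs offline.

```r
# Hypothetical helper (not part of ipumsr): call `check_fn` until it returns
# TRUE, pausing between calls, and give up after `max_attempts` checks.
poll_until_ready <- function(check_fn, interval_seconds = 30, max_attempts = 20) {
  for (attempt in seq_len(max_attempts)) {
    if (isTRUE(check_fn())) {
      return(TRUE)
    }
    # No need to sleep after the final attempt
    if (attempt < max_attempts) {
      Sys.sleep(interval_seconds)
    }
  }
  FALSE
}

# Stub that reports "ready" on its third call, so the sketch runs offline
calls <- 0
stub_check <- function() {
  calls <<- calls + 1
  calls >= 3
}
ready <- poll_until_ready(stub_check, interval_seconds = 0, max_attempts = 5)
```

With a real extract, check_fn could be something like function() is_extract_ready("cps:33"). At the default 30-second interval, this makes about two requests per minute, well under the limit of 60.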
Once your extract is ready to download, use the
download_extract()
function to download the data and DDI
codebook files to your computer. The download_extract()
function returns the path to the DDI codebook file, which can be used to
read in the downloaded data with ipumsr functions. By default, the
function will download files into your current working directory, but
alternative locations can be specified with the
download_dir
argument.
ddi_path <- download_extract(submitted_usa_extract)
ddi <- read_ipums_ddi(ddi_path)
data <- read_ipums_micro(ddi)
Or, using a "collection:number"
or
c("collection", "number")
specification:
ddi_path <- download_extract("cps:33")
ddi_path <- download_extract(c("cps", "33"))
ipumsr also includes convenience functions for revising a previous extract definition, facilitating a “revise and resubmit” workflow. Here’s how you would pull down the definition of USA extract number 33 and add a sample and a variable to it:
old_extract <- get_extract_info("usa:33")
new_extract <- add_to_extract(old_extract, samples = "us2020a", vars = "RELATE")
The add_to_extract()
function returns an
ipums_extract
object that has been modified as requested
and has been reset to an unsubmitted state, by stripping the extract
number, status, and download links from the original extract. The
revised extract can then be submitted with:
newly_submitted_extract <- submit_extract(new_extract)
To remove values from an extract, use remove_from_extract():
newer_extract <- remove_from_extract(new_extract, samples = "us2020a")
You can query the API for the details and status of recent extracts (the ten most recent, by default) using the functions get_recent_extracts_info_list() and get_recent_extracts_info_tbl(). The _list version of the function returns a list of ipums_extract objects, whereas the _tbl version returns a tibble (enhanced “data.frame”) in which each row contains information on one extract.
The list representation is useful if you want to be able to operate on elements as ipums_extract objects. For instance, to retrieve your second-most-recent extract and revise it for resubmission, you could use:
second_most_recent_extract <- get_recent_extracts_info_list("usa")[[2]]
revised_extract <- revise_extract_micro(
  second_most_recent_extract,
  samples_to_add = "us2010a"
)
Or to download all recent extracts that are ready to download, using purrr::keep() and purrr::map_chr():
ddi_paths <- get_recent_extracts_info_list("usa") %>%
  keep(is_extract_ready) %>%
  map_chr(download_extract)
The tibble representation is useful if you want to use functions for manipulating data.frames to find recent extracts matching particular criteria.
recent_usa_extracts_tbl <- get_recent_extracts_info_tbl("usa")
For example, to find extracts with descriptions including the word “occupation”, you could use:
recent_usa_extracts_tbl %>%
  filter(grepl("occupation", description))
Filtering on properties such as “samples” or “variables” is a little
more complex, because these are stored in list columns, but it is
possible. For example, to find extracts including the variable “AGE”,
you could use purrr::map_lgl()
like this:
recent_usa_extracts_tbl %>%
  filter(map_lgl(variables, ~"AGE" %in% .x))
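The same list-column test can be written in base R, which may make the logic easier to see. The sketch below uses a toy data frame with made-up extract numbers and variables rather than a real API response; the names toy_extracts, has_age, and extracts_with_age are illustrative only.

```r
# Toy data frame with made-up extract numbers and a list column of variables
toy_extracts <- data.frame(number = 1:3)
toy_extracts$variables <- list(
  c("AGE", "SEX"),
  c("RACE", "STATEFIP"),
  c("YEAR", "AGE")
)

# One TRUE/FALSE per row: does that row's variable vector contain "AGE"?
has_age <- vapply(toy_extracts$variables, function(v) "AGE" %in% v, logical(1))

# Keep only the rows whose variables include "AGE"
extracts_with_age <- toy_extracts[has_age, ]
extracts_with_age$number # 1 3
```

Here vapply() plays the role of purrr::map_lgl(): each element of the list column is a character vector, and the test is applied to one element at a time.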
To convert between these two representations, ipumsr provides the functions extract_list_to_tbl() and extract_tbl_to_list(), such that the following is TRUE:
identical(
extract_list_to_tbl(get_recent_extracts_info_list("usa")),
get_recent_extracts_info_tbl("usa")
)
The return values of the functions to interact with the API are configured in such a way that you can define, submit, wait for, download, and read in your extract all in one piped expression:
data <- define_extract_usa(
  "USA extract for API vignette",
  c("us2018a", "us2019a"),
  c("AGE", "SEX", "RACE", "STATEFIP")
) %>%
  submit_extract() %>%
  wait_for_extract() %>%
  download_extract() %>%
  read_ipums_micro()