Browsing data on the IPUMS website can be a little like grocery shopping when you’re hungry — you show up to grab a couple things, but everything looks so good, and you end up with an overflowing cart1. When you do this with IPUMS data, unfortunately, sometimes your extract may get so large that it doesn’t fit in your computer’s memory.
If you’ve got an extract that’s too big, both the IPUMS website and the ipumsr package have tools to help. There are four basic strategies:
The IPUMS website has features to help with option 2, and the ipumsr package can help you with options 3 and 4 (option 1 depends on your wallet).
The examples in this vignette will rely on the ipumsr, dplyr and
biglm packages, and the example CPS extract used in the
ipums-cps
vignette. If you want to follow along, you should
follow the instructions in that vignette to make an extract.
library(ipumsr)
library(dplyr)
# To run the full vignette you'll also need the following packages:
<- requireNamespace("biglm")
installed_biglm <- requireNamespace("DBI") &
installed_db_pkgs requireNamespace("RSQLite") &
requireNamespace("dbplyr")
# Change these filepaths to the filepaths of your downloaded extract
<- "cps_00001.xml"
cps_ddi_file <- "cps_00001.dat" cps_data_file
If you’ve got a dataset that’s too big for your RAM, you could always get more. You could accomplish this by upgrading your current computer, getting a new one, or paying a cloud service like Amazon or Microsoft Azure (or one of the many other similar services). Here are guides for using R on Amazon and Microsoft Azure.
The IPUMS website has many features that will let you reduce the size of your extract. The easiest thing to do is to review your sample and variable selections to see if you can drop some.
If you do need every sample and variable, but your analysis is on a specific subset of the data, the IPUMS extract engine has a feature called “Select Cases”, where you can subset on an included variable (for example you could subset on AGE so that your extract only includes those older than 65, or subset on EDUCATION to look at only college graduates). In most IPUMS microdata projects, the select cases feature is on the “Create Extract” page, as the last step before you submit the extract. If you’ve already submitted the extract, you can click the “revise” link on the “Download or Revise Extracts” page to access the “Select Cases” feature.
Or, if you would be happy with a random subsample of the data, the IPUMS extract engine has an option to “Customize Sample Size” that will take a random sample. This feature is also available on the “Create Extract” page, as the last step before you submit the extract. Again, if you’ve already submitted your extract, you can access this feature by clicking the “revise” link on the “Download or Revise Extracts” page.
ipumsr has two closely related concepts for reading parts of a function at a time. The “chunk” API inherits directly from the readr framework for reading chunks of data. These functions allow you to specify a function that will be called on each chunk and then how you’d like to combine them at the end. The “yield” API is unique to ipumsr (and fixed-width data), but allows a little more flexibility because control is returned to you between each piece of data.
ipumsr has “chunked” versions of the microdata reading functions
(read_ipums_micro_chunked()
and
read_ipums_micro_list_chunked()
). These chunked versions of
the functions allow you to specify a function that will be applied to
each chunk, and then also control how the results from these chunks are
combined. This functionality is based on the chunked functionality
introduced by readr
and so is quite flexible. Below, we’ll
outline solutions to three common use-cases for IPUMS data: tabulation,
regression and selecting cases.
Let’s say you want to find the percent of people in the workforce by their self-reported health. Since this extract is small enough to fit in memory, we could just do the following:
read_ipums_micro(
data_file = cps_data_file, verbose = FALSE
cps_ddi_file, %>%
) mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
%>%
) as_factor()
%>%
) group_by(HEALTH, AT_WORK) %>%
summarize(n = n(), .groups = "drop")
#> # A tibble: 10 × 3
#> HEALTH AT_WORK n
#> <fct> <fct> <int>
#> 1 Excellent No 40582
#> 2 Excellent Yes 28071
#> 3 Very good No 32367
#> 4 Very good Yes 32947
#> 5 Good No 26726
#> 6 Good Yes 22483
#> 7 Fair No 11089
#> 8 Fair Yes 4520
#> 9 Poor No 5418
#> 10 Poor Yes 780
But let’s pretend like we can only store 1,000 rows at a time. In this case, we need to use a chunked function, tabulate for each chunk, and then calculate the counts across all of the chunks.
First we’ll make the callback function, which will take two arguments: x (the data from a chunk) and pos (the position of the chunk, expressed as the line in the input file at which the chunk starts). We’ll only use x, but the callback function must always take both these arguments.
<- function(x, pos) {
cb_function %>% mutate(
x HEALTH = as_factor(HEALTH),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
%>%
) as_factor()
%>%
) group_by(HEALTH, AT_WORK) %>%
summarize(n = n(), .groups = "drop")
}
Next we need to create a callback object. The choice of a callback object depends mainly on how we want to combine the results from applying our callback function to each chunk. The tree main types of callbacks that apply IPUMS variable metadata are:
IpumsDataFrameCallback
- Combine the results from each
chunk together by row binding them together.IpumsListCallback
- Use when you don’t want to (or
can’t) immediately combine the results, so it returns a list with one
item per chunk containing the results.IpumsSideEffectCallback
- Use when you are calling the
function primarily for side effects, and do not need the results at all
(for example if you are saving each chunk to disk).(The fourth, which is used for running lm regression is discussed later in this document).
In this case, we want to row-bind the data.frames returned by
cb_function()
, so we use
IpumsDataFrameCallback
.
Callback objects are R6 objects, but you
don’t need to be familiar with R6 to use them2. For now, all we
really need to know is that to create a callback, we use
$new()
syntax.
<- IpumsDataFrameCallback$new(cb_function) cb
Next we read in the data with the
read_ipums_micro_chunked()
function, specifying the
callback and that we want the chunk_size
to be 1000.
<- read_ipums_micro_chunked(
chunked_tabulations data_file = cps_data_file, verbose = FALSE,
cps_ddi_file, callback = cb, chunk_size = 1000
)
Now we have a data.frame with the counts by health and work status within each chunk. To get the full table, we just need to sum by health and work status one more time.
%>%
chunked_tabulations group_by(HEALTH, AT_WORK) %>%
summarize(n = sum(n), .groups = "drop")
#> # A tibble: 10 × 3
#> HEALTH AT_WORK n
#> <fct> <fct> <int>
#> 1 Excellent No 40582
#> 2 Excellent Yes 28071
#> 3 Very good No 32367
#> 4 Very good Yes 32947
#> 5 Good No 26726
#> 6 Good Yes 22483
#> 7 Fair No 11089
#> 8 Fair Yes 4520
#> 9 Poor No 5418
#> 10 Poor Yes 780
With the biglm package, it is possible to use R to perform a
regression on data that is too large to store in memory all at once. The
ipumsr package provides a callback designed to make this simple:
IpumsBiglmCallback
.
Again we’ll use the CPS example, which is small enough that we can keep it in memory. Here’s an example of a regression looking at how hours worked, self-reported health and age are related among those who are currently working. This is meant as a simple example, and ignores many of the complexities in this relationship, so please use caution when interpreting.
# Read in data
<- read_ipums_micro(
data data_file = cps_data_file, verbose = FALSE
cps_ddi_file,
)
# Prepare data for model
# (age has been capped at 99, which we assume is high enough to not
# cause any problems so we leave it.)
<- data %>%
data mutate(
HEALTH = as_factor(HEALTH),
AHRSWORKT = lbl_na_if(AHRSWORKT, ~.lbl == "NIU (Not in universe)"),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
%>%
) as_factor()
%>%
) filter(AT_WORK == "Yes")
# Run regression
<- lm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data)
model summary(model)
#>
#> Call:
#> lm(formula = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -41.230 -4.949 -0.080 5.945 75.697
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 5.626953 0.367851 15.297 < 2e-16 ***
#> AGE 1.568287 0.017790 88.156 < 2e-16 ***
#> I(AGE^2) -0.016798 0.000204 -82.338 < 2e-16 ***
#> HEALTHVery good -0.280826 0.104433 -2.689 0.00717 **
#> HEALTHGood -1.275358 0.115861 -11.008 < 2e-16 ***
#> HEALTHFair -3.614487 0.207121 -17.451 < 2e-16 ***
#> HEALTHPoor -5.732656 0.465751 -12.308 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 12.8 on 88794 degrees of freedom
#> Multiple R-squared: 0.08886, Adjusted R-squared: 0.08879
#> F-statistic: 1443 on 6 and 88794 DF, p-value: < 2.2e-16
To do the same regression, but with only 1000 rows loaded at a time, we work in a similar manner.
First we make the IpumsBiglmCallback
callback object
that specifies both the model and a function to prepare the data.
<- IpumsBiglmCallback$new(
biglm_cb model = AHRSWORKT ~ AGE + I(AGE^2) + HEALTH,
prep = function(x, pos) {
%>%
x mutate(
HEALTH = as_factor(HEALTH),
AHRSWORKT = lbl_na_if(AHRSWORKT, ~.lbl == "NIU (Not in universe)"),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
%>%
) as_factor()
%>%
) filter(AT_WORK == "Yes")
} )
And then we read the data using
read_ipums_micro_chunked()
, passing the callback that we
just made.
<- read_ipums_micro_chunked(
chunked_model data_file = cps_data_file, verbose = FALSE,
cps_ddi_file, callback = biglm_cb, chunk_size = 1000
)
summary(chunked_model)
#> Large data regression model: biglm(AHRSWORKT ~ AGE + I(AGE^2) + HEALTH, data, ...)
#> Sample size = 88801
#> Coef (95% CI) SE p
#> (Intercept) 5.6270 4.8913 6.3627 0.3679 0.0000
#> AGE 1.5683 1.5327 1.6039 0.0178 0.0000
#> I(AGE^2) -0.0168 -0.0172 -0.0164 0.0002 0.0000
#> HEALTHVery good -0.2808 -0.4897 -0.0720 0.1044 0.0072
#> HEALTHGood -1.2754 -1.5071 -1.0436 0.1159 0.0000
#> HEALTHFair -3.6145 -4.0287 -3.2002 0.2071 0.0000
#> HEALTHPoor -5.7327 -6.6642 -4.8012 0.4658 0.0000
Sometimes you may want to select a subset of the data before reading
it in. The IPUMS website has this functionality built in, which can be a
faster way to do this (this “select cases” functionality is described in
the second section above). Also, Unix commands like awk
and
sed
will generally be much faster than these R based
solutions. However, it is possible to use the chunked functions to
create a subset, which can be convenient if you want to subset on some
complex logic that would be hard to code into the IPUMS extract system
or Unix tools.
# Subset only those in "Poor" health
<- read_ipums_micro_chunked(
chunked_subset data_file = cps_data_file, verbose = FALSE,
cps_ddi_file, callback = IpumsDataFrameCallback$new(function(x, pos) {
filter(x, HEALTH == 5)
}), chunk_size = 1000
)
ipumsr now has a second form of chunked reading called ‘yields’ which accomplish a similar goal as chunks, but are more flexible. They grant you more freedom in what R code you run between reading chunks, including the ability to have multiple files open at once. Additionally, yields are compatible with the bigglm function which allows you to run glm models on data larger than memory (as opposed to plain old lm’s that you can run with chunks). The downside to this greater control is that yields have an API that is unique to IPUMS data and the way they work is unusual for R code.
To see how the yield functions work as compared to the chunked functions, here’s the same chunked tabulation example that was shown earlier in this document, but using yields rather than chunks. We want to calculate the percent of people in the workforce by their self-reported health status.
First we create the yield object with the function
read_ipums_micro_yield()
– the arguments are similar to
read_ipums_micro()
.
<- read_ipums_micro_yield(
data
cps_ddi_file, data_file = cps_data_file,
verbose = FALSE
)
This function returns an R6 object (like the callbacks for the
chunked data) which contains methods for reading the data. The most
important method is the yield(n)
function which will return
n rows of the data.frame. In this example, we’ll also use
is_done()
, which tells us whether we’ve finished reading
the data yet.
<- tibble(
yield_results HEALTH = factor(levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
AT_WORK = factor(levels = c("No", "Yes")),
n = integer(0)
)while (!data$is_done()) {
<- data$yield(n = 1000) %>%
new mutate(
HEALTH = as_factor(HEALTH),
AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
%>%
) as_factor()
%>%
) group_by(HEALTH, AT_WORK) %>%
summarize(n = n())
<- bind_rows(yield_results, new) %>%
yield_results group_by(HEALTH, AT_WORK) %>%
summarize(n = sum(n), .groups = "drop")
}#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'HEALTH'. You can override using the
#> `.groups` argument.
yield_results#> # A tibble: 10 × 3
#> HEALTH AT_WORK n
#> <fct> <fct> <int>
#> 1 Excellent No 40582
#> 2 Excellent Yes 28071
#> 3 Very good No 32367
#> 4 Very good Yes 32947
#> 5 Good No 26726
#> 6 Good Yes 22483
#> 7 Fair No 11089
#> 8 Fair Yes 4520
#> 9 Poor No 5418
#> 10 Poor Yes 780
One of the major benefits of the yield-style reading over chunks is that it is compatible with the GLM functions from biglm, allowing more complicated models than just linear models. Logistic models are one such model that requires GLM methods. The same caution as the chunked example applies - this is meant as a simple example model, please interpret with caution.
To run a logistic model we start by loading the data in the same way as the frequency example above:
<- read_ipums_micro_yield(
data
cps_ddi_file, data_file = cps_data_file,
verbose = FALSE
)
Next we make a function that takes a single argument
reset
. When it is FALSE
, it returns the next
chunk of data (and formats it for our model). When reset is
TRUE
, it resets the data to the beginning. To create this
function, we use the the reset()
method from the yield
object.
<- function(reset) {
get_model_data if (reset) {
$reset()
dataelse {
} <- data$yield(n = 1000) # Set n pretty low for example
yield if (is.null(yield)) return(yield)
<- yield %>%
out mutate(
HEALTH = as_factor(HEALTH),
WORK30PLUS = lbl_na_if(AHRSWORKT, ~.lbl == "NIU (Not in universe)") %>%
>= 30},
{. AT_WORK = EMPSTAT %>%
lbl_relabel(
lbl(1, "Yes") ~ .lbl == "At work",
lbl(0, "No") ~ .lbl != "At work"
%>%
) as_factor()
%>%
) filter(AT_WORK == "Yes")
return(out)
} }
Finally we feed this function and a model specification to the
bigglm()
function:
library(biglm)
#> Loading required package: DBI
<- bigglm(
results ~ AGE + I(AGE^2) + HEALTH,
WORK30PLUS family = binomial(link = "logit"),
data = get_model_data
)
summary(results)
#> Large data regression model: bigglm(WORK30PLUS ~ AGE + I(AGE^2) + HEALTH, family = binomial(link = "logit"),
#> data = get_model_data)
#> Sample size = 88801
#> Coef (95% CI) SE p
#> (Intercept) -3.9623 -4.0956 -3.8290 0.0667 0.0000
#> AGE 0.2677 0.2609 0.2744 0.0034 0.0000
#> I(AGE^2) -0.0029 -0.0030 -0.0028 0.0000 0.0000
#> HEALTHVery good 0.0491 0.0043 0.0939 0.0224 0.0282
#> HEALTHGood -0.1248 -0.1737 -0.0760 0.0244 0.0000
#> HEALTHFair -0.6458 -0.7250 -0.5666 0.0396 0.0000
#> HEALTHPoor -0.9561 -1.1209 -0.7913 0.0824 0.0000
Databases are another option for data that cannot fit in memory as an R data.frame. If you have access to a database on a remote machine, then you can easily pull in parts of the data for your analysis. Even if you’ll need to store the database on your machine, it may have more efficient storage of data so your data fits in your memory, or it may use your hard drive.
R’s tools for integrating with databases are improving quickly. The DBI package has been updated, dplyr (through dbplyr) provides a frontend that allows you to write the same code for data in a database as you would for a local data.frame, and packages like sparklyr, sparkR, bigrquery and others provide access to the latest and greatest.
There are many different kinds of databases, each with their own benefits, weaknesses and tradeoffs. As such, it’s hard to give concrete advice without knowing your specific use-case. However, once you’ve chosen a database, in general, there will be two steps: Importing the data into the database and then connecting it to R.
As an example, we’ll use the RSQLite package to load the data into an in-memory database. RSQLite is great because it is easy to set up, but it is probably not efficient enough to help you if you need to use a database because your data doesn’t fit in memory.
When using rectangular extracts, your best bet to import IPUMS data into your database is probably going to be a csv file. Most databases support csv importing, and these implementations will generally be well supported since this is a common file format.
However, if you need a hierarchical extract, or your database software doesn’t support the csv format, then you can use the chunking functions to load the data into a database without storing the full data in R.
# Connect to database
library(DBI)
library(RSQLite)
<- dbConnect(SQLite(), path = ":memory:")
con
# Add data to tables in chunks
<- read_ipums_ddi(cps_ddi_file)
ddi read_ipums_micro_chunked(
ddi,data_file = cps_data_file,
::SideEffectChunkCallback$new(function(x, pos) {
readrif (pos == 1) {
dbWriteTable(con, "cps", x)
else {
} dbWriteTable(con, "cps", x, row.names = FALSE, append = TRUE)
}
}),chunk_size = 1000,
verbose = FALSE
)#> NULL
The dbplyr vignette “dbplyr” (which you can access with
vignette("dbplyr", package = "dbplyr")
) is a good place to
get started learning about how to connect to a database. Here I’ll just
briefly show some examples.
<- tbl(con, "cps")
example
%>%
example filter('AGE' > 25)
#> # Source: lazy query [?? x 14]
#> # Database: sqlite 3.38.5 []
#> YEAR SERIAL HWTSUPP CPSID ASECFLAG FOODSTMP MONTH PERNUM CPSIDP WTSUPP
#> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 2011 2 545. 2.01e13 1 1 3 1 2.01e13 545.
#> 2 2011 3 475. 2.01e13 1 1 3 1 2.01e13 475.
#> 3 2011 3 475. 2.01e13 1 1 3 2 2.01e13 475.
#> 4 2011 4 548. 2.01e13 1 1 3 1 2.01e13 548.
#> 5 2011 4 548. 2.01e13 1 1 3 2 2.01e13 548.
#> 6 2011 5 475. 2.01e13 1 1 3 1 2.01e13 475.
#> 7 2011 6 298. 2.01e13 1 1 3 1 2.01e13 298.
#> 8 2011 6 298. 2.01e13 1 1 3 2 2.01e13 298.
#> 9 2011 6 298. 2.01e13 1 1 3 3 2.01e13 352.
#> 10 2011 6 298. 2.01e13 1 1 3 4 2.01e13 404.
#> # … with more rows, and 4 more variables: AGE <int>, EMPSTAT <int>,
#> # AHRSWORKT <dbl>, HEALTH <int>
Though dbplyr shows us a nice preview of the first rows of the result
of our query, the data still lives in the database. When using a regular
database, in general you’d use the function
dplyr::collect()
to load in the full results of the query
to your R session. However, the database has no concept of IPUMS
attributes like value and variable labels, so if you want them, you can
use ipums_collect()
like so:
%>%
example filter('AGE' > 25) %>%
ipums_collect(ddi)
#> # A tibble: 204,983 × 14
#> YEAR SERIAL HWTSUPP CPSID ASECFLAG FOODSTMP MONTH PERNUM CPSIDP WTSUPP
#> <dbl> <dbl> <dbl> <dbl> <int+lbl> <int+lb> <int+l> <dbl> <dbl> <dbl>
#> 1 2011 2 545. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 1 2.01e13 545.
#> 2 2011 3 475. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 1 2.01e13 475.
#> 3 2011 3 475. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 2 2.01e13 475.
#> 4 2011 4 548. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 1 2.01e13 548.
#> 5 2011 4 548. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 2 2.01e13 548.
#> 6 2011 5 475. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 1 2.01e13 475.
#> 7 2011 6 298. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 1 2.01e13 298.
#> 8 2011 6 298. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 2 2.01e13 298.
#> 9 2011 6 298. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 3 2.01e13 352.
#> 10 2011 6 298. 2.01e13 1 [ASEC] 1 [No] 3 [Mar… 4 2.01e13 404.
#> # … with 204,973 more rows, and 4 more variables: AGE <int+lbl>,
#> # EMPSTAT <int+lbl>, AHRSWORKT <dbl+lbl>, HEALTH <int+lbl>
Big data is a problem for lots of R users, not just IPUMS users, so there are a lot of resources to help you out! These are just a few that I found useful while writing this document:
Bonus joke: Why is the IPUMS website better than any grocery store? Answer: More free samples.↩︎
If you’re interested in learning more about R6, the upcoming revision to Hadley Wickham’s Advanced R book includes a chapter on R6 available for free here↩︎