Integrated variables in IPUMS data often have value labels, which
attach text labels to the values taken by a variable (for example, the
HEALTH variable has value labels: 1 = “Excellent”, 2 = “Very good”,
etc.). The ipumsr package does import these labels, but not as factors,
which may be what you were expecting.
library(ipumsr)
ddi <- read_ipums_ddi(ipums_example("cps_00015.xml"))
cps <- read_ipums_micro(ddi, verbose = FALSE)
cps
#> # A tibble: 10,883 × 13
#> YEAR SERIAL HWTSUPP STATEFIP ASECFLAG MONTH PERNUM WTSUPP AGE EDUC
#> <dbl> <dbl> <dbl> <int+lbl> <int+lb> <int+l> <dbl> <dbl> <int> <int+lbl>
#> 1 2016 24138 3249. 55 [Wisc… 1 [ASEC] 3 [Mar… 1 3249. 54 73 [Hig…
#> 2 2016 24139 3154. 55 [Wisc… 1 [ASEC] 3 [Mar… 1 3154. 54 73 [Hig…
#> 3 2016 24139 3154. 55 [Wisc… 1 [ASEC] 3 [Mar… 2 3154. 52 73 [Hig…
#> 4 2016 24140 1652. 55 [Wisc… 1 [ASEC] 3 [Mar… 1 1652. 38 60 [Gra…
#> 5 2016 24140 1652. 55 [Wisc… 1 [ASEC] 3 [Mar… 2 1503. 15 10 [Gra…
#> 6 2016 24140 1652. 55 [Wisc… 1 [ASEC] 3 [Mar… 3 1652. 38 73 [Hig…
#> 7 2016 24141 3049. 55 [Wisc… 1 [ASEC] 3 [Mar… 1 3049. 85 30 [Gra…
#> 8 2016 24142 1637. 55 [Wisc… 1 [ASEC] 3 [Mar… 1 1637. 27 111 [Bac…
#> 9 2016 24142 1637. 55 [Wisc… 1 [ASEC] 3 [Mar… 2 1637. 27 111 [Bac…
#> 10 2016 24142 1637. 55 [Wisc… 1 [ASEC] 3 [Mar… 3 1887. 2 1 [NIU…
#> # … with 10,873 more rows, and 3 more variables: INCTOT <dbl+lbl>,
#> # HEALTH <int+lbl>, MIGRATE1 <int+lbl>
The first clue that some of the variables are labelled is the
<int+lbl> (or <dbl+lbl>) that appears below STATEFIP, ASECFLAG and other
variables. The tibble package prints the variable’s type information
below the variable name, and this “+lbl” indicates that the variable
uses the labelled type. The function is.labelled() will also tell you if
a variable is labelled.
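As a minimal sketch of that check, the snippet below builds a small labelled vector by hand with haven::labelled() (the data and the names health, plain and df are made up for illustration, not taken from the extract) and uses haven's is.labelled() to find the labelled columns in a data.frame.

```r
library(haven)

# Hypothetical hand-made data, standing in for columns of an IPUMS extract
health <- labelled(c(1L, 2L, 5L), c(Excellent = 1L, `Very good` = 2L, Poor = 5L))
plain <- c(1L, 2L, 5L)

is.labelled(health) # TRUE: the vector carries value labels
is.labelled(plain)  # FALSE: an ordinary integer vector

# List every labelled column of a data.frame
df <- data.frame(id = 1:3)
df$health <- health
labelled_cols <- names(df)[vapply(df, is.labelled, logical(1))]
labelled_cols
```

The same vapply() pattern should work on a full extract such as cps, since each labelled column is stored as a haven labelled vector.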
By default, the ipumsr package attempts to show the value labels when
printing a data.frame (like the "[Wisc…" on STATEFIP above, for
Wisconsin). If your console supports colorful output, then the labels
will be in a lighter gray than the values. You can turn off this
printing behavior by setting the option
options("ipumsr.show_pillar_labels" = FALSE)
However, as you can see, there often isn’t enough space to see the labels when printing out a full data.frame. So, there are a few options to see the labels directly:
# Printing the variable directly (or a subset)
head(cps$MONTH)
#> <labelled<integer>[6]>: Month
#> [1] 3 3 3 3 3 3
#>
#> Labels:
#> value label
#> 1 January
#> 2 February
#> 3 March
#> 4 April
#> 5 May
#> 6 June
#> 7 July
#> 8 August
#> 9 September
#> 10 October
#> 11 November
#> 12 December
# Just get the labels
ipums_val_labels(cps$MONTH)
#> # A tibble: 12 × 2
#> val lbl
#> <int> <chr>
#> 1 1 January
#> 2 2 February
#> 3 3 March
#> 4 4 April
#> 5 5 May
#> 6 6 June
#> 7 7 July
#> 8 8 August
#> 9 9 September
#> 10 10 October
#> 11 11 November
#> 12 12 December
# or if you're working interactively you can use ipums_view
# ipums_view(ddi)
The usual way to connect numeric data to labels in R is in factor
variables. Though this data type is more native to R, and more widely
supported by R code, it was designed for efficient calculations in
linear models, not as a general purpose value labeling system, and so it
is missing important features that the value labels provided by IPUMS
require.
Factors only allow for integers to be mapped to a text label, and
these integers have to be consecutive, starting at 1. This doesn’t work
for IPUMS data because our variables often have specific meanings for
the codes. For example, the variable AGE uses the value to mean the
actual age, but does have labels for age 0 and the top codes.
head(cps$AGE)
#> <labelled<integer>[6]>: Age
#> [1] 54 54 52 38 15 38
#>
#> Labels:
#> value label
#> 0 Under 1 year
#> 90 90 (90+, 1988-2002)
#> 99 99+
cps$AGE_FACTOR <- as_factor(cps$AGE)
head(cps$AGE_FACTOR)
#> [1] 54 54 52 38 15 38
#> 84 Levels: Under 1 year 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ... 99+
It may seem like the new AGE_FACTOR variable is okay, but it can be
very confusing!
mean(cps$AGE)
#> [1] 35.0226
# mean(cps$AGE_FACTOR) # error because data is a factor not numeric
mean(as.numeric(cps$AGE_FACTOR)) # A common mistake
#> [1] 35.94836
# The "more" correct way, but NA because of the text labels
mean(as.numeric(as.character(cps$AGE_FACTOR)))
#> Warning in mean(as.numeric(as.character(cps$AGE_FACTOR))): NAs introduced by
#> coercion
#> [1] NA
Because the factor variable has to assign values starting at 1, but the AGE variable started at 0, most values were 1 higher than they should have been. Not all values are 1 higher though, because not all values exist in the data, so 85, 90, and 99 are 82, 83 and 84 respectively.
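The shift described above can be reproduced with a plain base R factor. This is only a sketch with made-up ages (the vector ages is hypothetical), chosen so that 0 is present and some ages are missing.

```r
# Hypothetical ages: 0 is present and most ages are absent,
# so the factor codes are not simply "age + 1"
ages <- c(0, 2, 54, 85)
age_factor <- factor(ages)

levels(age_factor)                    # "0" "2" "54" "85"
as.numeric(age_factor)                # 1 2 3 4: positions among the levels, not ages
as.numeric(as.character(age_factor))  # 0 2 54 85: the original values
```

Going through as.character() recovers the ages here only because every level is a number; with text levels such as "Under 1 year" it yields NA, as in the output above.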
Other variables have special meanings behind certain codes. For example, often missing or NIU values are indicated in IPUMS by values starting with the number 9 that are offset from the typical values. R’s factors do not allow for this separation, so the missing codes will be harder to distinguish.
Factors also require that every value be labelled, which is not always true in IPUMS data. In the AGE variable, the only values with labels are 0, 90 and 99. For all other values, there is no additional label information.
Though the labelled class carries all of the meaning provided by IPUMS value labels into R, many R functions cannot use the labels or even actively remove them from the data.
ipums_val_labels(cps$HEALTH)
#> # A tibble: 5 × 2
#> val lbl
#> <int> <chr>
#> 1 1 Excellent
#> 2 2 Very good
#> 3 3 Good
#> 4 4 Fair
#> 5 5 Poor
HEALTH2 <- ifelse(cps$HEALTH > 3, 3, cps$HEALTH)
ipums_val_labels(HEALTH2)
#> # A tibble: 0 × 2
#> # … with 2 variables: val <dbl>, lbl <chr>
Therefore, your first task when importing an IPUMS data set will usually be to convert the labelled values to other data structures. The bad news is that there’s no good automatic way to do this; a lot depends on how you plan to use the variables in your analysis and your preferences.
The good news is that the ipumsr package provides several functions to make this process easier.
I think it is easiest to learn them by seeing them in action, so see below for a workflow for bringing in the CPS example extract. For your reference, here is a list of the functions:
as_factor() (reexported from the haven package)
zap_labels() (also from haven)
lbl_clean()
lbl_na_if()
lbl_collapse()
lbl_relabel()
lbl_add()
lbl_add_vals()
as_factor()
The HEALTH variable is structured just like a factor and so can be
converted directly. The as_factor() function is the easiest way to do
so.
ipums_val_labels(cps$HEALTH)
#> # A tibble: 5 × 2
#> val lbl
#> <int> <chr>
#> 1 1 Excellent
#> 2 2 Very good
#> 3 3 Good
#> 4 4 Fair
#> 5 5 Poor
cps$HEALTH <- as_factor(cps$HEALTH)
The ASECFLAG and MONTH variables can also be converted directly (these were included by default by the IPUMS extract engine, but aren’t useful here because this data set only has respondents from ASEC in March).
cps$ASECFLAG <- as_factor(cps$ASECFLAG)
cps$MONTH <- as_factor(cps$MONTH)
as_factor() also works on data.frames, converting every labelled
variable to a factor. However, this can create confusing variables like
AGE_FACTOR from above, so this isn’t the best thing to do right away.
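To illustrate the data.frame behavior, here is a small sketch using a hand-made data.frame (df and its columns are hypothetical, not the CPS extract). It relies on haven's as_factor(), which by default leaves unlabelled columns alone.

```r
library(haven)

# Hypothetical data.frame with one plain and two labelled columns
df <- data.frame(id = 1:3)
df$age <- labelled(c(0, 54, 99), c(`Under 1 year` = 0, `99+` = 99))
df$health <- labelled(c(1, 2, 1), c(Excellent = 1, `Very good` = 2))

# Convert every labelled column in one call
converted <- as_factor(df)

class(converted$id)      # still an integer column
class(converted$health)  # now a factor
levels(converted$age)    # mixes labels and raw values, like AGE_FACTOR above
```

The age column shows why converting everything at once is risky: a partially labelled numeric variable silently becomes a factor.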
zap_labels()
I may decide that for my analysis, the AGE variable is most useful as
numeric values. The zap_labels() function removes the labels.
cps$AGE <- zap_labels(cps$AGE)
The top-codes (which are only available on the CPS website) indicate that the value 80 actually indicates 80-84 and 85 indicates 85+, so another option would be to convert them to a factor with age ranges.
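One way to build such an age-range factor is base R's cut(). The sketch below uses made-up ages, and the breaks below 80 are arbitrary assumptions chosen for illustration; only the 80-84 and 85+ groups come from the top-codes mentioned above.

```r
# Hypothetical ages, already stripped of labels (for example with zap_labels())
age <- c(2, 15, 38, 54, 80, 85)

# right = FALSE makes each interval closed on the left: [0, 18), [18, 65), ...
age_group <- cut(
  age,
  breaks = c(0, 18, 65, 80, 85, Inf),
  labels = c("0-17", "18-64", "65-79", "80-84", "85+"),
  right = FALSE
)

table(age_group)
```

Because the top-coded values 80 and 85 each fall into their own interval, the resulting factor keeps them apart from the exact ages.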
lbl_clean()
This extract only contains data from a few states and I don’t want
the factor to have a level for the unused ones. The lbl_clean() function
keeps only labels for values in the current data set and returns a
labelled variable which we can convert to a factor.
ipums_val_labels(cps$STATEFIP)
#> # A tibble: 75 × 2
#> val lbl
#> <int> <chr>
#> 1 1 Alabama
#> 2 2 Alaska
#> 3 4 Arizona
#> 4 5 Arkansas
#> 5 6 California
#> 6 8 Colorado
#> 7 9 Connecticut
#> 8 10 Delaware
#> 9 11 District of Columbia
#> 10 12 Florida
#> # … with 65 more rows
cps$STATEFIP <- lbl_clean(cps$STATEFIP)
ipums_val_labels(cps$STATEFIP)
#> # A tibble: 5 × 2
#> val lbl
#> <int> <chr>
#> 1 19 Iowa
#> 2 27 Minnesota
#> 3 38 North Dakota
#> 4 46 South Dakota
#> 5 55 Wisconsin
cps$STATEFIP <- as_factor(cps$STATEFIP)
lbl_na_if()
The INCTOT variable has 2 labelled values that are not actually incomes: 99999998 indicates “Missing” and 99999999 indicates “Not in Universe”. On the CPS website, the Universe tab indicates that the Universe for 2016 is respondents age 15+. Let’s say for my analysis, I can treat these values as missing.
The lbl_na_if() function takes the variable and a function that refers
to .val and .lbl (the values and labels respectively). That function
returns an indicator of whether to set those values to NA and remove the
label. You can also use the ~ notation from the purrr package to create
succinct anonymous functions.
The .val and .lbl only refer to values that already have labels; they
do not apply to unlabeled values. See the lbl_add() and lbl_add_vals()
functions below for working with unlabeled values.
# Caution: R defaults to printing large numbers like 99999999 in rounded
# exponential format (1e+08) but that's not how they are actually stored
ipums_val_labels(cps$INCTOT)
#> # A tibble: 2 × 2
#> val lbl
#> <dbl> <chr>
#> 1 99999998 Missing.
#> 2 99999999 N.I.U. (Not in Universe).
# All of these are equivalent
INCTOT1 <- lbl_na_if(cps$INCTOT, ~.val >= 99999990)
INCTOT2 <- lbl_na_if(cps$INCTOT, ~.lbl %in% c("Missing.", "N.I.U. (Not in Universe)."))
INCTOT3 <- lbl_na_if(cps$INCTOT, function(.val, .lbl) {
  is_missing <- .val == 99999998
  is_niu <- .lbl == "N.I.U. (Not in Universe)."
  return(is_missing | is_niu)
})
# Change to a factor in the original cps data.frame
cps$INCTOT <- lbl_na_if(cps$INCTOT, ~.val >= 99999990)
cps$INCTOT <- as_factor(cps$INCTOT)
lbl_collapse()
The EDUC variable provides an example of a common IPUMS practice of grouping categories together by the starting digits. For example, the value 10 indicates “Grades 1, 2, 3, or 4”, and 11 - “Grade 1”, 12 - “Grade 2”, etc.
Let’s say that I only care about those more general categories
provided by the first 2 digits. The lbl_collapse() function allows me to
provide a function that takes .val and .lbl and returns the value to
assign it to. If that code is already used, then all of the values
assigned to it will get that label; otherwise the label of the smallest
value is used. Just like with lbl_na_if(), the purrr-style compact
syntax using ~ functions is supported.
ipums_val_labels(cps$EDUC)
#> # A tibble: 36 × 2
#> val lbl
#> <int> <chr>
#> 1 0 NIU or no schooling
#> 2 1 NIU or blank
#> 3 2 None or preschool
#> 4 10 Grades 1, 2, 3, or 4
#> 5 11 Grade 1
#> 6 12 Grade 2
#> 7 13 Grade 3
#> 8 14 Grade 4
#> 9 20 Grades 5 or 6
#> 10 21 Grade 5
#> # … with 26 more rows
# %/% is integer division, which divides by the number but doesn't keep the remainder
cps$EDUC <- lbl_collapse(cps$EDUC, ~.val %/% 10)
ipums_val_labels(cps$EDUC)
#> # A tibble: 14 × 2
#> val lbl
#> <dbl> <chr>
#> 1 0 NIU or no schooling
#> 2 1 Grades 1, 2, 3, or 4
#> 3 2 Grades 5 or 6
#> 4 3 Grades 7 or 8
#> 5 4 Grade 9
#> 6 5 Grade 10
#> 7 6 Grade 11
#> 8 7 Grade 12
#> 9 8 1 year of college
#> 10 9 2 years of college
#> 11 10 3 years of college
#> 12 11 4 years of college
#> 13 12 5+ years of college
#> 14 99 Missing/Unknown
cps$EDUC <- as_factor(cps$EDUC)
lbl_relabel()
Sometimes you may wish to move the labels into new categories. For example, the categories in MIGRATE1 may not quite map to what I want to use in my analysis.
The lbl_relabel() function provides a more flexible way to group
existing labelled values into new ones. It takes two-sided formulas,
where the left-hand side is a label (defined with the lbl() function)
and the right-hand side is an expression that can use .val and .lbl to
evaluate to a logical indicating which values should be assigned to
this label.
ipums_val_labels(cps$MIGRATE1)
#> # A tibble: 8 × 2
#> val lbl
#> <int> <chr>
#> 1 0 NIU
#> 2 1 Same house
#> 3 2 Different house, place not reported
#> 4 3 Moved within county
#> 5 4 Moved within state, different county
#> 6 5 Moved between states
#> 7 6 Abroad
#> 8 9 Unknown
cps$MIGRATE1 <- lbl_relabel(
  cps$MIGRATE1,
  lbl(0, "NIU / Missing / Unknown") ~ .val %in% c(0, 2, 9),
  lbl(1, "Stayed in state") ~ .val %in% c(1, 3, 4)
)
ipums_val_labels(cps$MIGRATE1)
#> # A tibble: 4 × 2
#> val lbl
#> <dbl> <chr>
#> 1 0 NIU / Missing / Unknown
#> 2 1 Stayed in state
#> 3 5 Moved between states
#> 4 6 Abroad
cps$MIGRATE1 <- as_factor(cps$MIGRATE1)
lbl_add() and lbl_add_vals()
These functions allow you to create labels for values that aren’t already labelled. It’s harder to come up with real-world examples of when these functions would be useful, but just in case you come across such a situation, here’s how they work.
x <- haven::labelled(
  c(100, 200, 105, 990, 999, 230),
  c(`Unknown` = 990, NIU = 999)
)
lbl_add(x, lbl(100, "$100"), lbl(105, "$105"), lbl(200, "$200"), lbl(230, "$230"))
#> <labelled<double>[6]>
#> [1] 100 200 105 990 999 230
#>
#> Labels:
#> value label
#> 100 $100
#> 105 $105
#> 200 $200
#> 230 $230
#> 990 Unknown
#> 999 NIU
lbl_add_vals(x, ~paste0("$", .))
#> <labelled<double>[6]>
#> [1] 100 200 105 990 999 230
#>
#> Labels:
#> value label
#> 100 $100
#> 105 $105
#> 200 $200
#> 230 $230
#> 990 Unknown
#> 999 NIU
And now, after converting all those labels to factors, I’m ready for analysis! If you think of any other helper functions that would be useful for dealing with labels, please let us know by filing an issue on GitHub.
One implementation detail that may help you understand the lbl_* functions better is that the value labels are stored separately from the actual data. This can be important because it allows for values to exist in the data without labels (such as the non-special codes in the INCTOT variable from the example above) and also for value labels to exist even if their values don’t occur in the data (like the STATEFIP codes for states that we didn’t include in our extract).
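The separation can be inspected directly. The sketch below uses a hand-made haven labelled vector (x and its values are hypothetical) and reads the "labels" attribute, which is where haven keeps the value labels as a named vector.

```r
library(haven)

# Hypothetical data: 990 and 999 are labelled and present,
# 110 is labelled but never occurs, 100 and 200 occur without labels
x <- labelled(
  c(100, 200, 990, 999),
  c(`$110` = 110, Unknown = 990, NIU = 999)
)

# The labels live in an attribute, separate from the data values
attr(x, "labels")

# Strip classes, attributes and names to compare the two sets of values
data_values <- as.vector(unclass(x))
label_values <- unname(attr(x, "labels"))

unlabelled_values <- setdiff(data_values, label_values) # 100 200
unused_labels <- setdiff(label_values, data_values)     # 110
```

Only the values in label_values are what the .val pronoun sees in the lbl_* helpers.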
The .val and .lbl pronouns that are usable in functions like
lbl_na_if(), lbl_collapse() and lbl_relabel() only include these
labelled values, not all values in the dataset. Though this makes many
calculations simpler, because only the labelled values are considered,
it can be confusing when you want to work with the unlabeled values.
For example, considering the INCTOT variable, if there are unlabeled
values that you want to set to NA, you cannot use lbl_na_if() directly.
# Reload cps data so that INCTOT is a labelled class again
cps <- read_ipums_micro(ddi, verbose = FALSE)
# Try to set all values above 1000000 to NA
test1 <- lbl_na_if(cps$INCTOT, ~.val > 1000000)
test1 <- zap_labels(test1)
max(test1, na.rm = TRUE)
#> [1] 1230006
# Didn't work
Instead you should add the value labels with lbl_add_vals() (or you
could use a function that doesn’t use the labels, such as
dplyr::na_if()).
test2 <- lbl_add_vals(cps$INCTOT)
test2 <- lbl_na_if(test2, ~.val > 1000000)
test2 <- zap_labels(test2)
max(test2, na.rm = TRUE)
#> [1] 990003
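As a label-independent alternative, plain logical indexing in base R treats every value the same way, labelled or not. This is only a sketch on a made-up numeric vector (inctot is hypothetical), standing in for a variable whose labels have already been removed.

```r
# Hypothetical incomes; 99999999 stands in for a special code
inctot <- c(25000, 1230006, 990003, 99999999)

# Logical indexing looks at every value, whether or not it ever had a label
inctot[inctot > 1000000] <- NA

max(inctot, na.rm = TRUE)
```

The trade-off is that this approach knows nothing about the labels, so it suits variables where you no longer need them.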
The haven package vignette ‘semantics’ has some more details about
the motivation and implementation of the labelled class. You can view it
by running the command:
vignette("semantics", package = "haven")
The labelled package provides other methods for manipulating value
labels. It is not installed by ipumsr, but is available on CRAN via the
following command:
install.packages("labelled")
The questionr package includes great functions for exploring labelled
variables. In particular, the functions describe, freq and lookfor all
print information about the variable to the console using the value
labels. It is also not installed by ipumsr, but can be installed from
CRAN using:
install.packages("questionr")
Finally, the foreign and prettyR packages don’t use the labelled class
data structure from haven (which ipumsr uses), but do have very similar
concepts for attaching value labels. Code designed for these packages
could be adapted for use with the haven labelled class without too much
difficulty.