This vignette describes the use of the new
pivot_longer()
and pivot_wider()
functions.
Their goal is to improve the usability of gather()
and
spread()
, and incorporate state-of-the-art features found
in other packages.
For some time, it’s been obvious that there is something
fundamentally wrong with the design of spread()
and
gather()
. Many people don’t find the names intuitive and
find it hard to remember which direction corresponds to spreading and
which to gathering. It also seems surprisingly hard to remember the
arguments to these functions, meaning that many people (including me!)
have to consult the documentation every time.
There are two important new features inspired by other R packages that have been advancing reshaping in R:
pivot_longer()
can work with multiple value
variables that may have different types, inspired by the enhanced
melt()
and dcast()
functions provided by the
data.table
package by Matt Dowle and Arun Srinivasan.
pivot_longer()
and pivot_wider()
can
take a data frame that specifies precisely how metadata stored in column
names becomes data variables (and vice versa), inspired by the cdata package by John
Mount and Nina Zumel.
In this vignette, you’ll learn the key ideas behind
pivot_longer()
and pivot_wider()
as you see
them used to solve a variety of data reshaping challenges ranging from
simple to complex.
To begin we’ll load some needed packages. In real analysis code, I’d
imagine you’d do with the library(tidyverse)
, but I can’t
do that here since this vignette is embedded in a package.
library(tidyr)
library(dplyr)
library(readr)
pivot_longer()
makes datasets longer by
increasing the number of rows and decreasing the number of columns. I
don’t believe it makes sense to describe a dataset as being in “long
form”. Length is a relative term, and you can only say (e.g.) that
dataset A is longer than dataset B.
pivot_longer()
is commonly needed to tidy wild-caught
datasets as they often optimise for ease of data entry or ease of
comparison rather than ease of analysis. The following sections show how
to use pivot_longer()
for a wide range of realistic
datasets.
The relig_income
dataset stores counts based on a survey
which (among other things) asked people about their religion and annual
income:
relig_income#> # A tibble: 18 × 11
#> religion `<$10k` $10-2…¹ $20-3…² $30-4…³ $40-5…⁴ $50-7…⁵ $75-1…⁶ $100-…⁷
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Agnostic 27 34 60 81 76 137 122 109
#> 2 Atheist 12 27 37 52 35 70 73 59
#> 3 Buddhist 27 21 30 34 33 58 62 39
#> 4 Catholic 418 617 732 670 638 1116 949 792
#> 5 Don’t know/r… 15 14 15 11 10 35 21 17
#> 6 Evangelical … 575 869 1064 982 881 1486 949 723
#> 7 Hindu 1 9 7 9 11 34 47 48
#> 8 Historically… 228 244 236 238 197 223 131 81
#> 9 Jehovah's Wi… 20 27 24 24 21 30 15 11
#> 10 Jewish 19 19 25 25 30 95 69 87
#> # … with 8 more rows, 2 more variables: `>150k` <dbl>,
#> # `Don't know/refused` <dbl>, and abbreviated variable names ¹`$10-20k`,
#> # ²`$20-30k`, ³`$30-40k`, ⁴`$40-50k`, ⁵`$50-75k`, ⁶`$75-100k`, ⁷`$100-150k`
This dataset contains three variables:
religion
, stored in the rows,income
spread across the column names, andcount
stored in the cell values.To tidy it we use pivot_longer()
:
%>%
relig_income pivot_longer(!religion, names_to = "income", values_to = "count")
#> # A tibble: 180 × 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # … with 170 more rows
The first argument is the dataset to reshape,
relig_income
.
The second argument describes which columns need to be reshaped.
In this case, it’s every column apart from
religion
.
The names_to
gives the name of the variable that
will be created from the data stored in the column names,
i.e. income
.
The values_to
gives the name of the variable that
will be created from the data stored in the cell value,
i.e. count
.
Neither the names_to
nor the values_to
column exists in relig_income
, so we provide them as
character strings surrounded in quotes.
The billboard
dataset records the billboard rank of
songs in the year 2000. It has a form similar to the
relig_income
data, but the data encoded in the column names
is really a number, not a string.
billboard#> # A tibble: 317 × 79
#> artist track date.ent…¹ wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 wk9
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA NA
#> 2 2Ge+h… The … 2000-09-02 91 87 92 NA NA NA NA NA NA
#> 3 3 Doo… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 51
#> 4 3 Doo… Loser 2000-10-21 76 76 72 69 67 65 55 59 62
#> 5 504 B… Wobb… 2000-04-15 57 34 25 17 17 31 36 49 53
#> 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 3
#> 7 A*Tee… Danc… 2000-07-08 97 97 96 95 100 NA NA NA NA
#> 8 Aaliy… I Do… 2000-01-29 84 62 51 41 38 35 35 38 38
#> 9 Aaliy… Try … 2000-03-18 59 53 38 28 21 18 16 14 12
#> 10 Adams… Open… 2000-08-26 76 76 74 69 68 67 61 58 57
#> # … with 307 more rows, 67 more variables: wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#> # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#> # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#> # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#> # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#> # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
#> # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
We can start with the same basic specification as for the
relig_income
dataset. Here we want the names to become a
variable called week
, and the values to become a variable
called rank
. I also use values_drop_na
to drop
rows that correspond to missing values. Not every song stays in the
charts for all 76 weeks, so the structure of the input data force the
creation of unnecessary explicit NA
s.
%>%
billboard pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)#> # A tibble: 5,307 × 5
#> artist track date.entered week rank
#> <chr> <chr> <date> <chr> <dbl>
#> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
#> 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
#> 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
#> 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
#> 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
#> 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
#> 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
#> 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
#> 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
#> 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
#> # … with 5,297 more rows
It would be nice to easily determine how long each song stayed in the
charts, but to do that, we’ll need to convert the week
variable to an integer. We can do that by using two additional
arguments: names_prefix
strips off the wk
prefix, and names_transform
converts week
into
an integer:
%>%
billboard pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_prefix = "wk",
names_transform = list(week = as.integer),
values_to = "rank",
values_drop_na = TRUE,
)
Alternatively, you could do this with a single argument by using
readr::parse_number()
which automatically strips
non-numeric components:
%>%
billboard pivot_longer(
cols = starts_with("wk"),
names_to = "week",
names_transform = list(week = readr::parse_number),
values_to = "rank",
values_drop_na = TRUE,
)
A more challenging situation occurs when you have multiple variables
crammed into the column names. For example, take the who
dataset:
who#> # A tibble: 7,240 × 60
#> country iso2 iso3 year new_s…¹ new_s…² new_s…³ new_s…⁴ new_s…⁵ new_s…⁶
#> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int>
#> 1 Afghanistan AF AFG 1980 NA NA NA NA NA NA
#> 2 Afghanistan AF AFG 1981 NA NA NA NA NA NA
#> 3 Afghanistan AF AFG 1982 NA NA NA NA NA NA
#> 4 Afghanistan AF AFG 1983 NA NA NA NA NA NA
#> 5 Afghanistan AF AFG 1984 NA NA NA NA NA NA
#> 6 Afghanistan AF AFG 1985 NA NA NA NA NA NA
#> 7 Afghanistan AF AFG 1986 NA NA NA NA NA NA
#> 8 Afghanistan AF AFG 1987 NA NA NA NA NA NA
#> 9 Afghanistan AF AFG 1988 NA NA NA NA NA NA
#> 10 Afghanistan AF AFG 1989 NA NA NA NA NA NA
#> # … with 7,230 more rows, 50 more variables: new_sp_m65 <int>,
#> # new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
#> # new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#> # new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
#> # new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
#> # new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
#> # new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>, …
country
, iso2
, iso3
, and
year
are already variables, so they can be left as is. But
the columns from new_sp_m014
to newrel_f65
encode four variables in their names:
The new_
/new
prefix indicates these are
counts of new cases. This dataset only contains new cases, so we’ll
ignore it here because it’s constant.
sp
/rel
/ep
describe how the
case was diagnosed.
m
/f
gives the gender.
014
/1524
/2535
/3544
/4554
/65
supplies the age range.
We can break these variables up by specifying multiple column names
in names_to
, and then either providing
names_sep
or names_pattern
. Here
names_pattern
is the most natural fit. It has a similar
interface to extract
: you give it a regular expression
containing groups (defined by ()
) and it puts each group in
a column.
%>% pivot_longer(
who cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
)#> # A tibble: 405,440 × 8
#> country iso2 iso3 year diagnosis gender age count
#> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int>
#> 1 Afghanistan AF AFG 1980 sp m 014 NA
#> 2 Afghanistan AF AFG 1980 sp m 1524 NA
#> 3 Afghanistan AF AFG 1980 sp m 2534 NA
#> 4 Afghanistan AF AFG 1980 sp m 3544 NA
#> 5 Afghanistan AF AFG 1980 sp m 4554 NA
#> 6 Afghanistan AF AFG 1980 sp m 5564 NA
#> 7 Afghanistan AF AFG 1980 sp m 65 NA
#> 8 Afghanistan AF AFG 1980 sp f 014 NA
#> 9 Afghanistan AF AFG 1980 sp f 1524 NA
#> 10 Afghanistan AF AFG 1980 sp f 2534 NA
#> # … with 405,430 more rows
We could go one step further use readr functions to convert the gender and age to factors. I think this is good practice when you have categorical variables with a known set of values.
%>% pivot_longer(
who cols = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
names_transform = list(
gender = ~ readr::parse_factor(.x, levels = c("f", "m")),
age = ~ readr::parse_factor(
.x,levels = c("014", "1524", "2534", "3544", "4554", "5564", "65"),
ordered = TRUE
)
),values_to = "count",
)
So far, we have been working with data frames that have one observation per row, but many important pivotting problems involve multiple observations per row. You can usually recognise this case because name of the column that you want to appear in the output is part of the column name in the input. In this section, you’ll learn how to pivot this sort of data.
The following example is adapted from the data.table vignette, as inspiration for tidyr’s solution to this problem.
<- tribble(
family ~family, ~dob_child1, ~dob_child2, ~gender_child1, ~gender_child2,
"1998-11-26", "2000-01-29", 1L, 2L,
1L, "1996-06-22", NA, 2L, NA,
2L, "2002-07-11", "2004-04-05", 2L, 2L,
3L, "2004-10-10", "2009-08-27", 1L, 1L,
4L, "2000-12-05", "2005-02-28", 2L, 1L,
5L,
)<- family %>% mutate_at(vars(starts_with("dob")), parse_date)
family
family#> # A tibble: 5 × 5
#> family dob_child1 dob_child2 gender_child1 gender_child2
#> <int> <date> <date> <int> <int>
#> 1 1 1998-11-26 2000-01-29 1 2
#> 2 2 1996-06-22 NA 2 NA
#> 3 3 2002-07-11 2004-04-05 2 2
#> 4 4 2004-10-10 2009-08-27 1 1
#> 5 5 2000-12-05 2005-02-28 2 1
Note that we have two pieces of information (or values) for each
child: their gender
and their dob
(date of
birth). These need to go into separate columns in the result. Again we
supply multiple variables to names_to
, using
names_sep
to split up each variable name. Note the special
name .value
: this tells pivot_longer()
that
that part of the column name specifies the “value” being measured (which
will become a variable in the output).
%>%
family pivot_longer(
!family,
names_to = c(".value", "child"),
names_sep = "_",
values_drop_na = TRUE
)#> # A tibble: 9 × 4
#> family child dob gender
#> <int> <chr> <date> <int>
#> 1 1 child1 1998-11-26 1
#> 2 1 child2 2000-01-29 2
#> 3 2 child1 1996-06-22 2
#> 4 3 child1 2002-07-11 2
#> 5 3 child2 2004-04-05 2
#> 6 4 child1 2004-10-10 1
#> 7 4 child2 2009-08-27 1
#> 8 5 child1 2000-12-05 2
#> 9 5 child2 2005-02-28 1
Note the use of values_drop_na = TRUE
: the input shape
forces the creation of explicit missing variables for observations that
don’t exist.
This problem also exists in the anscombe
dataset built
in to base R:
anscombe#> x1 x2 x3 x4 y1 y2 y3 y4
#> 1 10 10 10 8 8.04 9.14 7.46 6.58
#> 2 8 8 8 8 6.95 8.14 6.77 5.76
#> 3 13 13 13 8 7.58 8.74 12.74 7.71
#> 4 9 9 9 8 8.81 8.77 7.11 8.84
#> 5 11 11 11 8 8.33 9.26 7.81 8.47
#> 6 14 14 14 8 9.96 8.10 8.84 7.04
#> 7 6 6 6 8 7.24 6.13 6.08 5.25
#> 8 4 4 4 19 4.26 3.10 5.39 12.50
#> 9 12 12 12 8 10.84 9.13 8.15 5.56
#> 10 7 7 7 8 4.82 7.26 6.42 7.91
#> 11 5 5 5 8 5.68 4.74 5.73 6.89
This dataset contains four pairs of variables (x1
and
y1
, x2
and y2
, etc) that underlie
Anscombe’s quartet, a collection of four datasets that have the same
summary statistics (mean, sd, correlation etc), but have quite different
data. We want to produce a dataset with columns set
,
x
and y
.
%>%
anscombe pivot_longer(everything(),
names_to = c(".value", "set"),
names_pattern = "(.)(.)"
%>%
) arrange(set)
#> # A tibble: 44 × 3
#> set x y
#> <chr> <dbl> <dbl>
#> 1 1 10 8.04
#> 2 1 8 6.95
#> 3 1 13 7.58
#> 4 1 9 8.81
#> 5 1 11 8.33
#> 6 1 14 9.96
#> 7 1 6 7.24
#> 8 1 4 4.26
#> 9 1 12 10.8
#> 10 1 7 4.82
#> # … with 34 more rows
A similar situation can arise with panel data. For example, take this
example dataset provided by Thomas Leeper. We
can tidy it using the same approach as for anscombe
:
<- tibble(
pnl x = 1:4,
a = c(1, 1,0, 0),
b = c(0, 1, 1, 1),
y1 = rnorm(4),
y2 = rnorm(4),
z1 = rep(3, 4),
z2 = rep(-2, 4),
)
%>%
pnl pivot_longer(
!c(x, a, b),
names_to = c(".value", "time"),
names_pattern = "(.)(.)"
)#> # A tibble: 8 × 6
#> x a b time y z
#> <int> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 1 0 1 -1.18 3
#> 2 1 1 0 2 0.804 -2
#> 3 2 1 1 1 -1.19 3
#> 4 2 1 1 2 0.0962 -2
#> 5 3 0 1 1 1.78 3
#> 6 3 0 1 2 0.420 -2
#> 7 4 0 1 1 0.933 3
#> 8 4 0 1 2 1.07 -2
Occassionally you will come across datasets that have duplicated column names. Generally, such datasets are hard to work with in R, because when you refer to a column by name it only finds the first match. To create a tibble with duplicated names, you have to explicitly opt out of the name repair that usually prevents you from creating such a dataset:
<- tibble(id = 1:3, y = 4:6, y = 5:7, y = 7:9, .name_repair = "minimal")
df
df#> # A tibble: 3 × 4
#> id y y y
#> <int> <int> <int> <int>
#> 1 1 4 5 7
#> 2 2 5 6 8
#> 3 3 6 7 9
When pivot_longer()
encounters such data, it
automatically adds another column to the output:
%>% pivot_longer(!id, names_to = "name", values_to = "value")
df #> # A tibble: 9 × 3
#> id name value
#> <int> <chr> <int>
#> 1 1 y 4
#> 2 1 y 5
#> 3 1 y 7
#> 4 2 y 5
#> 5 2 y 6
#> 6 2 y 8
#> 7 3 y 6
#> 8 3 y 7
#> 9 3 y 9
A similar process is applied when multiple input columns are mapped to the same output column, as in the following example where we ignore the numeric suffix on each column name:
<- tibble(id = 1:3, x1 = 4:6, x2 = 5:7, y1 = 7:9, y2 = 10:12)
df %>% pivot_longer(!id, names_to = ".value", names_pattern = "(.).")
df #> # A tibble: 6 × 3
#> id x y
#> <int> <int> <int>
#> 1 1 4 7
#> 2 1 5 10
#> 3 2 5 8
#> 4 2 6 11
#> 5 3 6 9
#> 6 3 7 12
pivot_wider()
is the opposite of
pivot_longer()
: it makes a dataset wider
by increasing the number of columns and decreasing the number of rows.
It’s relatively rare to need pivot_wider()
to make tidy
data, but it’s often useful for creating summary tables for
presentation, or data in a format needed by other tools.
The fish_encounters
dataset, contributed by Myfanwy
Johnston, describes when fish swimming down a river are detected by
automatic monitoring stations:
fish_encounters#> # A tibble: 114 × 3
#> fish station seen
#> <fct> <fct> <int>
#> 1 4842 Release 1
#> 2 4842 I80_1 1
#> 3 4842 Lisbon 1
#> 4 4842 Rstr 1
#> 5 4842 Base_TD 1
#> 6 4842 BCE 1
#> 7 4842 BCW 1
#> 8 4842 BCE2 1
#> 9 4842 BCW2 1
#> 10 4842 MAE 1
#> # … with 104 more rows
Many tools used to analyse this data need it in a form where each station is a column:
%>% pivot_wider(names_from = station, values_from = seen)
fish_encounters #> # A tibble: 19 × 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 NA NA NA NA NA NA
#> 5 4847 1 1 1 NA NA NA NA NA NA NA NA
#> 6 4848 1 1 1 1 NA NA NA NA NA NA NA
#> 7 4849 1 1 NA NA NA NA NA NA NA NA NA
#> 8 4850 1 1 NA 1 1 1 1 NA NA NA NA
#> 9 4851 1 1 NA NA NA NA NA NA NA NA NA
#> 10 4854 1 1 NA NA NA NA NA NA NA NA NA
#> # … with 9 more rows
This dataset only records when a fish was detected by the station -
it doesn’t record when it wasn’t detected (this is common with this type
of data). That means the output data is filled with NA
s.
However, in this case we know that the absence of a record means that
the fish was not seen
, so we can ask
pivot_wider()
to fill these missing values in with
zeros:
%>% pivot_wider(
fish_encounters names_from = station,
values_from = seen,
values_fill = 0
)#> # A tibble: 19 × 12
#> fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
#> <fct> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 4842 1 1 1 1 1 1 1 1 1 1 1
#> 2 4843 1 1 1 1 1 1 1 1 1 1 1
#> 3 4844 1 1 1 1 1 1 1 1 1 1 1
#> 4 4845 1 1 1 1 1 0 0 0 0 0 0
#> 5 4847 1 1 1 0 0 0 0 0 0 0 0
#> 6 4848 1 1 1 1 0 0 0 0 0 0 0
#> 7 4849 1 1 0 0 0 0 0 0 0 0 0
#> 8 4850 1 1 0 1 1 1 1 0 0 0 0
#> 9 4851 1 1 0 0 0 0 0 0 0 0 0
#> 10 4854 1 1 0 0 0 0 0 0 0 0 0
#> # … with 9 more rows
You can also use pivot_wider()
to perform simple
aggregation. For example, take the warpbreaks
dataset built
in to base R (converted to a tibble for the better print method):
<- warpbreaks %>% as_tibble() %>% select(wool, tension, breaks)
warpbreaks
warpbreaks#> # A tibble: 54 × 3
#> wool tension breaks
#> <fct> <fct> <dbl>
#> 1 A L 26
#> 2 A L 30
#> 3 A L 54
#> 4 A L 25
#> 5 A L 70
#> 6 A L 52
#> 7 A L 51
#> 8 A L 26
#> 9 A L 67
#> 10 A M 18
#> # … with 44 more rows
This is a designed experiment with nine replicates for every
combination of wool
(A
and B
) and
tension
(L
, M
,
H
):
%>% count(wool, tension)
warpbreaks #> # A tibble: 6 × 3
#> wool tension n
#> <fct> <fct> <int>
#> 1 A L 9
#> 2 A M 9
#> 3 A H 9
#> 4 B L 9
#> 5 B M 9
#> 6 B H 9
What happens if we attempt to pivot the levels of wool
into the columns?
%>% pivot_wider(names_from = wool, values_from = breaks)
warpbreaks #> Warning: Values from `breaks` are not uniquely identified; output will contain list-cols.
#> * Use `values_fn = list` to suppress this warning.
#> * Use `values_fn = {summary_fun}` to summarise duplicates.
#> * Use the following dplyr code to identify duplicates.
#> {data} %>%
#> dplyr::group_by(tension, wool) %>%
#> dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
#> dplyr::filter(n > 1L)
#> # A tibble: 3 × 3
#> tension A B
#> <fct> <list> <list>
#> 1 L <dbl [9]> <dbl [9]>
#> 2 M <dbl [9]> <dbl [9]>
#> 3 H <dbl [9]> <dbl [9]>
We get a warning that each cell in the output corresponds to multiple
cells in the input. The default behaviour produces list-columns, which
contain all the individual values. A more useful output would be summary
statistics, e.g. mean
breaks for each combination of wool
and tension:
%>%
warpbreaks pivot_wider(
names_from = wool,
values_from = breaks,
values_fn = list(breaks = mean)
)#> # A tibble: 3 × 3
#> tension A B
#> <fct> <dbl> <dbl>
#> 1 L 44.6 28.2
#> 2 M 24 28.8
#> 3 H 24.6 18.8
For more complex summary operations, I recommend summarising before
reshaping, but for simple cases it’s often convenient to summarise
within pivot_wider()
.
Imagine, as in https://stackoverflow.com/questions/24929954, that we have information containing the combination of product, country, and year. In tidy form it might look like this:
<- expand_grid(
production product = c("A", "B"),
country = c("AI", "EI"),
year = 2000:2014
%>%
) filter((product == "A" & country == "AI") | product == "B") %>%
mutate(production = rnorm(nrow(.)))
production#> # A tibble: 45 × 4
#> product country year production
#> <chr> <chr> <int> <dbl>
#> 1 A AI 2000 -1.76
#> 2 A AI 2001 -0.926
#> 3 A AI 2002 0.198
#> 4 A AI 2003 -1.61
#> 5 A AI 2004 2.20
#> 6 A AI 2005 0.168
#> 7 A AI 2006 0.446
#> 8 A AI 2007 1.40
#> 9 A AI 2008 1.33
#> 10 A AI 2009 -1.58
#> # … with 35 more rows
We want to widen the data so we have one column for each combination
of product
and country
. The key is to specify
multiple variables for names_from
:
%>% pivot_wider(
production names_from = c(product, country),
values_from = production
)#> # A tibble: 15 × 4
#> year A_AI B_AI B_EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 -1.76 -1.73 0.991
#> 2 2001 -0.926 0.226 -0.0261
#> 3 2002 0.198 -2.01 1.09
#> 4 2003 -1.61 0.409 -1.23
#> 5 2004 2.20 -0.502 1.96
#> 6 2005 0.168 -0.325 -0.0227
#> 7 2006 0.446 -0.0750 -0.435
#> 8 2007 1.40 -0.838 0.201
#> 9 2008 1.33 0.132 -0.126
#> 10 2009 -1.58 -0.517 -1.22
#> # … with 5 more rows
When either names_from
or values_from
select multiple variables, you can control how the column names in the
output constructed with names_sep
and
names_prefix
, or the workhorse names_glue
:
%>% pivot_wider(
production names_from = c(product, country),
values_from = production,
names_sep = ".",
names_prefix = "prod."
)#> # A tibble: 15 × 4
#> year prod.A.AI prod.B.AI prod.B.EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 -1.76 -1.73 0.991
#> 2 2001 -0.926 0.226 -0.0261
#> 3 2002 0.198 -2.01 1.09
#> 4 2003 -1.61 0.409 -1.23
#> 5 2004 2.20 -0.502 1.96
#> 6 2005 0.168 -0.325 -0.0227
#> 7 2006 0.446 -0.0750 -0.435
#> 8 2007 1.40 -0.838 0.201
#> 9 2008 1.33 0.132 -0.126
#> 10 2009 -1.58 -0.517 -1.22
#> # … with 5 more rows
%>% pivot_wider(
production names_from = c(product, country),
values_from = production,
names_glue = "prod_{product}_{country}"
)#> # A tibble: 15 × 4
#> year prod_A_AI prod_B_AI prod_B_EI
#> <int> <dbl> <dbl> <dbl>
#> 1 2000 -1.76 -1.73 0.991
#> 2 2001 -0.926 0.226 -0.0261
#> 3 2002 0.198 -2.01 1.09
#> 4 2003 -1.61 0.409 -1.23
#> 5 2004 2.20 -0.502 1.96
#> 6 2005 0.168 -0.325 -0.0227
#> 7 2006 0.446 -0.0750 -0.435
#> 8 2007 1.40 -0.838 0.201
#> 9 2008 1.33 0.132 -0.126
#> 10 2009 -1.58 -0.517 -1.22
#> # … with 5 more rows
The us_rent_income
dataset contains information about
median income and rent for each state in the US for 2017 (from the
American Community Survey, retrieved with the tidycensus package).
us_rent_income#> # A tibble: 104 × 5
#> GEOID NAME variable estimate moe
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 01 Alabama income 24476 136
#> 2 01 Alabama rent 747 3
#> 3 02 Alaska income 32940 508
#> 4 02 Alaska rent 1200 13
#> 5 04 Arizona income 27517 148
#> 6 04 Arizona rent 972 4
#> 7 05 Arkansas income 23789 165
#> 8 05 Arkansas rent 709 5
#> 9 06 California income 29454 109
#> 10 06 California rent 1358 3
#> # … with 94 more rows
Here both estimate
and moe
are values
columns, so we can supply them to values_from
:
%>%
us_rent_income pivot_wider(names_from = variable, values_from = c(estimate, moe))
#> # A tibble: 52 × 6
#> GEOID NAME estimate_income estimate_rent moe_income moe_rent
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> 7 09 Connecticut 35326 1123 195 5
#> 8 10 Delaware 31560 1076 247 10
#> 9 11 District of Columbia 43198 1424 681 17
#> 10 12 Florida 25952 1077 70 3
#> # … with 42 more rows
Note that the name of the variable is automatically appended to the output columns.
Occasionally, you’ll come across data where your names variable is encoded as a factor, but not all of the data will be represented.
<- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
weekdays
<- tibble(
daily day = factor(c("Tue", "Thu", "Fri", "Mon"), levels = weekdays),
value = c(2, 3, 1, 5)
)
daily#> # A tibble: 4 × 2
#> day value
#> <fct> <dbl>
#> 1 Tue 2
#> 2 Thu 3
#> 3 Fri 1
#> 4 Mon 5
pivot_wider()
defaults to generating columns from the
values that are actually represented in the data, but you might want to
include a column for each possible level in case the data changes in the
future.
pivot_wider(daily, names_from = day, values_from = value)
#> # A tibble: 1 × 4
#> Tue Thu Fri Mon
#> <dbl> <dbl> <dbl> <dbl>
#> 1 2 3 1 5
The names_expand
argument will turn implicit factor
levels into explicit ones, forcing them to be represented in the result.
It also sorts the column names using the level order, which produces
more intuitive results in this case.
pivot_wider(daily, names_from = day, values_from = value, names_expand = TRUE)
#> # A tibble: 1 × 7
#> Mon Tue Wed Thu Fri Sat Sun
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 2 NA 3 1 NA NA
If multiple names_from
columns are provided,
names_expand
will generate a Cartesian product of all
possible combinations of the names_from
values. Notice that
the following data has omitted some rows where the percentage value
would be 0
. names_expand
allows us to make
those explicit during the pivot.
<- tibble(
percentages year = c(2018, 2019, 2020, 2020),
type = factor(c("A", "B", "A", "B"), levels = c("A", "B")),
percentage = c(100, 100, 40, 60)
)
percentages#> # A tibble: 4 × 3
#> year type percentage
#> <dbl> <fct> <dbl>
#> 1 2018 A 100
#> 2 2019 B 100
#> 3 2020 A 40
#> 4 2020 B 60
pivot_wider(
percentages,names_from = c(year, type),
values_from = percentage,
names_expand = TRUE,
values_fill = 0
)#> # A tibble: 1 × 6
#> `2018_A` `2018_B` `2019_A` `2019_B` `2020_A` `2020_B`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 100 0 0 100 40 60
A related problem can occur when there are implicit missing factor
levels or combinations in the id_cols
. In this case, there
are missing rows (rather than columns) that you’d like to explicitly
represent. For this example, we’ll modify our daily
data
with a type
column, and pivot on that instead, keeping
day
as an id column.
<- mutate(daily, type = factor(c("A", "B", "B", "A")))
daily
daily#> # A tibble: 4 × 3
#> day value type
#> <fct> <dbl> <fct>
#> 1 Tue 2 A
#> 2 Thu 3 B
#> 3 Fri 1 B
#> 4 Mon 5 A
All of our type
levels are represented in the columns,
but we are missing some rows related to the unrepresented
day
factor levels.
pivot_wider(
daily, names_from = type,
values_from = value,
values_fill = 0
)#> # A tibble: 4 × 3
#> day A B
#> <fct> <dbl> <dbl>
#> 1 Tue 2 0
#> 2 Thu 0 3
#> 3 Fri 0 1
#> 4 Mon 5 0
We can use id_expand
in the same way that we used
names_expand
, which will expand out (and sort) the implicit
missing rows in the id_cols
.
pivot_wider(
daily, names_from = type,
values_from = value,
values_fill = 0,
id_expand = TRUE
)#> # A tibble: 7 × 3
#> day A B
#> <fct> <dbl> <dbl>
#> 1 Mon 5 0
#> 2 Tue 2 0
#> 3 Wed 0 0
#> 4 Thu 0 3
#> 5 Fri 0 1
#> 6 Sat 0 0
#> 7 Sun 0 0
Imagine you’ve found yourself in a situation where you have columns
in your data that are completely unrelated to the pivoting process, but
you’d still like to retain their information somehow. For example, in
updates
we’d like to pivot on the system
column to create one row summaries of each county’s system updates.
<- tibble(
updates county = c("Wake", "Wake", "Wake", "Guilford", "Guilford"),
date = c(as.Date("2020-01-01") + 0:2, as.Date("2020-01-03") + 0:1),
system = c("A", "B", "C", "A", "C"),
value = c(3.2, 4, 5.5, 2, 1.2)
)
updates#> # A tibble: 5 × 4
#> county date system value
#> <chr> <date> <chr> <dbl>
#> 1 Wake 2020-01-01 A 3.2
#> 2 Wake 2020-01-02 B 4
#> 3 Wake 2020-01-03 C 5.5
#> 4 Guilford 2020-01-03 A 2
#> 5 Guilford 2020-01-04 C 1.2
We could do that with a typical pivot_wider()
call, but
we completely lose all information about the date
column.
pivot_wider(
updates, id_cols = county,
names_from = system,
values_from = value
)#> # A tibble: 2 × 4
#> county A B C
#> <chr> <dbl> <dbl> <dbl>
#> 1 Wake 3.2 4 5.5
#> 2 Guilford 2 NA 1.2
For this example, we’d like to retain the most recent update date
across all systems in a particular county. To accomplish that we can use
the unused_fn
argument, which allows us to summarize values
from the columns not utilized in the pivoting process.
pivot_wider(
updates, id_cols = county,
names_from = system,
values_from = value,
unused_fn = list(date = max)
)#> # A tibble: 2 × 5
#> county A B C date
#> <chr> <dbl> <dbl> <dbl> <date>
#> 1 Wake 3.2 4 5.5 2020-01-03
#> 2 Guilford 2 NA 1.2 2020-01-04
You can also retain the data but delay the aggregation entirely by
using list()
as the summary function.
pivot_wider(
updates, id_cols = county,
names_from = system,
values_from = value,
unused_fn = list(date = list)
)#> # A tibble: 2 × 5
#> county A B C date
#> <chr> <dbl> <dbl> <dbl> <list>
#> 1 Wake 3.2 4 5.5 <date [3]>
#> 2 Guilford 2 NA 1.2 <date [2]>
A final challenge is inspired by Jiena Gu. Imagine you have a contact list that you’ve copied and pasted from a website:
<- tribble(
contacts ~field, ~value,
"name", "Jiena McLellan",
"company", "Toyota",
"name", "John Smith",
"company", "google",
"email", "john@google.com",
"name", "Huxley Ratcliffe"
)
This is challenging because there’s no variable that identifies which
observations belong together. We can fix this by noting that every
contact starts with a name, so we can create a unique id by counting
every time we see “name” as the field
:
<- contacts %>%
contacts mutate(
person_id = cumsum(field == "name")
)
contacts#> # A tibble: 6 × 3
#> field value person_id
#> <chr> <chr> <int>
#> 1 name Jiena McLellan 1
#> 2 company Toyota 1
#> 3 name John Smith 2
#> 4 company google 2
#> 5 email john@google.com 2
#> 6 name Huxley Ratcliffe 3
Now that we have a unique identifier for each person, we can pivot
field
and value
into the columns:
%>%
contacts pivot_wider(names_from = field, values_from = value)
#> # A tibble: 3 × 4
#> person_id name company email
#> <int> <chr> <chr> <chr>
#> 1 1 Jiena McLellan Toyota <NA>
#> 2 2 John Smith google john@google.com
#> 3 3 Huxley Ratcliffe <NA> <NA>
Some problems can’t be solved by pivotting in a single direction. The
examples in this section show how you might combine
pivot_longer()
and pivot_wider()
to solve more
complex problems.
world_bank_pop
contains data from the World Bank about
population per country from 2000 to 2018.
world_bank_pop#> # A tibble: 1,056 × 20
#> country indic…¹ `2000` `2001` `2002` `2003` `2004` `2005` `2006` `2007`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ABW SP.URB… 4.24e4 4.30e4 4.37e4 4.42e4 4.47e+4 4.49e+4 4.49e+4 4.47e+4
#> 2 ABW SP.URB… 1.18e0 1.41e0 1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2 -4.35e-1
#> 3 ABW SP.POP… 9.09e4 9.29e4 9.50e4 9.70e4 9.87e+4 1.00e+5 1.01e+5 1.01e+5
#> 4 ABW SP.POP… 2.06e0 2.23e0 2.23e0 2.11e0 1.76e+0 1.30e+0 7.98e-1 3.84e-1
#> 5 AFG SP.URB… 4.44e6 4.65e6 4.89e6 5.16e6 5.43e+6 5.69e+6 5.93e+6 6.15e+6
#> 6 AFG SP.URB… 3.91e0 4.66e0 5.13e0 5.23e0 5.12e+0 4.77e+0 4.12e+0 3.65e+0
#> 7 AFG SP.POP… 2.01e7 2.10e7 2.20e7 2.31e7 2.41e+7 2.51e+7 2.59e+7 2.66e+7
#> 8 AFG SP.POP… 3.49e0 4.25e0 4.72e0 4.82e0 4.47e+0 3.87e+0 3.23e+0 2.76e+0
#> 9 AGO SP.URB… 8.23e6 8.71e6 9.22e6 9.77e6 1.03e+7 1.09e+7 1.15e+7 1.21e+7
#> 10 AGO SP.URB… 5.44e0 5.59e0 5.70e0 5.76e0 5.75e+0 5.69e+0 4.92e+0 4.89e+0
#> # … with 1,046 more rows, 10 more variables: `2008` <dbl>, `2009` <dbl>,
#> # `2010` <dbl>, `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>,
#> # `2015` <dbl>, `2016` <dbl>, `2017` <dbl>, and abbreviated variable name
#> # ¹indicator
My goal is to produce a tidy dataset where each variable is in a column. It’s not obvious exactly what steps are needed yet, but I’ll start with the most obvious problem: year is spread across multiple columns.
<- world_bank_pop %>%
pop2 pivot_longer(`2000`:`2017`, names_to = "year", values_to = "value")
pop2#> # A tibble: 19,008 × 4
#> country indicator year value
#> <chr> <chr> <chr> <dbl>
#> 1 ABW SP.URB.TOTL 2000 42444
#> 2 ABW SP.URB.TOTL 2001 43048
#> 3 ABW SP.URB.TOTL 2002 43670
#> 4 ABW SP.URB.TOTL 2003 44246
#> 5 ABW SP.URB.TOTL 2004 44669
#> 6 ABW SP.URB.TOTL 2005 44889
#> 7 ABW SP.URB.TOTL 2006 44881
#> 8 ABW SP.URB.TOTL 2007 44686
#> 9 ABW SP.URB.TOTL 2008 44375
#> 10 ABW SP.URB.TOTL 2009 44052
#> # … with 18,998 more rows
Next we need to consider the indicator
variable:
%>% count(indicator)
pop2 #> # A tibble: 4 × 2
#> indicator n
#> <chr> <int>
#> 1 SP.POP.GROW 4752
#> 2 SP.POP.TOTL 4752
#> 3 SP.URB.GROW 4752
#> 4 SP.URB.TOTL 4752
Here SP.POP.GROW
is population growth,
SP.POP.TOTL
is total population, and SP.URB.*
are the same but only for urban areas. Let’s split this up into two
variables: area
(total or urban) and the actual variable
(population or growth):
<- pop2 %>%
pop3 separate(indicator, c(NA, "area", "variable"))
pop3#> # A tibble: 19,008 × 5
#> country area variable year value
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 ABW URB TOTL 2000 42444
#> 2 ABW URB TOTL 2001 43048
#> 3 ABW URB TOTL 2002 43670
#> 4 ABW URB TOTL 2003 44246
#> 5 ABW URB TOTL 2004 44669
#> 6 ABW URB TOTL 2005 44889
#> 7 ABW URB TOTL 2006 44881
#> 8 ABW URB TOTL 2007 44686
#> 9 ABW URB TOTL 2008 44375
#> 10 ABW URB TOTL 2009 44052
#> # … with 18,998 more rows
Now we can complete the tidying by pivoting variable
and
value
to make TOTL
and GROW
columns:
%>%
pop3 pivot_wider(names_from = variable, values_from = value)
#> # A tibble: 9,504 × 5
#> country area year TOTL GROW
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 ABW URB 2000 42444 1.18
#> 2 ABW URB 2001 43048 1.41
#> 3 ABW URB 2002 43670 1.43
#> 4 ABW URB 2003 44246 1.31
#> 5 ABW URB 2004 44669 0.951
#> 6 ABW URB 2005 44889 0.491
#> 7 ABW URB 2006 44881 -0.0178
#> 8 ABW URB 2007 44686 -0.435
#> 9 ABW URB 2008 44375 -0.698
#> 10 ABW URB 2009 44052 -0.731
#> # … with 9,494 more rows
Based on a suggestion by Maxime Wack, https://github.com/tidyverse/tidyr/issues/384), the final example shows how to deal with a common way of recording multiple choice data. Often you will get such data as follows:
<- tribble(
multi ~id, ~choice1, ~choice2, ~choice3,
1, "A", "B", "C",
2, "C", "B", NA,
3, "D", NA, NA,
4, "B", "D", NA
)
But the actual order isn’t important, and you’d prefer to have the
individual questions in the columns. You can achieve the desired
transformation in two steps. First, you make the data longer,
eliminating the explicit NA
s, and adding a column to
indicate that this choice was chosen:
<- multi %>%
multi2 pivot_longer(!id, values_drop_na = TRUE) %>%
mutate(checked = TRUE)
multi2#> # A tibble: 8 × 4
#> id name value checked
#> <dbl> <chr> <chr> <lgl>
#> 1 1 choice1 A TRUE
#> 2 1 choice2 B TRUE
#> 3 1 choice3 C TRUE
#> 4 2 choice1 C TRUE
#> 5 2 choice2 B TRUE
#> 6 3 choice1 D TRUE
#> 7 4 choice1 B TRUE
#> 8 4 choice2 D TRUE
Then you make the data wider, filling in the missing observations
with FALSE
:
%>%
multi2 pivot_wider(
id_cols = id,
names_from = value,
values_from = checked,
values_fill = FALSE
)#> # A tibble: 4 × 5
#> id A B C D
#> <dbl> <lgl> <lgl> <lgl> <lgl>
#> 1 1 TRUE TRUE TRUE FALSE
#> 2 2 FALSE TRUE TRUE FALSE
#> 3 3 FALSE FALSE FALSE TRUE
#> 4 4 FALSE TRUE FALSE TRUE
The arguments to pivot_longer()
and
pivot_wider()
allow you to pivot a wide range of datasets.
But the creativity that people apply to their data structures is
seemingly endless, so it’s quite possible that you will encounter a
dataset that you can’t immediately see how to reshape with
pivot_longer()
and pivot_wider()
. To gain more
control over pivotting, you can instead create a “spec” data frame that
describes exactly how data stored in the column names becomes variables
(and vice versa). This section introduces you to the spec data
structure, and show you how to use it when pivot_longer()
and pivot_wider()
are insufficient.
To see how this works, lets return to the simplest case of pivotting
applied to the relig_income
dataset. Now pivotting happens
in two steps: we first create a spec object (using
build_longer_spec()
) then use that to describe the
pivotting operation:
<- relig_income %>% build_longer_spec(
spec cols = !religion,
names_to = "income",
values_to = "count"
)pivot_longer_spec(relig_income, spec)
#> # A tibble: 180 × 3
#> religion income count
#> <chr> <chr> <dbl>
#> 1 Agnostic <$10k 27
#> 2 Agnostic $10-20k 34
#> 3 Agnostic $20-30k 60
#> 4 Agnostic $30-40k 81
#> 5 Agnostic $40-50k 76
#> 6 Agnostic $50-75k 137
#> 7 Agnostic $75-100k 122
#> 8 Agnostic $100-150k 109
#> 9 Agnostic >150k 84
#> 10 Agnostic Don't know/refused 96
#> # … with 170 more rows
(This gives the same result as before, just with more code. There’s
no need to use it here, it is presented as a simple example for using
spec
.)
What does spec
look like? It’s a data frame with one row
for each column in the wide format version of the data that is not
present in the long format, and two special columns that start with
.
:
.name
gives the name of the column..value
gives the name of the column that the values in
the cells will go into.There is also one column in spec
for each column present
in the long format of the data that is not present in the wide format of
the data. This corresponds to the names_to
argument in
pivot_longer()
and build_longer_spec()
and the
names_from
argument in pivot_wider()
and
build_wider_spec()
. In this example, the income column is a
character vector of the names of columns being pivoted.
spec#> # A tibble: 10 × 3
#> .name .value income
#> <chr> <chr> <chr>
#> 1 <$10k count <$10k
#> 2 $10-20k count $10-20k
#> 3 $20-30k count $20-30k
#> 4 $30-40k count $30-40k
#> 5 $40-50k count $40-50k
#> 6 $50-75k count $50-75k
#> 7 $75-100k count $75-100k
#> 8 $100-150k count $100-150k
#> 9 >150k count >150k
#> 10 Don't know/refused count Don't know/refused
Below we widen us_rent_income
with
pivot_wider()
. The result is ok, but I think it could be
improved:
%>%
us_rent_income pivot_wider(names_from = variable, values_from = c(estimate, moe))
#> # A tibble: 52 × 6
#> GEOID NAME estimate_income estimate_rent moe_income moe_rent
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> 7 09 Connecticut 35326 1123 195 5
#> 8 10 Delaware 31560 1076 247 10
#> 9 11 District of Columbia 43198 1424 681 17
#> 10 12 Florida 25952 1077 70 3
#> # … with 42 more rows
I think it would be better to have columns income
,
rent
, income_moe
, and rent_moe
,
which we can achieve with a manual spec. The current spec looks like
this:
<- us_rent_income %>%
spec1 build_wider_spec(names_from = variable, values_from = c(estimate, moe))
spec1#> # A tibble: 4 × 3
#> .name .value variable
#> <chr> <chr> <chr>
#> 1 estimate_income estimate income
#> 2 estimate_rent estimate rent
#> 3 moe_income moe income
#> 4 moe_rent moe rent
For this case, we mutate spec
to carefully construct the
column names:
<- spec1 %>%
spec2 mutate(.name = paste0(variable, ifelse(.value == "moe", "_moe", "")))
spec2#> # A tibble: 4 × 3
#> .name .value variable
#> <chr> <chr> <chr>
#> 1 income estimate income
#> 2 rent estimate rent
#> 3 income_moe moe income
#> 4 rent_moe moe rent
Supplying this spec to pivot_wider()
gives us the result
we’re looking for:
pivot_wider_spec(us_rent_income, spec2)
#> # A tibble: 52 × 6
#> GEOID NAME income rent income_moe rent_moe
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 01 Alabama 24476 747 136 3
#> 2 02 Alaska 32940 1200 508 13
#> 3 04 Arizona 27517 972 148 4
#> 4 05 Arkansas 23789 709 165 5
#> 5 06 California 29454 1358 109 3
#> 6 08 Colorado 32401 1125 109 5
#> 7 09 Connecticut 35326 1123 195 5
#> 8 10 Delaware 31560 1076 247 10
#> 9 11 District of Columbia 43198 1424 681 17
#> 10 12 Florida 25952 1077 70 3
#> # … with 42 more rows
Sometimes it’s not possible (or not convenient) to compute the spec,
and instead it’s more convenient to construct the spec “by hand”. For
example, take this construction
data, which is lightly
modified from Table 5 “completions” found at https://www.census.gov/construction/nrc/index.html:
construction#> # A tibble: 9 × 9
#> Year Month `1 unit` `2 to 4 units` 5 units …¹ North…² Midwest South West
#> <dbl> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018 January 859 NA 348 114 169 596 339
#> 2 2018 February 882 NA 400 138 160 655 336
#> 3 2018 March 862 NA 356 150 154 595 330
#> 4 2018 April 797 NA 447 144 196 613 304
#> 5 2018 May 875 NA 364 90 169 673 319
#> 6 2018 June 867 NA 342 76 170 610 360
#> 7 2018 July 829 NA 360 108 183 594 310
#> 8 2018 August 939 NA 286 90 205 649 286
#> 9 2018 September 835 NA 304 117 175 560 296
#> # … with abbreviated variable names ¹`5 units or more`, ²Northeast
This sort of data is not uncommon from government agencies: the column names actually belong to different variables, and here we have summaries for number of units (1, 2-4, 5+) and regions of the country (NE, NW, midwest, S, W). We can most easily describe that with a tibble:
<- tribble(
spec ~.name, ~.value, ~units, ~region,
"1 unit", "n", "1", NA,
"2 to 4 units", "n", "2-4", NA,
"5 units or more", "n", "5+", NA,
"Northeast", "n", NA, "Northeast",
"Midwest", "n", NA, "Midwest",
"South", "n", NA, "South",
"West", "n", NA, "West",
)
Which yields the following longer form:
pivot_longer_spec(construction, spec)
#> # A tibble: 63 × 5
#> Year Month units region n
#> <dbl> <chr> <chr> <chr> <dbl>
#> 1 2018 January 1 <NA> 859
#> 2 2018 January 2-4 <NA> NA
#> 3 2018 January 5+ <NA> 348
#> 4 2018 January <NA> Northeast 114
#> 5 2018 January <NA> Midwest 169
#> 6 2018 January <NA> South 596
#> 7 2018 January <NA> West 339
#> 8 2018 February 1 <NA> 882
#> 9 2018 February 2-4 <NA> NA
#> 10 2018 February 5+ <NA> 400
#> # … with 53 more rows
Note that there is no overlap between the units
and
region
variables; here the data would really be most
naturally described in two independent tables.
One neat property of the spec
is that you need the same
spec for pivot_longer()
and pivot_wider()
.
This makes it very clear that the two operations are symmetric:
%>%
construction pivot_longer_spec(spec) %>%
pivot_wider_spec(spec)
#> # A tibble: 9 × 9
#> Year Month `1 unit` `2 to 4 units` 5 units …¹ North…² Midwest South West
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2018 January 859 NA 348 114 169 596 339
#> 2 2018 February 882 NA 400 138 160 655 336
#> 3 2018 March 862 NA 356 150 154 595 330
#> 4 2018 April 797 NA 447 144 196 613 304
#> 5 2018 May 875 NA 364 90 169 673 319
#> 6 2018 June 867 NA 342 76 170 610 360
#> 7 2018 July 829 NA 360 108 183 594 310
#> 8 2018 August 939 NA 286 90 205 649 286
#> 9 2018 September 835 NA 304 117 175 560 296
#> # … with abbreviated variable names ¹`5 units or more`, ²Northeast
The pivotting spec allows us to be more precise about exactly how
pivot_longer(df, spec = spec)
changes the shape of
df
: it will have nrow(df) * nrow(spec)
rows,
and ncol(df) - nrow(spec) + ncol(spec) - 2
columns.