dplyr verbs are particularly powerful when you apply them to grouped
data frames (grouped_df
objects). This vignette shows
you:
How to group, inspect, and ungroup with group_by()
and friends.
How individual dplyr verbs changes their behaviour when applied to grouped data frame.
How to access data about the “current” group from within a verb.
We’ll start by loading dplyr:
library(dplyr)
group_by()
The most important grouping verb is group_by()
: it takes
a data frame and one or more variables to group by:
<- starwars %>% group_by(species)
by_species <- starwars %>% group_by(sex, gender) by_sex_gender
You can see the grouping when you print the data:
by_species#> # A tibble: 87 × 14
#> # Groups: species [38]
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
#> # … with 83 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
by_sex_gender#> # A tibble: 87 × 14
#> # Groups: sex, gender [6]
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
#> # … with 83 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
Or use tally()
to count the number of rows in each
group. The sort
argument is useful if you want to see the
largest groups up front.
%>% tally()
by_species #> # A tibble: 38 × 2
#> species n
#> <chr> <int>
#> 1 Aleena 1
#> 2 Besalisk 1
#> 3 Cerean 1
#> 4 Chagrian 1
#> # … with 34 more rows
%>% tally(sort = TRUE)
by_sex_gender #> # A tibble: 6 × 3
#> # Groups: sex [5]
#> sex gender n
#> <chr> <chr> <int>
#> 1 male masculine 60
#> 2 female feminine 16
#> 3 none masculine 5
#> 4 <NA> <NA> 4
#> # … with 2 more rows
As well as grouping by existing variables, you can group by any
function of existing variables. This is equivalent to performing a
mutate()
before the
group_by()
:
<- c(0, 18.5, 25, 30, Inf)
bmi_breaks
%>%
starwars group_by(bmi_cat = cut(mass/(height/100)^2, breaks=bmi_breaks)) %>%
tally()
#> # A tibble: 5 × 2
#> bmi_cat n
#> <fct> <int>
#> 1 (0,18.5] 10
#> 2 (18.5,25] 24
#> 3 (25,30] 13
#> 4 (30,Inf] 12
#> # … with 1 more row
You can see underlying group data with group_keys()
. It
has one row for each group and one column for each grouping
variable:
%>% group_keys()
by_species #> # A tibble: 38 × 1
#> species
#> <chr>
#> 1 Aleena
#> 2 Besalisk
#> 3 Cerean
#> 4 Chagrian
#> # … with 34 more rows
%>% group_keys()
by_sex_gender #> # A tibble: 6 × 2
#> sex gender
#> <chr> <chr>
#> 1 female feminine
#> 2 hermaphroditic masculine
#> 3 male masculine
#> 4 none feminine
#> # … with 2 more rows
You can see which group each row belongs to with
group_indices()
:
%>% group_indices()
by_species #> [1] 11 6 6 11 11 11 11 6 11 11 11 11 34 11 24 12 11 11 36 11 11 6 31 11 11
#> [26] 18 11 11 8 26 11 21 11 10 10 10 38 30 7 38 11 37 32 32 33 35 29 11 3 20
#> [51] 37 27 13 23 16 4 11 11 11 9 17 17 11 11 11 11 5 2 15 15 11 1 6 25 19
#> [76] 28 14 34 11 38 22 11 11 11 6 38 11
And which rows each group contains with
group_rows()
:
%>% group_rows() %>% head()
by_species #> <list_of<integer>[6]>
#> [[1]]
#> [1] 72
#>
#> [[2]]
#> [1] 68
#>
#> [[3]]
#> [1] 49
#>
#> [[4]]
#> [1] 56
#>
#> [[5]]
#> [1] 67
#>
#> [[6]]
#> [1] 2 3 8 22 73 85
Use group_vars()
if you just want the names of the
grouping variables:
%>% group_vars()
by_species #> [1] "species"
%>% group_vars()
by_sex_gender #> [1] "sex" "gender"
If you apply group_by()
to an already grouped dataset,
will overwrite the existing grouping variables. For example, the
following code groups by homeworld
instead of
species
:
%>%
by_species group_by(homeworld) %>%
tally()
#> # A tibble: 49 × 2
#> homeworld n
#> <chr> <int>
#> 1 Alderaan 3
#> 2 Aleen Minor 1
#> 3 Bespin 1
#> 4 Bestine IV 1
#> # … with 45 more rows
To augment the grouping, using
.add = TRUE
1. For example, the following code groups by
species and homeworld:
%>%
by_species group_by(homeworld, .add = TRUE) %>%
tally()
#> # A tibble: 58 × 3
#> # Groups: species [38]
#> species homeworld n
#> <chr> <chr> <int>
#> 1 Aleena Aleen Minor 1
#> 2 Besalisk Ojom 1
#> 3 Cerean Cerea 1
#> 4 Chagrian Champala 1
#> # … with 54 more rows
To remove all grouping variables, use ungroup()
:
%>%
by_species ungroup() %>%
tally()
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 87
You can also choose to selectively ungroup by listing the variables you want to remove:
%>%
by_sex_gender ungroup(sex) %>%
tally()
#> # A tibble: 3 × 2
#> gender n
#> <chr> <int>
#> 1 feminine 17
#> 2 masculine 66
#> 3 <NA> 4
The following sections describe how grouping affects the main dplyr verbs.
summarise()
summarise()
computes a summary for each group. This
means that it starts from group_keys()
, adding summary
variables to the right hand side:
%>%
by_species summarise(
n = n(),
height = mean(height, na.rm = TRUE)
)#> # A tibble: 38 × 3
#> species n height
#> <chr> <int> <dbl>
#> 1 Aleena 1 79
#> 2 Besalisk 1 198
#> 3 Cerean 1 198
#> 4 Chagrian 1 196
#> # … with 34 more rows
The .groups=
argument controls the grouping structure of
the output. The historical behaviour of removing the right hand side
grouping variable corresponds to .groups = "drop_last"
without a message or .groups = NULL
with a message (the
default).
%>%
by_sex_gender summarise(n = n()) %>%
group_vars()
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
#> [1] "sex"
%>%
by_sex_gender summarise(n = n(), .groups = "drop_last") %>%
group_vars()
#> [1] "sex"
Since version 1.0.0 the groups may also be kept
(.groups = "keep"
) or dropped
(.groups = "drop"
).
%>%
by_sex_gender summarise(n = n(), .groups = "keep") %>%
group_vars()
#> [1] "sex" "gender"
%>%
by_sex_gender summarise(n = n(), .groups = "drop") %>%
group_vars()
#> character(0)
When the output no longer have grouping variables, it becomes ungrouped (i.e. a regular tibble).
select()
, rename()
, and
relocate()
rename()
and relocate()
behave identically
with grouped and ungrouped data because they only affect the name or
position of existing columns. Grouped select()
is almost
identical to ungrouped select, except that it always includes the
grouping variables:
%>% select(mass)
by_species #> Adding missing grouping variables: `species`
#> # A tibble: 87 × 2
#> # Groups: species [38]
#> species mass
#> <chr> <dbl>
#> 1 Human 77
#> 2 Droid 75
#> 3 Droid 32
#> 4 Human 136
#> # … with 83 more rows
If you don’t want the grouping variables, you’ll have to first
ungroup()
. (This design is possibly a mistake, but we’re
stuck with it for now.)
arrange()
Grouped arrange()
is the same as ungrouped
arrange()
, unless you set .by_group = TRUE
, in
which case it will order first by the grouping variables.
%>%
by_species arrange(desc(mass)) %>%
relocate(species, mass)
#> # A tibble: 87 × 14
#> # Groups: species [38]
#> species mass name height hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender
#> <chr> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Hutt 1358 Jabba Desil… 175 <NA> green-… orange 600 herm… mascu…
#> 2 Kaleesh 159 Grievous 216 none brown,… green,… NA male mascu…
#> 3 Droid 140 IG-88 200 none metal red 15 none mascu…
#> 4 Human 136 Darth Vader 202 none white yellow 41.9 male mascu…
#> # … with 83 more rows, 4 more variables: homeworld <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year
%>%
by_species arrange(desc(mass), .by_group = TRUE) %>%
relocate(species, mass)
#> # A tibble: 87 × 14
#> # Groups: species [38]
#> species mass name height hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender
#> <chr> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Aleena 15 Ratts Tyer… 79 none grey, … unknown NA male mascu…
#> 2 Besalisk 102 Dexter Jet… 198 none brown yellow NA male mascu…
#> 3 Cerean 82 Ki-Adi-Mun… 198 white pale yellow 92 male mascu…
#> 4 Chagrian NA Mas Amedda 196 none blue blue NA male mascu…
#> # … with 83 more rows, 4 more variables: homeworld <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year
Note that second example is sorted by species
(from the
group_by()
statement) and then by mass
(within
species).
mutate()
and transmute()
In simple cases with vectorised functions, grouped and ungrouped
mutate()
give the same results. They differ when used with
summary functions:
# Subtract off global mean
%>%
starwars select(name, homeworld, mass) %>%
mutate(standard_mass = mass - mean(mass, na.rm = TRUE))
#> # A tibble: 87 × 4
#> name homeworld mass standard_mass
#> <chr> <chr> <dbl> <dbl>
#> 1 Luke Skywalker Tatooine 77 -20.3
#> 2 C-3PO Tatooine 75 -22.3
#> 3 R2-D2 Naboo 32 -65.3
#> 4 Darth Vader Tatooine 136 38.7
#> # … with 83 more rows
# Subtract off homeworld mean
%>%
starwars select(name, homeworld, mass) %>%
group_by(homeworld) %>%
mutate(standard_mass = mass - mean(mass, na.rm = TRUE))
#> # A tibble: 87 × 4
#> # Groups: homeworld [49]
#> name homeworld mass standard_mass
#> <chr> <chr> <dbl> <dbl>
#> 1 Luke Skywalker Tatooine 77 -8.38
#> 2 C-3PO Tatooine 75 -10.4
#> 3 R2-D2 Naboo 32 -32.2
#> 4 Darth Vader Tatooine 136 50.6
#> # … with 83 more rows
Or with window functions like min_rank()
:
# Overall rank
%>%
starwars select(name, homeworld, height) %>%
mutate(rank = min_rank(height))
#> # A tibble: 87 × 4
#> name homeworld height rank
#> <chr> <chr> <int> <int>
#> 1 Luke Skywalker Tatooine 172 29
#> 2 C-3PO Tatooine 167 21
#> 3 R2-D2 Naboo 96 5
#> 4 Darth Vader Tatooine 202 72
#> # … with 83 more rows
# Rank per homeworld
%>%
starwars select(name, homeworld, height) %>%
group_by(homeworld) %>%
mutate(rank = min_rank(height))
#> # A tibble: 87 × 4
#> # Groups: homeworld [49]
#> name homeworld height rank
#> <chr> <chr> <int> <int>
#> 1 Luke Skywalker Tatooine 172 5
#> 2 C-3PO Tatooine 167 4
#> 3 R2-D2 Naboo 96 1
#> 4 Darth Vader Tatooine 202 10
#> # … with 83 more rows
filter()
A grouped filter()
effectively does a
mutate()
to generate a logical variable, and then only
keeps the rows where the variable is TRUE
. This means that
grouped filters can be used with summary functions. For example, we can
find the tallest character of each species:
%>%
by_species select(name, species, height) %>%
filter(height == max(height))
#> # A tibble: 35 × 3
#> # Groups: species [35]
#> name species height
#> <chr> <chr> <int>
#> 1 Greedo Rodian 173
#> 2 Jabba Desilijic Tiure Hutt 175
#> 3 Yoda Yoda's species 66
#> 4 Bossk Trandoshan 190
#> # … with 31 more rows
You can also use filter()
to remove entire groups. For
example, the following code eliminates all groups that only have a
single member:
%>%
by_species filter(n() != 1) %>%
tally()
#> # A tibble: 9 × 2
#> species n
#> <chr> <int>
#> 1 Droid 6
#> 2 Gungan 3
#> 3 Human 35
#> 4 Kaminoan 2
#> # … with 5 more rows
slice()
and friendsslice()
and friends (slice_head()
,
slice_tail()
, slice_sample()
,
slice_min()
and slice_max()
) select rows
within a group. For example, we can select the first observation within
each species:
%>%
by_species relocate(species) %>%
slice(1)
#> # A tibble: 38 × 14
#> # Groups: species [38]
#> species name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender
#> <chr> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
#> 1 Aleena Ratts Tyer… 79 15 none grey, … unknown NA male mascu…
#> 2 Besalisk Dexter Jet… 198 102 none brown yellow NA male mascu…
#> 3 Cerean Ki-Adi-Mun… 198 82 white pale yellow 92 male mascu…
#> 4 Chagrian Mas Amedda 196 NA none blue blue NA male mascu…
#> # … with 34 more rows, 4 more variables: homeworld <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year
Similarly, we can use slice_min()
to select the smallest
n
values of a variable:
%>%
by_species filter(!is.na(height)) %>%
slice_min(height, n = 2)
#> # A tibble: 48 × 14
#> # Groups: species [38]
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Ratts Tyere… 79 15 none grey, … unknown NA male mascu… Aleen …
#> 2 Dexter Jett… 198 102 none brown yellow NA male mascu… Ojom
#> 3 Ki-Adi-Mundi 198 82 white pale yellow 92 male mascu… Cerea
#> 4 Mas Amedda 196 NA none blue blue NA male mascu… Champa…
#> # … with 44 more rows, 4 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, and abbreviated variable names
#> # ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
Inside dplyr verbs, you can access various properties of the
“current” group using a family of functions with the cur_
prefix. These functions are typically needed for everyday usage of
dplyr, but can be useful because they allow you to free from some of the
typical constraints of dplyr verbs.
cur_data()
cur_data()
returns the current group, excluding grouping
variables. It’s useful to feed to functions that take a whole data
frame. For example, the following code fits a linear model of
mass ~ height
to each species:
%>%
by_species filter(n() > 1) %>%
mutate(mod = list(lm(mass ~ height, data = cur_data())))
#> # A tibble: 58 × 15
#> # Groups: species [9]
#> name height mass hair_…¹ skin_…² eye_c…³ birth…⁴ sex gender homew…⁵
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
#> 1 Luke Skywal… 172 77 blond fair blue 19 male mascu… Tatooi…
#> 2 C-3PO 167 75 <NA> gold yellow 112 none mascu… Tatooi…
#> 3 R2-D2 96 32 <NA> white,… red 33 none mascu… Naboo
#> 4 Darth Vader 202 136 none white yellow 41.9 male mascu… Tatooi…
#> # … with 54 more rows, 5 more variables: species <chr>, films <list>,
#> # vehicles <list>, starships <list>, mod <list>, and abbreviated variable
#> # names ¹hair_color, ²skin_color, ³eye_color, ⁴birth_year, ⁵homeworld
cur_group()
and cur_group_id()
cur_group_id()
gives a unique numeric identifier for the
current group. This is sometimes useful if you want to index into an
external data structure.
%>%
by_species arrange(species) %>%
select(name, species, homeworld) %>%
mutate(id = cur_group_id())
#> # A tibble: 87 × 4
#> # Groups: species [38]
#> name species homeworld id
#> <chr> <chr> <chr> <int>
#> 1 Ratts Tyerell Aleena Aleen Minor 1
#> 2 Dexter Jettster Besalisk Ojom 2
#> 3 Ki-Adi-Mundi Cerean Cerea 3
#> 4 Mas Amedda Chagrian Champala 4
#> # … with 83 more rows
Note that the argument changed from
add = TRUE
to .add = TRUE
in dplyr 1.0.0.↩︎