The groupr package is designed to make certain forms of data manipulation easier by representing the underlying data in richer ways. In particular, the standard grouping function dplyr::group_by is extended to include groups that can be marked “inapplicable” at certain values of the grouping variable. The hope is that code that can recognize these kinds of groups will be simpler to write and easier to understand. The package also provides functions for some tasks, like pivoting, that are especially well suited to this idea.
In dplyr, groups are denoted with a grouping column that holds a distinct value for each group. For example, we can group mtcars by the variable vs:
library(dplyr, warn.conflicts = FALSE)
group_by(mtcars, vs)
#> # A tibble: 32 x 11
#> # Groups: vs [2]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
The result is a dataset with two groups defined by vs == 1 and vs == 0.
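As a quick check on the grouping above, the group keys can be inspected directly. This is a minimal sketch using only dplyr's own helpers (group_keys and n_groups); the object names grouped and keys are chosen here for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Group mtcars by vs, as in the example above
grouped <- group_by(mtcars, vs)

# One row per group, holding the distinct values of the grouping column
keys <- group_keys(grouped)

# Number of groups defined by vs
n_groups(grouped)
```

Each row of keys corresponds to one group, so a grouping on vs yields one row for vs == 0 and one for vs == 1.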
In groupr, we can optionally mark one of the two groups as inapplicable:
#> # A tibble: 32 x 11
#> # Row indices: vs (I: 1) [2]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
What this means depends on what comes next. Here are a few possibilities:
These different meanings will be clear in the context of actual data cleaning operations.
Pivoting can be thought of as a simple rearrangement of groups. Consider the iris dataset:
as_tibble(iris)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
We could pivot to “longer” format by collapsing the different measurements into a single column. Equivalently, we can consider the columns Sepal.Length, Sepal.Width, ... to describe groups of data, not distinct variables. In other words, the four columns together form one collection of data, and each column is a subgroup of that collection.
To pivot, we transfer this “column grouping” to the standard dplyr “row grouping.” To do this we just take our groups out of the different columns and merge them into one. The result is a consolidated column of data (value), along with a standard (row) grouping variable (type).
iris2 <- group_by2(iris, Species) %>%
  colgrp("value", "type")
pivot_grps(iris2, rows = "type")
#> # A tibble: 600 x 3
#> # Row indices: Species, type [12]
#> Species type value
#> <fct> <chr> <dbl>
#> 1 setosa Sepal.Length 5.1
#> 2 setosa Sepal.Length 4.9
#> 3 setosa Sepal.Length 4.7
#> 4 setosa Sepal.Length 4.6
#> 5 setosa Sepal.Length 5
#> 6 setosa Sepal.Length 5.4
#> 7 setosa Sepal.Length 4.6
#> 8 setosa Sepal.Length 5
#> 9 setosa Sepal.Length 4.4
#> 10 setosa Sepal.Length 4.9
#> # … with 590 more rows
So, pivoting to longer is the same as converting column groupings to row groupings, and pivoting to wider just does the inverse.
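For comparison, the same long layout can be approximated with tidyr and dplyr alone. This is a sketch, not groupr's implementation: it assumes tidyr is installed, reuses the column names type and value from the example above, and introduces the name iris_long for illustration. Row order may differ from the groupr output.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Collapse the four measurement columns into one value column,
# recording the original column name in type
iris_long <- as_tibble(iris) %>%
  pivot_longer(-Species, names_to = "type", values_to = "value") %>%
  # Turn the former column grouping into an ordinary row grouping
  group_by(Species, type)

n_groups(iris_long)
```

Here the row grouping has to be added as a separate step after pivoting, whereas in the groupr version the grouping structure itself drives the pivot.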
Consider this example dataset:
df <- tibble(
grp = c(1, 1, 1, 1, 2),
subgrp = c(1, 2, 3, 4, NA),
val = c(3.1, 2.8, 4.0, 3.8, 10.2)
)
df
#> # A tibble: 5 x 3
#> grp subgrp val
#> <dbl> <dbl> <dbl>
#> 1 1 1 3.1
#> 2 1 2 2.8
#> 3 1 3 4
#> 4 1 4 3.8
#> 5 2 NA 10.2
Imagine we want to convert the row grouping defined by grp into a column grouping. Without inapplicable groups we get this:
regular_df <- group_by2(df, grp, subgrp)
pivot_grps(regular_df, cols = "grp")
#> # A tibble: 5 x 2
#> # Row indices: subgrp [5]
#> # Col index: grp
#> subgrp val$`1` $`2`
#> <dbl> <dbl> <dbl>
#> 1 1 3.1 NA
#> 2 2 2.8 NA
#> 3 3 4 NA
#> 4 4 3.8 NA
#> 5 NA NA 10.2
It looks a bit off. What if we wanted val_2 == 10.2 for all values of subgrp? In other words, what if val = 10.2 describes the entire second group?
This is an example of an operation that is very challenging to write with standard pivoting functions, but trivial with inapplicable groups. Simply group like this before pivoting:
igrp_df <- group_by2(df, grp, subgrp = NA)
pivot_grps(igrp_df, cols = "grp")
#> # A tibble: 4 x 2
#> # Row indices: subgrp [4]
#> # Col index: grp
#> subgrp val$`1` $`2`
#> <dbl> <dbl> <dbl>
#> 1 1 3.1 10.2
#> 2 2 2.8 10.2
#> 3 3 4 10.2
#> 4 4 3.8 10.2
In this case we have a hierarchical grouping: each value of grp may have multiple values, one per subgroup, but it may also have a single value that describes all the subgroups.
Note also how the only difference is in the grouping structure. The operation itself remains concise and easy to understand.
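To make the contrast concrete, here is one way the same table might be built with plain dplyr and tidyr. It is a sketch under stated assumptions: the column names val_1 and val_2 and the intermediate names by_subgrp, whole_grp, and result are illustrative, the output is a flat tibble without groupr's index structure, and cross_join() requires dplyr 1.1.0 or later.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  grp = c(1, 1, 1, 1, 2),
  subgrp = c(1, 2, 3, 4, NA),
  val = c(3.1, 2.8, 4.0, 3.8, 10.2)
)

# Rows that belong to a real subgroup: one column per value of grp
by_subgrp <- df %>%
  filter(!is.na(subgrp)) %>%
  pivot_wider(names_from = grp, values_from = val, names_prefix = "val_")

# Rows whose value describes a whole group (subgrp is NA)
whole_grp <- df %>%
  filter(is.na(subgrp)) %>%
  select(-subgrp) %>%
  pivot_wider(names_from = grp, values_from = val, names_prefix = "val_")

# Repeat the group-level values across every subgroup
result <- cross_join(by_subgrp, whole_grp)
result
```

The data has to be split by hand, pivoted twice, and recombined, which is the bookkeeping that the inapplicable-group marking takes care of.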
It is common to have a calculation to apply to only a subset of the data. For example, if you have group A and group B, you may be interested in calculating a mean for group A but leaving it missing for all the rows in group B. Depending on the calculation, this can be tough to express.
In mtcars, if we want the mean of hp for all rows where vs == 1, the easiest way is something like the following:
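The original example is not reproduced here. A plausible plain-dplyr version, offered as a sketch rather than the author's code, might look like this; the column name mean_hp and the object name result are assumptions made for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Mean of hp computed over the rows where vs == 1 only;
# rows where vs == 0 are left missing
result <- mtcars %>%
  mutate(mean_hp = if_else(vs == 1, mean(hp[vs == 1]), NA_real_))

head(result[, c("hp", "vs", "mean_hp")])
```

The condition vs == 1 has to be written twice, once to pick the rows that receive a value and once to restrict the calculation, and that repetition becomes harder to manage as the calculation grows.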
Mutations are not currently provided in groupr but will be added in the future.