Introduction to groupr

The groupr package is designed to make certain forms of data manipulation easier by representing the underlying data in richer ways. In particular, the standard grouping function dplyr::group_by is extended to include groups that can be marked “inapplicable” at certain values of the grouping variable. The hope is that code that can recognize these kinds of groups will be simpler to write and easier to understand. The package also provides functions for some tasks, like pivoting, that are especially well suited to this idea.

The meaning of inapplicable groups

In dplyr, groups are defined by a grouping column: rows that share the same value in that column belong to the same group. For example, we can group mtcars by the variable vs:

library(dplyr, warn.conflicts = FALSE)
group_by(mtcars, vs)
#> # A tibble: 32 x 11
#> # Groups:   vs [2]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

The result is a dataset with two groups defined by vs == 1 and vs == 0.
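To make the grouping structure explicit, the following sketch inspects it with standard dplyr helpers. It uses only dplyr functions (group_keys, n_groups, count) and is not part of groupr; the name grouped is introduced here for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

grouped <- group_by(mtcars, vs)

# One row per group: the distinct values of the grouping column
group_keys(grouped)

# Number of groups defined by vs
n_groups(grouped)

# Number of rows that fall into each group
count(mtcars, vs)
```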

In groupr, we can optionally mark one of the two groups as inapplicable. In the output below, the header entry (I: 1) shows that the group vs == 1 has been marked:

#> # A tibble:    32 x 11
#> # Row indices: vs (I: 1) [2]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

What this means depends on what comes next. Here are a few possibilities:

The data can be ignored for now, but we might need it later
The data describe a higher level in a grouping hierarchy, so they don’t need a subgroup value
The data should be excluded from the input of a calculation

These different meanings will become clear in the context of actual data cleaning operations.

Data cleaning and representation

Pivoting

As an operation on groups

Pivoting can be thought of as a simple rearrangement of groups. Consider the iris dataset:

as_tibble(iris)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

We could pivot to “longer” format by collapsing the different measurements into a single column. Equivalently, we can consider the columns Sepal.Length, Sepal.Width, ... to describe groups of data, not distinct variables. In other words, the four columns together form one collection of data, and each column is a subgroup of that collection.

To pivot, we transfer this “column grouping” to the standard dplyr “row grouping.” To do this we just take our groups out of the different columns and merge them into one. The result is a consolidated column of data (value), along with a standard (row) grouping variable (type).

library(groupr)

iris2 <- group_by2(iris, Species) %>% 
  colgrp("value", "type")

pivot_grps(iris2, rows = "type")
#> # A tibble:    600 x 3
#> # Row indices: Species, type [12]
#>    Species type         value
#>    <fct>   <chr>        <dbl>
#>  1 setosa  Sepal.Length   5.1
#>  2 setosa  Sepal.Length   4.9
#>  3 setosa  Sepal.Length   4.7
#>  4 setosa  Sepal.Length   4.6
#>  5 setosa  Sepal.Length   5  
#>  6 setosa  Sepal.Length   5.4
#>  7 setosa  Sepal.Length   4.6
#>  8 setosa  Sepal.Length   5  
#>  9 setosa  Sepal.Length   4.4
#> 10 setosa  Sepal.Length   4.9
#> # … with 590 more rows

So, pivoting to longer is the same as converting column groupings to row groupings, and pivoting to wider just does the inverse.
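For comparison, the same reshaping can be sketched with tidyr followed by an ordinary dplyr grouping. Using tidyr is an assumption made here for illustration (it is not otherwise used in this document), and the names iris_long and iris_long_grouped are hypothetical. The row order differs from the pivot_grps output above, but the dimensions and the number of groups should agree.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Collapse the four measurement columns into one `value` column and
# record the name of the source column in `type`.
iris_long <- iris %>%
  as_tibble() %>%
  pivot_longer(cols = -Species, names_to = "type", values_to = "value")

# The former column names now act as an ordinary row grouping.
iris_long_grouped <- group_by(iris_long, Species, type)

dim(iris_long)               # 150 rows x 4 measurement columns = 600 rows, 3 columns
n_groups(iris_long_grouped)  # 3 species x 4 measurement types = 12 groups
```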

Inapplicable group pivoting

Consider this example dataset:

df <- tibble(
  grp = c(1, 1, 1, 1, 2),
  subgrp = c(1, 2, 3, 4, NA),
  val = c(3.1, 2.8, 4.0, 3.8, 10.2)
)

df
#> # A tibble: 5 x 3
#>     grp subgrp   val
#>   <dbl>  <dbl> <dbl>
#> 1     1      1   3.1
#> 2     1      2   2.8
#> 3     1      3   4  
#> 4     1      4   3.8
#> 5     2     NA  10.2

Imagine we want to convert the row grouping defined by grp into a column grouping. Without inapplicable groups we get this:

regular_df <- group_by2(df, grp, subgrp)
pivot_grps(regular_df, cols = "grp")
#> # A tibble:    5 x 2
#> # Row indices: subgrp [5]
#> # Col index:   grp
#>   subgrp val$`1`  $`2`
#>    <dbl>   <dbl> <dbl>
#> 1      1     3.1  NA  
#> 2      2     2.8  NA  
#> 3      3     4    NA  
#> 4      4     3.8  NA  
#> 5     NA    NA    10.2

It looks a bit off: the value 10.2 sits alone in a row with a missing subgrp. What if we wanted the grp == 2 column to equal 10.2 for all values of subgrp? In other words, what if val = 10.2 describes the entire second group?

This is an example of an operation that is very challenging to write with standard pivoting functions, but trivial with inapplicable groups. Simply group like this before pivoting:

igrp_df <- group_by2(df, grp, subgrp = NA)
pivot_grps(igrp_df, cols = "grp")
#> # A tibble:    4 x 2
#> # Row indices: subgrp [4]
#> # Col index:   grp
#>   subgrp val$`1`  $`2`
#>    <dbl>   <dbl> <dbl>
#> 1      1     3.1  10.2
#> 2      2     2.8  10.2
#> 3      3     4    10.2
#> 4      4     3.8  10.2

In this case we have a hierarchical grouping: each value of grp may carry several values, one per subgroup, or a single value that applies to every subgroup.

Note also how the only difference is in the grouping structure. The operation itself remains concise and easy to understand.
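For contrast, here is one way the same result might be reached with dplyr and tidyr alone. This is a sketch under stated assumptions, not how groupr works internally: tidyr, the val_ column prefix, and the intermediate names (detail, whole, whole_expanded, wide) are all introduced here for illustration. It also assumes that a group-level row should be repeated for every subgroup value observed anywhere in the data.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- tibble(
  grp = c(1, 1, 1, 1, 2),
  subgrp = c(1, 2, 3, 4, NA),
  val = c(3.1, 2.8, 4.0, 3.8, 10.2)
)

# Rows that belong to a specific subgroup
detail <- filter(df, !is.na(subgrp))

# Rows that describe a whole group (no subgroup value)
whole <- df %>%
  filter(is.na(subgrp)) %>%
  select(-subgrp)

# Repeat each group-level row once per observed subgroup value
whole_expanded <- whole %>%
  mutate(subgrp = list(unique(detail$subgrp))) %>%
  unnest(subgrp)

# Recombine, then turn grp into columns
wide <- bind_rows(detail, whole_expanded) %>%
  pivot_wider(names_from = grp, values_from = val, names_prefix = "val_")

wide
```

The split, expand, and recombine steps are exactly the bookkeeping that the inapplicable grouping is meant to absorb.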

Later: selecting data for computation

It is common to have a calculation to apply to only a subset of the data. For example, if you have group A and group B, you may be interested in calculating a mean for group A but leaving it missing for all the rows in group B. Depending on the calculation, this can be tough to express.

In mtcars, if we want the mean of hp for all rows where vs == 1, we can mark vs == 0 as inapplicable so that those rows are left out of the calculation. The intended syntax is something like the following:

mtcars %>%
  group_by2(vs = 0) %>%
  mutate(hp_mean_vs1 = mean(hp))

Mutations are not currently provided in groupr but will be added in the future.
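Until then, the same calculation can be written in plain dplyr. The sketch below is one possible workaround rather than the groupr approach; the name result is hypothetical.

```r
library(dplyr, warn.conflicts = FALSE)

result <- mtcars %>%
  mutate(
    # Mean of hp over the vs == 1 rows only; left missing where vs == 0
    hp_mean_vs1 = if_else(vs == 1, mean(hp[vs == 1]), NA_real_)
  )

head(result[, c("vs", "hp", "hp_mean_vs1")])
```

Note that the condition vs == 1 has to be stated twice, once to choose the input rows and once to choose the output rows, which is the kind of repetition the inapplicable grouping is intended to remove.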