Capture melt

This vignette explains how to use functions for “melting” wide data tables, i.e. converting to tall/long data tables:

nc::capture_melt_single is useful when you have a wide data table with lots of columns, all of the same type, that you would like to melt into a single result/output column.
nc::capture_melt_multiple is useful when you have a wide data table with columns of different types that you would like to melt into multiple result/output columns.

Both are useful mainly in the case when you want to use a regular expression to specify both (1) the set of input columns to reshape and (2) some information to extract from those column names. To clarify the discussion we introduce the following three terms:

Reshape columns contain the data which are present in equal amounts in the input and output, but in fewer columns in the output. These are measure.vars in reshape2/data.table.
Copy columns contain data which are copied from a single row of the input into several rows of the output. These are id.vars in reshape2/data.table.
Capture columns are only present in the output, and contain data extracted from the names of the input reshape columns, using a capturing regular expression. As far as I know, tidyr::pivot_longer is the only other data reshaping function which directly supports capture columns.

Capture data.frame column names and melt matching columns

Sometimes you want to melt a “wide” data table which has several distinct pieces of information encoded in each column name. One example is the familiar iris data, which have flower part and measurement dimension encoded in each of four column names:

library(data.table)
(iris.dt <- data.table(i=1:nrow(iris), iris[,1:4], Species=paste(iris$Species)))
#>        i Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
#>   1:   1          5.1         3.5          1.4         0.2    setosa
#>   2:   2          4.9         3.0          1.4         0.2    setosa
#>   3:   3          4.7         3.2          1.3         0.2    setosa
#>   4:   4          4.6         3.1          1.5         0.2    setosa
#>   5:   5          5.0         3.6          1.4         0.2    setosa
#>  ---                                                                
#> 146: 146          6.7         3.0          5.2         2.3 virginica
#> 147: 147          6.3         2.5          5.0         1.9 virginica
#> 148: 148          6.5         3.0          5.2         2.0 virginica
#> 149: 149          6.2         3.4          5.4         2.3 virginica
#> 150: 150          5.9         3.0          5.1         1.8 virginica

Those four reshape column names can be specified via a regex in nc::capture_melt_single. The first argument is the input data table to reshape, and the subsequent arguments are interpreted as a pattern which is passed to nc::capture_first_vec. Any input column names which match the specified regex will be passed as measure.vars to melt:

(iris.tall <- nc::capture_melt_single(
  iris.dt,
  part=".*",
  "[.]",
  dim=".*"))
#>        i   Species  part    dim value
#>   1:   1    setosa Sepal Length   5.1
#>   2:   2    setosa Sepal Length   4.9
#>   3:   3    setosa Sepal Length   4.7
#>   4:   4    setosa Sepal Length   4.6
#>   5:   5    setosa Sepal Length   5.0
#>  ---                                 
#> 596: 146 virginica Petal  Width   2.3
#> 597: 147 virginica Petal  Width   1.9
#> 598: 148 virginica Petal  Width   2.0
#> 599: 149 virginica Petal  Width   2.3
#> 600: 150 virginica Petal  Width   1.8

The output comes from joining the result of nc::capture_first_vec to the result of melt. Note the output has two copy columns (i, Species), one reshape column (value), and two capture columns (part, dim).

After the data have been converted to the tall/long table above, we can do a dcast to get a wider data table:

(iris.part.cols <- dcast(
  iris.tall,
  i + Species + dim ~ part))
#>        i   Species    dim Petal Sepal
#>   1:   1    setosa Length   1.4   5.1
#>   2:   1    setosa  Width   0.2   3.5
#>   3:   2    setosa Length   1.4   4.9
#>   4:   2    setosa  Width   0.2   3.0
#>   5:   3    setosa Length   1.3   4.7
#>  ---                                 
#> 296: 148 virginica  Width   2.0   3.0
#> 297: 149 virginica Length   5.4   6.2
#> 298: 149 virginica  Width   2.3   3.4
#> 299: 150 virginica Length   5.1   5.9
#> 300: 150 virginica  Width   1.8   3.0

Rather than use capture_melt_single and then dcast, we could instead use capture_melt_multiple to get the same result:

nc::capture_melt_multiple(
  iris.dt,
  column=".*",
  "[.]",
  dim=".*")
#>        i   Species    dim Petal Sepal
#>   1:   1    setosa Length   1.4   5.1
#>   2:   2    setosa Length   1.4   4.9
#>   3:   3    setosa Length   1.3   4.7
#>   4:   4    setosa Length   1.5   4.6
#>   5:   5    setosa Length   1.4   5.0
#>  ---                                 
#> 296: 146 virginica  Width   2.3   3.0
#> 297: 147 virginica  Width   1.9   2.5
#> 298: 148 virginica  Width   2.0   3.0
#> 299: 149 virginica  Width   2.3   3.4
#> 300: 150 virginica  Width   1.8   3.0

Note that the reshaped table above contains two copy columns (i, Species), two reshape columns (Sepal, Petal), and one capture columns (dim). These can help us visualize whether or not sepals are bigger than petals:


if(require(ggplot2)){
  ggplot()+
    theme_bw()+
    theme(panel.spacing=grid::unit(0, "lines"))+
    facet_grid(dim ~ Species)+
    coord_equal()+
    geom_abline(slope=1, intercept=0, color="grey")+
    geom_point(aes(
      Petal, Sepal),
      data=iris.part.cols)
}
#> Loading required package: ggplot2

plot of chunk unnamed-chunk-5

It is clear from the plot above that sepals are indeed both longer and wider than petals, on each measured plant.

Another data set where it is useful to do column name pattern matching followed by melting is the World Health Organization data:


if(requireNamespace("tidyr")){
  data(who, package="tidyr")
}else{
  who <- data.frame(id=1, new_sp_m5564=2, newrel_f65=3)
}
#> Loading required namespace: tidyr
names(who)
#>  [1] "country"      "iso2"         "iso3"         "year"         "new_sp_m014" 
#>  [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
#> [11] "new_sp_m65"   "new_sp_f014"  "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
#> [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
#> [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"  
#> [26] "new_sn_f014"  "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
#> [31] "new_sn_f5564" "new_sn_f65"   "new_ep_m014"  "new_ep_m1524" "new_ep_m2534"
#> [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
#> [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
#> [46] "new_ep_f65"   "newrel_m014"  "newrel_m1524" "newrel_m2534" "newrel_m3544"
#> [51] "newrel_m4554" "newrel_m5564" "newrel_m65"   "newrel_f014"  "newrel_f1524"
#> [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

Each column which starts with new has three distinct pieces of information encoded in its name: diagnosis type (e.g. sp or rel), gender (m or f), and age range (e.g. 5564 or 1524). We would like to use a regex to match these column names, then using the matching columns as measure.vars in a melt, then join the two results.

new.diag.gender <- list(
  "new_?",
  diagnosis=".*",
  "_",
  gender=".")
nc::capture_melt_single(who, new.diag.gender, ages=".*")
#>                          country iso2 iso3 year diagnosis gender ages value
#>     1:               Afghanistan   AF  AFG 1997        sp      m  014     0
#>     2:               Afghanistan   AF  AFG 1998        sp      m  014    30
#>     3:               Afghanistan   AF  AFG 1999        sp      m  014     8
#>     4:               Afghanistan   AF  AFG 2000        sp      m  014    52
#>     5:               Afghanistan   AF  AFG 2001        sp      m  014   129
#>    ---                                                                     
#> 76042:                  Viet Nam   VN  VNM 2013       rel      f   65  3110
#> 76043: Wallis and Futuna Islands   WF  WLF 2013       rel      f   65     2
#> 76044:                     Yemen   YE  YEM 2013       rel      f   65   360
#> 76045:                    Zambia   ZM  ZMB 2013       rel      f   65   669
#> 76046:                  Zimbabwe   ZW  ZWE 2013       rel      f   65   725

Note the output includes the new reshape column called value by default, as in melt. The input reshape column names which matched the specified pattern, and there is a new column for each group in that pattern. The following example shows how to rename the value column and use numeric type conversion functions:

years.pattern <- list(new.diag.gender, ages=list(
  min.years="0|[0-9]{2}", as.numeric,
  max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
(who.typed <- nc::capture_melt_single(
  who, years.pattern,
  value.name="count"))
#>                          country iso2 iso3 year diagnosis gender ages min.years
#>     1:               Afghanistan   AF  AFG 1997        sp      m  014         0
#>     2:               Afghanistan   AF  AFG 1998        sp      m  014         0
#>     3:               Afghanistan   AF  AFG 1999        sp      m  014         0
#>     4:               Afghanistan   AF  AFG 2000        sp      m  014         0
#>     5:               Afghanistan   AF  AFG 2001        sp      m  014         0
#>    ---                                                                         
#> 76042:                  Viet Nam   VN  VNM 2013       rel      f   65        65
#> 76043: Wallis and Futuna Islands   WF  WLF 2013       rel      f   65        65
#> 76044:                     Yemen   YE  YEM 2013       rel      f   65        65
#> 76045:                    Zambia   ZM  ZMB 2013       rel      f   65        65
#> 76046:                  Zimbabwe   ZW  ZWE 2013       rel      f   65        65
#>        max.years count
#>     1:        14     0
#>     2:        14    30
#>     3:        14     8
#>     4:        14    52
#>     5:        14   129
#>    ---                
#> 76042:       Inf  3110
#> 76043:       Inf     2
#> 76044:       Inf   360
#> 76045:       Inf   669
#> 76046:       Inf   725
str(who.typed)
#> Classes 'data.table' and 'data.frame':   76046 obs. of  10 variables:
#>  $ country  : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#>  $ iso2     : chr  "AF" "AF" "AF" "AF" ...
#>  $ iso3     : chr  "AFG" "AFG" "AFG" "AFG" ...
#>  $ year     : int  1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
#>  $ diagnosis: chr  "sp" "sp" "sp" "sp" ...
#>  $ gender   : chr  "m" "m" "m" "m" ...
#>  $ ages     : chr  "014" "014" "014" "014" ...
#>  $ min.years: num  0 0 0 0 0 0 0 0 0 0 ...
#>  $ max.years: num  14 14 14 14 14 14 14 14 14 14 ...
#>  $ count    : int  0 30 8 52 129 90 127 139 151 193 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

The result above shows that

non-character captured output columns can be obtained by specifying type conversion functions in the pattern.
value.name is used as the column name for the reshaped data in the output (as in melt.data.table).

In conclusion nc::capture_melt_single does data reshaping from wide to tall/long which (1) makes it easy to define complex patterns (2) supports type conversion without a post-processing step, and (3) reduces repetition in user code.

Melting columns of different types into multiple result columns

In the previous sections we discussed methods for melting wide data columns that all have the same type. Sometimes wide data have columns of different types that we want to melt. The example below is taken from a data.table vignette,

family.dt <- fread(text="
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1         30 1998-11-26 2000-01-29         NA             1             2            NA
2         27 1996-06-22         NA         NA             2            NA            NA
3         26 2002-07-11 2004-04-05 2007-09-02             2             2             1
4         32 2004-10-10 2009-08-27 2012-07-21             1             1             1
5         29 2000-12-05 2005-02-28         NA             2             1            NA")

The data table above has one row for each family, and two columns for each child. One column is the date of birth (character), the other is the gender (coded as an integer). We can use standard data.table functions to melt these data into a tall/long table with one row for each child:

melt(family.dt, measure.vars=patterns(
  dob="^dob", gender="^gender"
))
#>     family_id age_mother variable        dob gender
#>  1:         1         30        1 1998-11-26      1
#>  2:         2         27        1 1996-06-22      2
#>  3:         3         26        1 2002-07-11      2
#>  4:         4         32        1 2004-10-10      1
#>  5:         5         29        1 2000-12-05      2
#>  6:         1         30        2 2000-01-29      2
#>  7:         2         27        2       <NA>     NA
#>  8:         3         26        2 2004-04-05      2
#>  9:         4         32        2 2009-08-27      1
#> 10:         5         29        2 2005-02-28      1
#> 11:         1         30        3       <NA>     NA
#> 12:         2         27        3       <NA>     NA
#> 13:         3         26        3 2007-09-02      1
#> 14:         4         32        3 2012-07-21      1
#> 15:         5         29        3       <NA>     NA

The melt above results in a table with one row for each child, and one column for each argument of patterns. However the code is somewhat repetitive, as dob and gender must be repeated. To avoid this repetition we can instead use:

(children.dt <- nc::capture_melt_multiple(
  family.dt,
  column=".*",
  "_",
  nc::field("child", "", "[1-3]", as.integer),
  na.rm=TRUE))
#>     family_id age_mother child        dob gender
#>  1:         1         30     1 1998-11-26      1
#>  2:         2         27     1 1996-06-22      2
#>  3:         3         26     1 2002-07-11      2
#>  4:         4         32     1 2004-10-10      1
#>  5:         5         29     1 2000-12-05      2
#>  6:         1         30     2 2000-01-29      2
#>  7:         3         26     2 2004-04-05      2
#>  8:         4         32     2 2009-08-27      1
#>  9:         5         29     2 2005-02-28      1
#> 10:         3         26     3 2007-09-02      1
#> 11:         4         32     3 2012-07-21      1
str(children.dt)
#> Classes 'data.table' and 'data.frame':   11 obs. of  5 variables:
#>  $ family_id : int  1 2 3 4 5 1 3 4 5 3 ...
#>  $ age_mother: int  30 27 26 32 29 30 26 32 29 26 ...
#>  $ child     : int  1 1 1 1 1 2 2 2 2 3 ...
#>  $ dob       : IDate, format: "1998-11-26" "1996-06-22" ...
#>  $ gender    : int  1 2 2 1 2 2 2 1 1 1 ...
#>  - attr(*, ".internal.selfref")=<externalptr>

The result above is similar to the previous result (and in fact melt.data.table is used internally), but the code is less repetitive. The first argument of capture_melt_multiple is the subject data table and the following arguments form a pattern which is matched to the input data column names. The pattern must have at least two groups:

One group must be named column. This group is used to capture the part of the input column names which will be used for the output column names. The output will contain one column for each unique value captured in this group.
One or more other groups (with any names) specify an identifier for each observation (above it was the child number).

Another example is the iris data set which usually has one 5-dimensional observation per row (four numeric measures, and one character Species). Below we create a shuffled version of the iris data with two observations per row:

set.seed(1)
iris.rand <- iris.dt[sample(.N)]
iris.wide <- cbind(treatment=iris.rand[1:75], control=iris.rand[76:150])
print(iris.wide, topn=2, nrows=10)
#>     treatment.i treatment.Sepal.Length treatment.Sepal.Width
#>  1:          68                    5.8                   2.7
#>  2:         129                    6.4                   2.8
#> ---                                                         
#> 74:          91                    5.5                   2.6
#> 75:          64                    6.1                   2.9
#>     treatment.Petal.Length treatment.Petal.Width treatment.Species control.i
#>  1:                    4.1                   1.0        versicolor        60
#>  2:                    5.6                   2.1         virginica       113
#> ---                                                                         
#> 74:                    4.4                   1.2        versicolor        57
#> 75:                    4.7                   1.4        versicolor        72
#>     control.Sepal.Length control.Sepal.Width control.Petal.Length
#>  1:                  5.2                 2.7                  3.9
#>  2:                  6.8                 3.0                  5.5
#> ---                                                              
#> 74:                  6.3                 3.3                  4.7
#> 75:                  6.1                 2.8                  4.0
#>     control.Petal.Width control.Species
#>  1:                 1.4      versicolor
#>  2:                 2.1       virginica
#> ---                                    
#> 74:                 1.6      versicolor
#> 75:                 1.3      versicolor

Below we show the usual data.table syntax for getting the original iris back.

iris.melted <- melt(iris.wide, measure.vars = patterns(
  i="i$",
  Sepal.Length="Sepal.Length$",
  Sepal.Width="Sepal.Width$",
  Petal.Length="Petal.Length$",
  Petal.Width="Petal.Width$",
  Species="Species$"))
identical(iris.melted[order(i), names(iris.dt), with=FALSE], iris.dt)
#> [1] TRUE

We can do the same thing with less repetition via:

(nc.melted <- nc::capture_melt_multiple(
  iris.wide,
  group=".*?",
  "[.]",
  column=".*"))
#>          group Petal.Length Petal.Width Sepal.Length Sepal.Width    Species   i
#>   1:   control          3.9         1.4          5.2         2.7 versicolor  60
#>   2:   control          5.5         2.1          6.8         3.0  virginica 113
#>   3:   control          5.6         1.4          6.1         2.6  virginica 135
#>   4:   control          1.5         0.1          4.9         3.1     setosa  10
#>   5:   control          1.4         0.2          5.1         3.5     setosa   1
#>  ---                                                                           
#> 146: treatment          1.6         0.2          4.8         3.1     setosa  31
#> 147: treatment          1.3         0.4          5.4         3.9     setosa  17
#> 148: treatment          5.4         2.1          6.9         3.1  virginica 140
#> 149: treatment          4.4         1.2          5.5         2.6 versicolor  91
#> 150: treatment          4.7         1.4          6.1         2.9 versicolor  64
identical(nc.melted[order(i), names(iris.dt), with=FALSE], iris.dt)
#> [1] TRUE

To conclude this section, we have seen that a simple non-repetitive regex syntax can be used with nc::capture_melt_multiple to melt wide data into several columns of different types.