This vignette explains how to use functions for “melting” wide data tables, i.e. converting to tall/long data tables:
nc::capture_melt_single
is useful when you have a wide data table
with lots of columns, all of the same type, that you would like to
melt into a single result/output column.nc::capture_melt_multiple
is useful when you have a wide
data table with columns of different types that you would like to
melt into multiple result/output columns.Both are useful mainly in the case when you want to use a regular expression to specify both (1) the set of input columns to reshape and (2) some information to extract from those column names. To clarify the discussion we introduce the following three terms:
measure.vars
in reshape2/data.table.id.vars
in reshape2/data.table.tidyr::pivot_longer
is the only other
data reshaping function which directly supports
capture columns.Sometimes you want to melt a “wide” data table which has several distinct pieces of information encoded in each column name. One example is the familiar iris data, which have flower part and measurement dimension encoded in each of four column names:
library(data.table)
(iris.dt <- data.table(i=1:nrow(iris), iris[,1:4], Species=paste(iris$Species)))
#> i Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1: 1 5.1 3.5 1.4 0.2 setosa
#> 2: 2 4.9 3.0 1.4 0.2 setosa
#> 3: 3 4.7 3.2 1.3 0.2 setosa
#> 4: 4 4.6 3.1 1.5 0.2 setosa
#> 5: 5 5.0 3.6 1.4 0.2 setosa
#> ---
#> 146: 146 6.7 3.0 5.2 2.3 virginica
#> 147: 147 6.3 2.5 5.0 1.9 virginica
#> 148: 148 6.5 3.0 5.2 2.0 virginica
#> 149: 149 6.2 3.4 5.4 2.3 virginica
#> 150: 150 5.9 3.0 5.1 1.8 virginica
Those four reshape column names can be specified via a regex in
nc::capture_melt_single
. The first argument is the input data table
to reshape, and the subsequent arguments are interpreted as a pattern
which is passed to nc::capture_first_vec
. Any input column names
which match the specified regex will be passed as measure.vars
to
melt
:
(iris.tall <- nc::capture_melt_single(
iris.dt,
part=".*",
"[.]",
dim=".*"))
#> i Species part dim value
#> 1: 1 setosa Sepal Length 5.1
#> 2: 2 setosa Sepal Length 4.9
#> 3: 3 setosa Sepal Length 4.7
#> 4: 4 setosa Sepal Length 4.6
#> 5: 5 setosa Sepal Length 5.0
#> ---
#> 596: 146 virginica Petal Width 2.3
#> 597: 147 virginica Petal Width 1.9
#> 598: 148 virginica Petal Width 2.0
#> 599: 149 virginica Petal Width 2.3
#> 600: 150 virginica Petal Width 1.8
The output comes from joining the result of nc::capture_first_vec
to
the result of melt
. Note the output has two copy columns (i,
Species), one reshape column (value), and two capture columns (part,
dim).
After the data have
been converted to the tall/long table above, we can do a dcast
to get a
wider data table:
(iris.part.cols <- dcast(
iris.tall,
i + Species + dim ~ part))
#> i Species dim Petal Sepal
#> 1: 1 setosa Length 1.4 5.1
#> 2: 1 setosa Width 0.2 3.5
#> 3: 2 setosa Length 1.4 4.9
#> 4: 2 setosa Width 0.2 3.0
#> 5: 3 setosa Length 1.3 4.7
#> ---
#> 296: 148 virginica Width 2.0 3.0
#> 297: 149 virginica Length 5.4 6.2
#> 298: 149 virginica Width 2.3 3.4
#> 299: 150 virginica Length 5.1 5.9
#> 300: 150 virginica Width 1.8 3.0
Rather than use capture_melt_single
and then dcast
, we could
instead use capture_melt_multiple
to get the same result:
nc::capture_melt_multiple(
iris.dt,
column=".*",
"[.]",
dim=".*")
#> i Species dim Petal Sepal
#> 1: 1 setosa Length 1.4 5.1
#> 2: 2 setosa Length 1.4 4.9
#> 3: 3 setosa Length 1.3 4.7
#> 4: 4 setosa Length 1.5 4.6
#> 5: 5 setosa Length 1.4 5.0
#> ---
#> 296: 146 virginica Width 2.3 3.0
#> 297: 147 virginica Width 1.9 2.5
#> 298: 148 virginica Width 2.0 3.0
#> 299: 149 virginica Width 2.3 3.4
#> 300: 150 virginica Width 1.8 3.0
Note that the reshaped table above contains two copy columns (i, Species), two reshape columns (Sepal, Petal), and one capture columns (dim). These can help us visualize whether or not sepals are bigger than petals:
if(require(ggplot2)){
ggplot()+
theme_bw()+
theme(panel.spacing=grid::unit(0, "lines"))+
facet_grid(dim ~ Species)+
coord_equal()+
geom_abline(slope=1, intercept=0, color="grey")+
geom_point(aes(
Petal, Sepal),
data=iris.part.cols)
}
#> Loading required package: ggplot2
It is clear from the plot above that sepals are indeed both longer and wider than petals, on each measured plant.
Another data set where it is useful to do column name pattern matching followed by melting is the World Health Organization data:
if(requireNamespace("tidyr")){
data(who, package="tidyr")
}else{
who <- data.frame(id=1, new_sp_m5564=2, newrel_f65=3)
}
#> Loading required namespace: tidyr
names(who)
#> [1] "country" "iso2" "iso3" "year" "new_sp_m014"
#> [6] "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" "new_sp_m5564"
#> [11] "new_sp_m65" "new_sp_f014" "new_sp_f1524" "new_sp_f2534" "new_sp_f3544"
#> [16] "new_sp_f4554" "new_sp_f5564" "new_sp_f65" "new_sn_m014" "new_sn_m1524"
#> [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564" "new_sn_m65"
#> [26] "new_sn_f014" "new_sn_f1524" "new_sn_f2534" "new_sn_f3544" "new_sn_f4554"
#> [31] "new_sn_f5564" "new_sn_f65" "new_ep_m014" "new_ep_m1524" "new_ep_m2534"
#> [36] "new_ep_m3544" "new_ep_m4554" "new_ep_m5564" "new_ep_m65" "new_ep_f014"
#> [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554" "new_ep_f5564"
#> [46] "new_ep_f65" "newrel_m014" "newrel_m1524" "newrel_m2534" "newrel_m3544"
#> [51] "newrel_m4554" "newrel_m5564" "newrel_m65" "newrel_f014" "newrel_f1524"
#> [56] "newrel_f2534" "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"
Each column which starts with new
has three distinct pieces of
information encoded in its name: diagnosis type (e.g. sp or rel),
gender (m or f), and age range (e.g. 5564 or 1524). We would like to
use a regex to match these column names, then using the matching
columns as measure.vars in a melt, then join the two results.
new.diag.gender <- list(
"new_?",
diagnosis=".*",
"_",
gender=".")
nc::capture_melt_single(who, new.diag.gender, ages=".*")
#> country iso2 iso3 year diagnosis gender ages value
#> 1: Afghanistan AF AFG 1997 sp m 014 0
#> 2: Afghanistan AF AFG 1998 sp m 014 30
#> 3: Afghanistan AF AFG 1999 sp m 014 8
#> 4: Afghanistan AF AFG 2000 sp m 014 52
#> 5: Afghanistan AF AFG 2001 sp m 014 129
#> ---
#> 76042: Viet Nam VN VNM 2013 rel f 65 3110
#> 76043: Wallis and Futuna Islands WF WLF 2013 rel f 65 2
#> 76044: Yemen YE YEM 2013 rel f 65 360
#> 76045: Zambia ZM ZMB 2013 rel f 65 669
#> 76046: Zimbabwe ZW ZWE 2013 rel f 65 725
Note the output includes the new reshape column called value
by
default, as in melt
. The input reshape column
names which matched the specified pattern, and there is a new column
for each group in that pattern. The following example shows how to
rename the value
column and use numeric type conversion functions:
years.pattern <- list(new.diag.gender, ages=list(
min.years="0|[0-9]{2}", as.numeric,
max.years="[0-9]{0,2}", function(x)ifelse(x=="", Inf, as.numeric(x))))
(who.typed <- nc::capture_melt_single(
who, years.pattern,
value.name="count"))
#> country iso2 iso3 year diagnosis gender ages min.years
#> 1: Afghanistan AF AFG 1997 sp m 014 0
#> 2: Afghanistan AF AFG 1998 sp m 014 0
#> 3: Afghanistan AF AFG 1999 sp m 014 0
#> 4: Afghanistan AF AFG 2000 sp m 014 0
#> 5: Afghanistan AF AFG 2001 sp m 014 0
#> ---
#> 76042: Viet Nam VN VNM 2013 rel f 65 65
#> 76043: Wallis and Futuna Islands WF WLF 2013 rel f 65 65
#> 76044: Yemen YE YEM 2013 rel f 65 65
#> 76045: Zambia ZM ZMB 2013 rel f 65 65
#> 76046: Zimbabwe ZW ZWE 2013 rel f 65 65
#> max.years count
#> 1: 14 0
#> 2: 14 30
#> 3: 14 8
#> 4: 14 52
#> 5: 14 129
#> ---
#> 76042: Inf 3110
#> 76043: Inf 2
#> 76044: Inf 360
#> 76045: Inf 669
#> 76046: Inf 725
str(who.typed)
#> Classes 'data.table' and 'data.frame': 76046 obs. of 10 variables:
#> $ country : chr "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
#> $ iso2 : chr "AF" "AF" "AF" "AF" ...
#> $ iso3 : chr "AFG" "AFG" "AFG" "AFG" ...
#> $ year : int 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 ...
#> $ diagnosis: chr "sp" "sp" "sp" "sp" ...
#> $ gender : chr "m" "m" "m" "m" ...
#> $ ages : chr "014" "014" "014" "014" ...
#> $ min.years: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ max.years: num 14 14 14 14 14 14 14 14 14 14 ...
#> $ count : int 0 30 8 52 129 90 127 139 151 193 ...
#> - attr(*, ".internal.selfref")=<externalptr>
The result above shows that
value.name
is used as the column name for the reshaped data in the
output (as in melt.data.table
).In conclusion nc::capture_melt_single
does data
reshaping from wide to tall/long which (1) makes it easy to define complex patterns (2)
supports type conversion without a post-processing step, and (3)
reduces repetition in user code.
In the previous sections we discussed methods for melting wide data columns that all have the same type. Sometimes wide data have columns of different types that we want to melt. The example below is taken from a data.table vignette,
family.dt <- fread(text="
family_id age_mother dob_child1 dob_child2 dob_child3 gender_child1 gender_child2 gender_child3
1 30 1998-11-26 2000-01-29 NA 1 2 NA
2 27 1996-06-22 NA NA 2 NA NA
3 26 2002-07-11 2004-04-05 2007-09-02 2 2 1
4 32 2004-10-10 2009-08-27 2012-07-21 1 1 1
5 29 2000-12-05 2005-02-28 NA 2 1 NA")
The data table above has one row for each family, and two columns for each child. One column is the date of birth (character), the other is the gender (coded as an integer). We can use standard data.table functions to melt these data into a tall/long table with one row for each child:
melt(family.dt, measure.vars=patterns(
dob="^dob", gender="^gender"
))
#> family_id age_mother variable dob gender
#> 1: 1 30 1 1998-11-26 1
#> 2: 2 27 1 1996-06-22 2
#> 3: 3 26 1 2002-07-11 2
#> 4: 4 32 1 2004-10-10 1
#> 5: 5 29 1 2000-12-05 2
#> 6: 1 30 2 2000-01-29 2
#> 7: 2 27 2 <NA> NA
#> 8: 3 26 2 2004-04-05 2
#> 9: 4 32 2 2009-08-27 1
#> 10: 5 29 2 2005-02-28 1
#> 11: 1 30 3 <NA> NA
#> 12: 2 27 3 <NA> NA
#> 13: 3 26 3 2007-09-02 1
#> 14: 4 32 3 2012-07-21 1
#> 15: 5 29 3 <NA> NA
The melt above results in a table with one row for each child, and one
column for each argument of patterns
. However the code is somewhat
repetitive, as dob
and gender
must be repeated. To avoid this
repetition we can instead use:
(children.dt <- nc::capture_melt_multiple(
family.dt,
column=".*",
"_",
nc::field("child", "", "[1-3]", as.integer),
na.rm=TRUE))
#> family_id age_mother child dob gender
#> 1: 1 30 1 1998-11-26 1
#> 2: 2 27 1 1996-06-22 2
#> 3: 3 26 1 2002-07-11 2
#> 4: 4 32 1 2004-10-10 1
#> 5: 5 29 1 2000-12-05 2
#> 6: 1 30 2 2000-01-29 2
#> 7: 3 26 2 2004-04-05 2
#> 8: 4 32 2 2009-08-27 1
#> 9: 5 29 2 2005-02-28 1
#> 10: 3 26 3 2007-09-02 1
#> 11: 4 32 3 2012-07-21 1
str(children.dt)
#> Classes 'data.table' and 'data.frame': 11 obs. of 5 variables:
#> $ family_id : int 1 2 3 4 5 1 3 4 5 3 ...
#> $ age_mother: int 30 27 26 32 29 30 26 32 29 26 ...
#> $ child : int 1 1 1 1 1 2 2 2 2 3 ...
#> $ dob : IDate, format: "1998-11-26" "1996-06-22" ...
#> $ gender : int 1 2 2 1 2 2 2 1 1 1 ...
#> - attr(*, ".internal.selfref")=<externalptr>
The result above is similar to the previous result (and in fact
melt.data.table
is used internally), but the code is less
repetitive. The first argument of capture_melt_multiple
is the
subject data table and the following arguments form a pattern which is
matched to the input data column names. The pattern must have at least
two groups:
column
. This group is used to capture the
part of the input column names which will be used for the output
column names. The output will contain one column for each unique
value captured in this group.child
number).Another example is the iris data set which usually has one
5-dimensional observation per row (four numeric measures, and one
character Species
). Below we create a shuffled version of the iris
data with two observations per row:
set.seed(1)
iris.rand <- iris.dt[sample(.N)]
iris.wide <- cbind(treatment=iris.rand[1:75], control=iris.rand[76:150])
print(iris.wide, topn=2, nrows=10)
#> treatment.i treatment.Sepal.Length treatment.Sepal.Width
#> 1: 68 5.8 2.7
#> 2: 129 6.4 2.8
#> ---
#> 74: 91 5.5 2.6
#> 75: 64 6.1 2.9
#> treatment.Petal.Length treatment.Petal.Width treatment.Species control.i
#> 1: 4.1 1.0 versicolor 60
#> 2: 5.6 2.1 virginica 113
#> ---
#> 74: 4.4 1.2 versicolor 57
#> 75: 4.7 1.4 versicolor 72
#> control.Sepal.Length control.Sepal.Width control.Petal.Length
#> 1: 5.2 2.7 3.9
#> 2: 6.8 3.0 5.5
#> ---
#> 74: 6.3 3.3 4.7
#> 75: 6.1 2.8 4.0
#> control.Petal.Width control.Species
#> 1: 1.4 versicolor
#> 2: 2.1 virginica
#> ---
#> 74: 1.6 versicolor
#> 75: 1.3 versicolor
Below we show the usual data.table syntax for getting the original iris back.
iris.melted <- melt(iris.wide, measure.vars = patterns(
i="i$",
Sepal.Length="Sepal.Length$",
Sepal.Width="Sepal.Width$",
Petal.Length="Petal.Length$",
Petal.Width="Petal.Width$",
Species="Species$"))
identical(iris.melted[order(i), names(iris.dt), with=FALSE], iris.dt)
#> [1] TRUE
We can do the same thing with less repetition via:
(nc.melted <- nc::capture_melt_multiple(
iris.wide,
group=".*?",
"[.]",
column=".*"))
#> group Petal.Length Petal.Width Sepal.Length Sepal.Width Species i
#> 1: control 3.9 1.4 5.2 2.7 versicolor 60
#> 2: control 5.5 2.1 6.8 3.0 virginica 113
#> 3: control 5.6 1.4 6.1 2.6 virginica 135
#> 4: control 1.5 0.1 4.9 3.1 setosa 10
#> 5: control 1.4 0.2 5.1 3.5 setosa 1
#> ---
#> 146: treatment 1.6 0.2 4.8 3.1 setosa 31
#> 147: treatment 1.3 0.4 5.4 3.9 setosa 17
#> 148: treatment 5.4 2.1 6.9 3.1 virginica 140
#> 149: treatment 4.4 1.2 5.5 2.6 versicolor 91
#> 150: treatment 4.7 1.4 6.1 2.9 versicolor 64
identical(nc.melted[order(i), names(iris.dt), with=FALSE], iris.dt)
#> [1] TRUE
To conclude this section, we have seen that a simple non-repetitive
regex syntax can be used with nc::capture_melt_multiple
to
melt wide data into several columns of different types.