Notice too that joyn::merge
masks base::merge
, which is ideal.
library(joyn)
library(data.table)
= data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
x1 t = c(1L, 2L, 1L, 2L, NA_integer_),
x = 11:15)
= data.table(id = c(1,2, 4),
y1 y = c(11L, 15L, 16))
= data.table(id = c(1, 4, 2, 3, NA),
x2 t = c(1L, 2L, 1L, 2L, NA_integer_),
x = c(16, 12, NA, NA, 15))
= data.table(id = c(1, 2, 5, 6, 3),
y2 yd = c(1, 2, 5, 6, 3),
y = c(11L, 15L, 20L, 13L, 10L),
x = c(16:20))
<- data.table(id = c("c","b", "c", "a"),
y3 y = c(11L, 15L, 18L, 20L))
<- data.table(id=c("c","b", "d"),
x3 v=8:10,
foo=c(4,2, 7))
= data.table(id1 = c(1, 1, 2, 3, 3),
x4 id2 = c(1, 1, 2, 3, 4),
t = c(1L, 2L, 1L, 2L, NA_integer_),
x = c(16, 12, NA, NA, 15))
= data.table(id = c(1, 2, 5, 6, 3),
y4 id2 = c(1, 1, 2, 3, 4),
y = c(11L, 15L, 20L, 13L, 10L),
x = c(16:20))
You can see in the examples below that by default joyn
displays a small summary table.
# Natural join. i.e., using as key all the variables that appear in both tables.
# without display the resulting table.
merge(x1, y1)
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 33.3%
#> 2: x & y 3 50%
#> 3: y 1 16.7%
#> 4: total 6 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y
#> 4: 3 2 14 NA x
#> 5: 4 NA NA 16 y
#> 6: NA NA 15 NA x
You can use the reporting format of the report
variables, by setting the argument reporttype
to numeric, similar to Stata’s output.
code | report | meaning |
---|---|---|
1 | x | Only available in x table |
2 | y | Only available in y table |
3 | x & y | Matching obs available in both tables |
4 | NA updated | NAs in x updated with actual values in variables with same names in y |
5 | value updated | Actual values and NAs in x updated with actual values in variables with same names in y |
6 | not updated | Actual values and NAs in x are NOT updated with actual values in y |
Since the object returned is a data.table
, we display the results in the console by concatenating a pair of squared brackets at the end of the call. That is merge(x1, y1)[]
instead of merge(x1, y1)
.
# Natural join. display the results.
merge(x1, y1)[]
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 33.3%
#> 2: x & y 3 50%
#> 3: y 1 16.7%
#> 4: total 6 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y
#> 4: 3 2 14 NA x
#> 5: 4 NA NA 16 y
#> 6: NA NA 15 NA x
# Making explicit the key variables (strongly suggested)
merge(x1, y1, by = "id")[]
#> > removing key variables `id` from yvars
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 33.3%
#> 2: x & y 3 50%
#> 3: y 1 16.7%
#> 4: total 6 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y
#> 4: 3 2 14 NA x
#> 5: 4 NA NA 16 y
#> 6: NA NA 15 NA x
If the key variables have different names in both tables, you can create the equivalency as an element of by
vector.
# joining by id1 in x and id in y
merge(x4, y4, by = c("id1 = id"))
#> > removing key variables `keyby1` from yvars
#> i variables `id2` and `x` in table y are ignored because arguments `update_NAs`
#> and `update_values` are FALSE.
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x & y 5 71.4%
#> 2: y 2 28.6%
#> 3: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id1 id2 t x y report
#> 1: 1 1 1 16 11 x & y
#> 2: 1 1 2 12 11 x & y
#> 3: 2 2 1 NA 15 x & y
#> 4: 3 3 2 NA 10 x & y
#> 5: 3 4 NA 15 10 x & y
#> 6: 5 NA NA NA 20 y
#> 7: 6 NA NA NA 13 y
# joining by id1 in x and id in y, and id2 in both
merge(x4, y4, by = c("id1 = id", "id2"))
#> > removing key variables `keyby1` and `id2` from yvars
#> i variable `x` in table y is ignored because arguments `update_NAs` and
#> `update_values` are FALSE.
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 25%
#> 2: x & y 3 37.5%
#> 3: y 3 37.5%
#> 4: total 8 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id1 id2 t x y report
#> 1: 1 1 1 16 11 x & y
#> 2: 1 1 2 12 11 x & y
#> 3: 2 2 1 NA NA x
#> 4: 2 1 NA NA 15 y
#> 5: 3 3 2 NA NA x
#> 6: 3 4 NA 15 10 x & y
#> 7: 5 2 NA NA 20 y
#> 8: 6 3 NA NA 13 y
Match type refers to the correspondence that exists between the observations of the joining tables. Following Stata’s convention, we could have four different match types, one-to-one (1:1) , one-to-many (1:m), many-to-one (m:1), and many-to-many (m:m) joins. The default is m:m, following general R’s practice, but its use is highly discouraged. We recommend you always specify the match type when joining tables to ensure the output is correct. Indeed, if you don’t care about match types or you don’t think it is necessary to use them for your particular needs, you might be fine without joyn
.
Using the same wording of the Stata manual
1:1: specifies a one-to-one match merge. The variables specified in by
uniquely identify single observations in both table.
1:m and m:1: specify one-to-many and many-to-one match merges, respectively. This means that in of the tables the observations are uniquely identify by the variables in by,
while in the other table many (two or more) of the observations are identify by the variables in by
.
m:m refers to many-to-many joins. variables in by
does not uniquely identify the observations in either table. Matching is performed by combining observations with equal values in by
; within matching values, the first observation in the master (i.e. left or x) table is matched with the first matching observation in the using (i.e. right or y) table; the second, with the second; and so on. If there is an unequal number of observations within a group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group.
Suppose you think your data is uniquely identified by variable id
, it is not . If you used match_type = "1:1"
you will get and error, letting you know that something is not right.
# Merging correctly but getting error because something is not right in the data
merge(x3, y3, by = "id", match_type = "1:1")
#> x table y is not uniquely identified by `id`
#> Error: match type inconsistency
#> i you could use `return_report = TRUE` in `joyn::is_id()`
#> to see where the problem is
# Merging wrongly but getting NO errors because you did not use match_type
merge(x3, y3, by = "id")
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 1 20%
#> 2: x & y 3 60%
#> 3: y 1 20%
#> 4: total 5 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id v foo y report
#> 1: a NA NA 20 y
#> 2: b 9 2 15 x & y
#> 3: c 8 4 11 x & y
#> 4: c 8 4 18 x & y
#> 5: d 10 7 NA x
Join type refers to observations that are kept after the join. To avoid confusion with argument match_type
, I decided not to name this argument as join_type
but as keep
. At the end, you want to specify which observations you want to keep. This argument plays the role of mimicking the behavior of dplyr
’s functions left_join
, right_join
, inner_join
, and full_join
, the default.
keep = "full"
, the default, keeps all the observations in x and y, regardless of whether they match or not.
# Full merge
merge(x1, y1)
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 33.3%
#> 2: x & y 3 50%
#> 3: y 1 16.7%
#> 4: total 6 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y
#> 4: 3 2 14 NA x
#> 5: 4 NA NA 16 y
#> 6: NA NA 15 NA x
keep = "left"
or keep = "master"
keeps all observations in x, regardless of whether they match or not with y.
# keep obs in x
merge(x1, y1, keep = "left")
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 40%
#> 2: x & y 3 60%
#> 3: total 5 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y
#> 4: 3 2 14 NA x
#> 5: NA NA 15 NA x
keep = "right"
or keep = "using"
keeps all observations in y, regardless of whether they match or not with x.
# keep obs in y
merge(x1, y1, keep = "right")
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x & y 3 75%
#> 2: y 1 25%
#> 3: total 4 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y
#> 4: 4 NA NA 16 y
keep = "inner"
keeps only the observations that match in both tables.
# keep matching obs in both tables
merge(x1, y1, keep = "inner")
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x & y 3 100%
#> 2: total 3 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t x y report
#> 1: 1 1 11 11 x & y
#> 2: 1 2 12 11 x & y
#> 3: 2 1 13 15 x & y