library(joyn)
#>
#> Attaching package: 'joyn'
#> The following object is masked from 'package:base':
#>
#> merge
library(data.table)
<- data.table(
x id = c(1, 4, 2, 3, NA),
t = c(1L, 2L, 1L, 2L, NA),
country = c(16, 12, 3, NA, 15)
)
<- data.table(
y id = c(1, 2, 5, 6, 3),
gdp = c(11L, 15L, 20L, 13L, 10L),
country = 16:20
)
Imagine that variable country is available in the two tables that you’re going to join. You could do one of five things:
This is the default if you don’t use the argument by
# The variables with the same name, `id` and `x`, are used as key
# variables.
merge(x, y)[]
#> > removing key variables `id` and `country` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 4 44.4%
#> 2: x & y 1 11.1%
#> 3: y 4 44.4%
#> 4: total 9 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id country t gdp report
#> 1: 1 16 1 11 x & y
#> 2: 2 3 1 NA x
#> 3: 2 17 NA 15 y
#> 4: 3 20 NA 10 y
#> 5: 3 NA 2 NA x
#> 6: 4 12 2 NA x
#> 7: 5 18 NA 20 y
#> 8: 6 19 NA 13 y
#> 9: NA 15 NA NA x
This the default if you did not include country as part of the key variables in argument by
.
merge(x, y, by = "id")[]
#> > removing key variables `id` from yvars
#> i variable `country` in table y is ignored because arguments `update_NAs` and
#> `update_values` are FALSE.
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 28.6%
#> 2: x & y 3 42.9%
#> 3: y 2 28.6%
#> 4: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country gdp report
#> 1: 1 1 16 11 x & y
#> 2: 2 1 3 15 x & y
#> 3: 3 2 NA 10 x & y
#> 4: 4 2 12 NA x
#> 5: 5 NA NA 20 y
#> 6: 6 NA NA 13 y
#> 7: NA NA 15 NA x
Update only NAs values in variable country in table x (i.e., left or master) with the actual values of the matching observations in country from table y (i.e, right or using). In this case, actual values in country from table x will remain unchanged.
merge(x, y, by = "id", update_NAs = TRUE)[]
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: NA updated 1 14.3%
#> 2: x 2 28.6%
#> 3: x & y 2 28.6%
#> 4: y 2 28.6%
#> 5: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country gdp report
#> 1: 1 1 16 11 x & y
#> 2: 2 1 3 15 x & y
#> 3: 3 2 20 10 NA updated
#> 4: 4 2 12 NA x
#> 5: 5 NA 18 20 y
#> 6: 6 NA 19 13 y
#> 7: NA NA 15 NA x
Update all the values–NAs and actual–in variable country of table x (i.e., left or master) with the actual values of the matching observations in country from table y (i.e, right or using). Also, notice the output of the table below: only the values that are different between country from x and country from y are updated. If, for the same matching observations, the values between the two country variables are the same, the reporting variable says “x & y” rather than “value updated”, so you know that there is no update to make. Finally, If there are NAs in variable country from table y, the actual values in x will be unchanged, and you get a “not updated” status in the reporting variable. Yet, there is a way to solve this, which explained below using argument keep_y_in_x
.
# notice that only the value that are
merge(x, y, by = "id", update_values = TRUE)[]
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: NA updated 1 14.3%
#> 2: not updated 2 28.6%
#> 3: value updated 1 14.3%
#> 4: x & y 1 14.3%
#> 5: y 2 28.6%
#> 6: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country gdp report
#> 1: 1 1 16 11 x & y
#> 2: 2 1 17 15 value updated
#> 3: 3 2 20 10 NA updated
#> 4: 4 2 12 NA not updated
#> 5: 5 NA 18 20 y
#> 6: 6 NA 19 13 y
#> 7: NA NA 15 NA not updated
Bring variable country from y into the resulting table with a different name (country.y).
merge(x, y, by = "id", keep_y_in_x = TRUE)[]
#> > removing key variables `id` from yvars
#> i variable `country` in table y is ignored because arguments `update_NAs` and
#> `update_values` are FALSE.
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 28.6%
#> 2: x & y 3 42.9%
#> 3: y 2 28.6%
#> 4: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country gdp country.y report
#> 1: 1 1 16 11 16 x & y
#> 2: 2 1 3 15 17 x & y
#> 3: 3 2 NA 10 20 x & y
#> 4: 4 2 12 NA NA x
#> 5: 5 NA NA 20 18 y
#> 6: 6 NA NA 13 19 y
#> 7: NA NA 15 NA NA x
If you need to update country in x, including the actual values with NAs from country in y, you could do the following, keeping in mind that the reporting variable is not that useful anymore.
<- merge(x, y, by = "id", keep_y_in_x = TRUE, verbose = FALSE)
dd := country.y
dd[, country := NULL]
][, country.y
dd[]#> id t country gdp report
#> 1: 1 1 16 11 x & y
#> 2: 2 1 17 15 x & y
#> 3: 3 2 20 10 x & y
#> 4: 4 2 NA NA x
#> 5: 5 NA 18 20 y
#> 6: 6 NA 19 13 y
#> 7: NA NA NA NA x
Most of the times, you don’t need to join all the variable from table y into table x, but just a few of them. By default, joyn
brings all the variables in y into x because the argument yvars
is set to TRUE
. Yet, if you set it to FALSE
or NULL
it will bring NO variable from y into x, only only the reporting variable, which many times is very nice trick. Finally, if you set the yvars
argument to the specific name of variable in y, only that variable will be carried over in the join.
# Bring all variables updating values
merge(x, y, by = "id", update_values = TRUE)[]
#> > removing key variables `id` from yvars
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: NA updated 1 14.3%
#> 2: not updated 2 28.6%
#> 3: value updated 1 14.3%
#> 4: x & y 1 14.3%
#> 5: y 2 28.6%
#> 6: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country gdp report
#> 1: 1 1 16 11 x & y
#> 2: 2 1 17 15 value updated
#> 3: 3 2 20 10 NA updated
#> 4: 4 2 12 NA not updated
#> 5: 5 NA 18 20 y
#> 6: 6 NA 19 13 y
#> 7: NA NA 15 NA not updated
# Bring no variables from y
merge(x, y, by = "id", yvars = FALSE)[]
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 28.6%
#> 2: x & y 3 42.9%
#> 3: y 2 28.6%
#> 4: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country report
#> 1: 1 1 16 x & y
#> 2: 2 1 3 x & y
#> 3: 3 2 NA x & y
#> 4: 4 2 12 x
#> 5: 5 NA NA y
#> 6: 6 NA NA y
#> 7: NA NA 15 x
# Bring only variable gdp from table y. So variable country won't be updated either because table y is filtered before the coparison of variables is done.
merge(x, y, by = "id", yvars = "gdp")[]
#>
#> -- JOYn Report --
#>
#> report n percent
#> 1: x 2 28.6%
#> 2: x & y 3 42.9%
#> 3: y 2 28.6%
#> 4: total 7 100%
#> ---------------------------------------------------------- End of JOYn report --
#> id t country gdp report
#> 1: 1 1 16 11 x & y
#> 2: 2 1 3 15 x & y
#> 3: 3 2 NA 10 x & y
#> 4: 4 2 12 NA x
#> 5: 5 NA NA 20 y
#> 6: 6 NA NA 13 y
#> 7: NA NA 15 NA x
Coming soon
Coming soon