Bivariate analysis of continuous and/or categorical variables

2021-07-06

Tidycomm includes four functions for bivariate explorative data analysis:

We will again use sample data from the Worlds of Journalism 2012-16 study for demonstration purposes:

WoJ
#> # A tibble: 1,200 x 15
#>    country   reach   employment temp_contract autonomy_selecti~ autonomy_emphas~
#>    <fct>     <fct>   <chr>      <fct>                     <dbl>            <dbl>
#>  1 Germany   Nation~ Full-time  Permanent                     5                4
#>  2 Germany   Nation~ Full-time  Permanent                     3                4
#>  3 Switzerl~ Region~ Full-time  Permanent                     4                4
#>  4 Switzerl~ Local   Part-time  Permanent                     4                5
#>  5 Austria   Nation~ Part-time  Permanent                     4                4
#>  6 Switzerl~ Local   Freelancer <NA>                          4                4
#>  7 Germany   Local   Full-time  Permanent                     4                4
#>  8 Denmark   Nation~ Full-time  Permanent                     3                3
#>  9 Switzerl~ Local   Full-time  Permanent                     5                5
#> 10 Denmark   Nation~ Full-time  Permanent                     2                4
#> # ... with 1,190 more rows, and 9 more variables: ethics_1 <dbl>,
#> #   ethics_2 <dbl>, ethics_3 <dbl>, ethics_4 <dbl>, work_experience <dbl>,
#> #   trust_parliament <dbl>, trust_government <dbl>, trust_parties <dbl>,
#> #   trust_politicians <dbl>

Compute contingency tables and Chi-square tests

crosstab() outputs a contingency table for one independent (column) variable and one or more dependent (row) variables:

WoJ %>% 
  crosstab(reach, employment)
#> # A tibble: 3 x 5
#>   employment Local Regional National Transnational
#>   <chr>      <dbl>    <dbl>    <dbl>         <dbl>
#> 1 Freelancer    23       36      104             9
#> 2 Full-time    111      287      438            66
#> 3 Part-time     15       32       75             4

Additional options include add_total (adds a row-wise Total column if set to TRUE) and percentages (outputs column-wise percentages instead of absolute values if set to TRUE):

WoJ %>% 
  crosstab(reach, employment, add_total = TRUE, percentages = TRUE)
#> # A tibble: 3 x 6
#>   employment Local Regional National Transnational Total
#>   <chr>      <dbl>    <dbl>    <dbl>         <dbl> <dbl>
#> 1 Freelancer 0.154   0.101     0.169        0.114  0.143
#> 2 Full-time  0.745   0.808     0.710        0.835  0.752
#> 3 Part-time  0.101   0.0901    0.122        0.0506 0.105

Setting chi_square = TRUE computes a \(\chi^2\) test including Cramer’s \(V\) and outputs the results in a console message:

WoJ %>% 
  crosstab(reach, employment, chi_square = TRUE)
#> Chi-square = 16.005239, df = 6.000000, p = 0.013726, V = 0.081663
#> # A tibble: 3 x 5
#>   employment Local Regional National Transnational
#>   <chr>      <dbl>    <dbl>    <dbl>         <dbl>
#> 1 Freelancer    23       36      104             9
#> 2 Full-time    111      287      438            66
#> 3 Part-time     15       32       75             4

Finally, passing multiple row variables will treat all unique value combinations as a single variable for percentage and Chi-square computations:

WoJ %>% 
  crosstab(reach, employment, country, percentages = TRUE)
#> # A tibble: 15 x 6
#>    employment country       Local Regional National Transnational
#>    <chr>      <fct>         <dbl>    <dbl>    <dbl>         <dbl>
#>  1 Freelancer Austria     0.0134   0.0113   0.0162         0     
#>  2 Freelancer Denmark     0.0537   0.0197   0.112          0.0127
#>  3 Freelancer Germany     0.0470   0.0507   0.00648        0     
#>  4 Freelancer Switzerland 0.0403   0.00845  0.00162        0     
#>  5 Freelancer UK          0        0.0113   0.0324         0.101 
#>  6 Full-time  Austria     0.0403   0.180    0.152          0.0127
#>  7 Full-time  Denmark     0.168    0.192    0.295          0     
#>  8 Full-time  Germany     0.268    0.172    0.0616         0     
#>  9 Full-time  Switzerland 0.168    0.197    0.0875         0.0633
#> 10 Full-time  UK          0.101    0.0676   0.113          0.759 
#> 11 Part-time  Austria     0        0.0225   0.0292         0     
#> 12 Part-time  Denmark     0.00671  0.0113   0.0178         0     
#> 13 Part-time  Germany     0        0.00282  0.00648        0     
#> 14 Part-time  Switzerland 0.0872   0.0479   0.0632         0     
#> 15 Part-time  UK          0.00671  0.00563  0.00486        0.0506

Compute t-Tests

Use t_test() to quickly compute t-Tests for a group variable and one or more test variables. Output includes test statistics, descriptive statistics and Cohen’s \(d\) effect size estimates:

WoJ %>% 
  t_test(temp_contract, autonomy_selection, autonomy_emphasis)
#> # A tibble: 2 x 10
#>   Variable M_Permanent SD_Permanent M_Temporary SD_Temporary Delta_M     t    df
#>   <chr>          <dbl>        <dbl>       <dbl>        <dbl>   <dbl> <dbl> <dbl>
#> 1 autonom~        3.91        0.755        3.70        0.932   0.212  1.96   998
#> 2 autonom~        4.12        0.768        3.89        0.870   0.237  2.17   995
#> # ... with 2 more variables: p <dbl>, d <dbl>

Passing no test variables will compute t-Tests for all numerical variables in the data:

WoJ %>% 
  t_test(temp_contract)
#> # A tibble: 11 x 10
#>    Variable     M_Permanent SD_Permanent M_Temporary SD_Temporary Delta_M      t
#>    <chr>              <dbl>        <dbl>       <dbl>        <dbl>   <dbl>  <dbl>
#>  1 autonomy_se~        3.91        0.755        3.70        0.932  0.212   1.96 
#>  2 autonomy_em~        4.12        0.768        3.89        0.870  0.237   2.17 
#>  3 ethics_1            1.57        0.850        1.98        0.990 -0.414  -3.41 
#>  4 ethics_2            3.24        1.26         3.51        1.23  -0.269  -1.51 
#>  5 ethics_3            2.37        1.12         2.28        0.928  0.0862  0.549
#>  6 ethics_4            2.53        1.24         2.57        1.22  -0.0323 -0.185
#>  7 work_experi~       17.7        10.5         11.3        11.8    6.42    4.29 
#>  8 trust_parli~        3.07        0.797        3.02        0.772  0.0539  0.480
#>  9 trust_gover~        2.87        0.847        2.64        0.811  0.229   1.92 
#> 10 trust_parti~        2.43        0.724        2.36        0.736  0.0719  0.703
#> 11 trust_polit~        2.53        0.707        2.40        0.689  0.136   1.37 
#> # ... with 3 more variables: df <dbl>, p <dbl>, d <dbl>

If passing a group variable with more than two unique levels, t_test() will produce a warning and default to the first two unique values. You can manually define the levels by setting the levels argument:

WoJ %>% 
  t_test(employment, autonomy_selection, autonomy_emphasis)
#> Warning: employment has more than 2 levels, defaulting to first two (Full-time
#> and Part-time). Consider filtering your data or setting levels with the levels
#> argument
#> # A tibble: 2 x 10
#>   Variable     `M_Full-time` `SD_Full-time` `M_Part-time` `SD_Part-time` Delta_M
#>   <chr>                <dbl>          <dbl>         <dbl>          <dbl>   <dbl>
#> 1 autonomy_se~          3.90          0.782          3.83          0.633  0.0780
#> 2 autonomy_em~          4.12          0.781          4.02          0.759  0.102 
#> # ... with 4 more variables: t <dbl>, df <dbl>, p <dbl>, d <dbl>

WoJ %>% 
  t_test(employment, autonomy_selection, autonomy_emphasis, levels = c("Full-time", "Freelancer"))
#> # A tibble: 2 x 10
#>   Variable `M_Full-time` `SD_Full-time` M_Freelancer SD_Freelancer Delta_M     t
#>   <chr>            <dbl>          <dbl>        <dbl>         <dbl>   <dbl> <dbl>
#> 1 autonom~          3.90          0.782         3.76         0.993   0.139  2.03
#> 2 autonom~          4.12          0.781         3.90         0.852   0.217  3.29
#> # ... with 3 more variables: df <dbl>, p <dbl>, d <dbl>

Additional options include:

Compute one-way ANOVAs

unianova() will compute one-way ANOVAs for one group variable and one or more test variables. Output includes test statistics and \(\eta^2\) effect size estimates.

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis)
#> # A tibble: 2 x 6
#>   Var                    F df_num df_denom       p eta_squared
#>   <chr>              <dbl>  <dbl>    <dbl>   <dbl>       <dbl>
#> 1 autonomy_selection  2.42      2     1194 0.0896      0.00403
#> 2 autonomy_emphasis   5.86      2     1192 0.00293     0.00974

Descriptives can be added by setting descriptives = TRUE. If no test variables are passed, all numerical variables in the data will be used:

WoJ %>% 
  unianova(employment, descriptives = TRUE)
#> # A tibble: 11 x 12
#>    Var        F df_num df_denom       p eta_squared `M_Full-time` `SD_Full-time`
#>    <chr>  <dbl>  <dbl>    <dbl>   <dbl>       <dbl>         <dbl>          <dbl>
#>  1 auto~  2.42       2     1194 8.96e-2    0.00403           3.90          0.782
#>  2 auto~  5.86       2     1192 2.93e-3    0.00974           4.12          0.781
#>  3 ethi~  2.17       2     1197 1.15e-1    0.00361           1.62          0.895
#>  4 ethi~  2.20       2     1197 1.11e-1    0.00367           3.24          1.27 
#>  5 ethi~  5.33       2     1197 4.98e-3    0.00882           2.39          1.13 
#>  6 ethi~  3.45       2     1197 3.20e-2    0.00574           2.58          1.25 
#>  7 work~  4.48       2     1184 1.15e-2    0.00752          17.5          10.7  
#>  8 trus~  1.53       2     1197 2.18e-1    0.00254           3.06          0.798
#>  9 trus~ 12.9        2     1197 2.97e-6    0.0210            2.82          0.835
#> 10 trus~  0.842      2     1197 4.31e-1    0.00141           2.42          0.720
#> 11 trus~  0.328      2     1197 7.21e-1    0.000547          2.52          0.704
#> # ... with 4 more variables: M_Part-time <dbl>, SD_Part-time <dbl>,
#> #   M_Freelancer <dbl>, SD_Freelancer <dbl>

You can also compute Tukey’s HSD post-hoc tests by setting post_hoc = TRUE. Results will be added as a tibble in a list column post_hoc.

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis, post_hoc = TRUE)
#> # A tibble: 2 x 7
#>   Var                    F df_num df_denom       p eta_squared post_hoc        
#>   <chr>              <dbl>  <dbl>    <dbl>   <dbl>       <dbl> <list>          
#> 1 autonomy_selection  2.42      2     1194 0.0896      0.00403 <tibble [3 x 7]>
#> 2 autonomy_emphasis   5.86      2     1192 0.00293     0.00974 <tibble [3 x 7]>

These can then be unnested with tidyr::unnest():

WoJ %>% 
  unianova(employment, autonomy_selection, autonomy_emphasis, post_hoc = TRUE) %>% 
  dplyr::select(Var, post_hoc) %>% 
  tidyr::unnest(post_hoc)
#> # A tibble: 6 x 8
#>   Var       term   contrast   null.value estimate conf.low conf.high adj.p.value
#>   <chr>     <chr>  <chr>           <dbl>    <dbl>    <dbl>     <dbl>       <dbl>
#> 1 autonomy~ emplo~ Part-time~          0  -0.0780   -0.257    0.101      0.562  
#> 2 autonomy~ emplo~ Freelance~          0  -0.139    -0.296    0.0186     0.0966 
#> 3 autonomy~ emplo~ Freelance~          0  -0.0607   -0.282    0.160      0.796  
#> 4 autonomy~ emplo~ Part-time~          0  -0.102    -0.278    0.0741     0.362  
#> 5 autonomy~ emplo~ Freelance~          0  -0.217    -0.372   -0.0629     0.00284
#> 6 autonomy~ emplo~ Freelance~          0  -0.115    -0.333    0.102      0.428

Compute correlation tables and matrices

correlate() will compute correlations for all combinations of the passed variables:

WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis)
#> # A tibble: 3 x 5
#>   x                  y                      r    df         p
#>   <chr>              <chr>              <dbl> <int>     <dbl>
#> 1 work_experience    autonomy_selection 0.161  1182 2.71e-  8
#> 2 work_experience    autonomy_emphasis  0.155  1180 8.87e-  8
#> 3 autonomy_selection autonomy_emphasis  0.644  1192 4.83e-141

If no variables passed, correlations for all combinations of numerical variables will be computed:

WoJ %>% 
  correlate()
#> # A tibble: 55 x 5
#>    x                  y                        r    df         p
#>    <chr>              <chr>                <dbl> <int>     <dbl>
#>  1 autonomy_selection autonomy_emphasis  0.644    1192 4.83e-141
#>  2 autonomy_selection ethics_1          -0.0766   1195 7.98e-  3
#>  3 autonomy_selection ethics_2          -0.0274   1195 3.43e-  1
#>  4 autonomy_selection ethics_3          -0.0257   1195 3.73e-  1
#>  5 autonomy_selection ethics_4          -0.0781   1195 6.89e-  3
#>  6 autonomy_selection work_experience    0.161    1182 2.71e-  8
#>  7 autonomy_selection trust_parliament  -0.00840  1195 7.72e-  1
#>  8 autonomy_selection trust_government   0.0414   1195 1.53e-  1
#>  9 autonomy_selection trust_parties      0.0269   1195 3.52e-  1
#> 10 autonomy_selection trust_politicians  0.0109   1195 7.07e-  1
#> # ... with 45 more rows

By default, Pearson’s product-moment correlations coefficients (\(r\)) will be computed. Set method to "kendall" to obtain Kendall’s \(\tau\) or to "spearman" to obtain Spearman’s \(\rho\) instead.

To obtain a correlation matrix, pass the output of correlate() to to_correlation_matrix():

WoJ %>% 
  correlate(work_experience, autonomy_selection, autonomy_emphasis) %>% 
  to_correlation_matrix()
#> # A tibble: 3 x 4
#>   r                  work_experience autonomy_selection autonomy_emphasis
#>   <chr>                        <dbl>              <dbl>             <dbl>
#> 1 work_experience              1                  0.161             0.155
#> 2 autonomy_selection           0.161              1                 0.644
#> 3 autonomy_emphasis            0.155              0.644             1