The analysis of proportions is of two primary types.
Built from standard base R functions, the lessR function Prop_test(),
abbreviated prop(), provides either type of analysis. To use, enter
either the original data from which to compute the frequencies and then
the sample proportions, or enter already computed frequencies. For the
analysis of multiple categorical variables across two levels of one of
the variables, the test of homogeneity and the test of independence
yield the identical statistical result.
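As a hedged illustration of that equivalence, the following sketch uses only base R, not lessR, with the Jacket by Bike frequencies that appear later in this vignette. The object names tab, homog, and indep are illustrative only.

```r
# Joint frequencies of Bike (rows) by Jacket (columns), copied from the
# Jackets example later in this vignette
tab <- matrix(c( 89, 135, 194,
                283, 207, 117),
              nrow = 2, byrow = TRUE,
              dimnames = list(Bike = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

# Homogeneity: is the proportion of BMW equal across the three jacket groups?
homog <- prop.test(x = tab["BMW", ], n = colSums(tab))

# Independence: are Bike and Jacket unrelated?
indep <- chisq.test(tab)

# Both routes give the same chi-square statistic and p-value
all.equal(unname(homog$statistic), unname(indep$statistic))
all.equal(homog$p.value, indep$p.value)
```

Because Bike has only two levels, comparing the BMW proportion across jacket groups and testing the full table for independence are the same computation.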
The following table summarizes the values of the Prop_test() parameters
for different analyses of proportions. Each function call for the
analysis of data begins with the name of a categorical variable,
generically referred to as X. The value of X is the first parameter in
the function definition, and so does not need its parameter name,
variable. If needed, indicate a second categorical variable, generically
referred to as Y, with the by parameter. If focused on a specific value
of X as a success, referred to as X_value, indicate that value with the
success parameter.
Run each analysis either directly from pre-computed values of the sample proportions, or from the original data from which the sample proportions are calculated.
Evaluate | Data Parameters | Count Parameters |
---|---|---|
A hypothesized proportion | X, success=X_value | n_succ, n_tot [scalars] |
Equal proportions across samples | X, success=X_value, by=Y | n_succ, n_tot [vectors] |
Uniform goodness-of-fit | X | n_tot [vector] |
Independence of two variables | X, by=Y | n_table |
The remainder of this vignette illustrates these applications of
Prop_test().
Define the occurrence of a designated value of the variable as a
success. Define all other values of the variable as failures. Of course,
success or failure in this context does not necessarily mean good or
bad, desired or undesired, but instead that a designated value either
occurred or did not.
When analyzing proportions from data, first indicate the categorical
variable, the value of the parameter variable. Next, indicate the
designated value of variable with the parameter success. When entering
proportions directly, indicate the number of successes and the total
number of trials with the n_succ and n_tot parameters. Enter the value
of each parameter either as a single value for one sample or as a vector
of multiple values for multiple samples. Without a value for success or
n_succ, the analysis is of goodness-of-fit or independence.
The example below is from the documentation for the base R function
binom.test(), which provides the exact test of a null hypothesis
regarding the probability of success. Prop_test() uses that base R
function to compare a sample proportion to a hypothesized population
value.
For a given categorical variable of interest, a type of plant,
consider two values, either “giant” or “dwarf”. From a sample of 925
plants, the specified value of “giant” occurred 682 times and did not
occur 243 times. The null hypothesis tested is that the specified value
occurs for 3/4 of the population, as set by the pi parameter.
Prop_test(n_succ=682, n_fail=243, pi=.75)
##
## <<< Exact binomial test of a proportion
##
## ------ Describe ------
##
## Number of successes: 682
## Number of failures: 243
## Number of trials: 925
## Sample proportion: 0.737
##
## ------ Infer ------
##
## Hypothesis test for null of 0.75, p-value: 0.382
## 95% Confidence interval: 0.708 to 0.765
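For comparison, the same test can be sketched directly with the base R function binom.test() that the text names as the underlying routine. The object name fit is illustrative only.

```r
# Exact binomial test: 682 successes in 925 trials, null proportion 0.75
fit <- binom.test(x = 682, n = 925, p = 0.75)

fit$estimate   # sample proportion, 682/925
fit$p.value    # two-sided exact p-value
fit$conf.int   # 95% Clopper-Pearson confidence interval
```

Rounded to three decimal digits, these values should agree with the Prop_test() output shown above.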
To illustrate with data, read the Jackets data file included
with lessR into the data frame d. The file
contains two categorical variables. The variable Bike
represents two different types of motorcycle: BMW and Honda. The second
variable is Jacket with three values of jacket thickness: Lite,
Med, and Thick. Because d is the default name of the data frame
that contains the variables for analysis, the data
parameter that names the input data frame need not be specified.
d <- Read("Jackets")
##
## >>> Suggestions
## Details about your data, Enter: details() for d, or details(name)
##
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
##
## Variable Missing Unique
## Name Type Values Values Values First and last values
## ------------------------------------------------------------------------------------------
## 1 Bike character 1025 0 2 BMW Honda Honda ... Honda Honda BMW
## 2 Jacket character 1025 0 3 Lite Lite Lite ... Lite Med Lite
## ------------------------------------------------------------------------------------------
In the following example, for the variable Bike from the default d data
frame, define the parameter success as the value “BMW”. The default null
hypothesis is a population value of 0.5, but here explicitly specify
that value with the parameter pi.
For clarity, the following example includes the parameter names
listed with their corresponding values. These names are unnecessary in
this example, however, because the values are listed in the same order
as the parameters in the definition of the Prop_test() function.
Prop_test(variable=Bike, success="BMW", pi=0.5)
##
## <<< Exact binomial test of a proportion
##
## variable: Bike
## success: BMW
##
## ------ Describe ------
##
## Number of missing values: 0
## Number of successes: 418
## Number of failures: 607
## Number of trials: 1025
## Sample proportion: 0.408
##
## ------ Infer ------
##
## Hypothesis test for null of 0.5, p-value: 0.000
## 95% Confidence interval: 0.378 to 0.439
Reject the null hypothesis, with a \(p\)-value of 0.000 to three decimal
digits, less than \(\alpha = 0.05\). The sample proportion \(p=0.408\) is
considered far from the hypothesized value of \(0.5\) for the proportion
of "BMW" values for Bike. Conclude that the data were sampled from a
population with a population proportion of BMW different from 0.5.
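The reported \(p\)-value of 0.000 is a rounded value, not an exact zero. A minimal base R sketch, using the counts of successes and failures from the output above and illustrative object names, shows the unrounded result.

```r
# Counts taken from the Prop_test() output above
n_bmw   <- 418   # successes: Bike is "BMW"
n_honda <- 607   # failures: Bike is "Honda"

fit <- binom.test(x = n_bmw, n = n_bmw + n_honda, p = 0.5)

# The exact p-value is positive but far below 0.001,
# so it prints as 0.000 at three decimal digits
format.pval(fit$p.value, digits = 3)
round(fit$conf.int, 3)
```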
The following example is from the base R prop.test() documentation,
which the lessR Prop_test() relies upon to compare proportions across
different groups.
The null hypothesis in this example is that the four populations of patients from which the samples were drawn have the same population proportion of smokers. The alternative is that at least one population proportion is different. Label the groups in the output by providing a named vector for the successes.
To indicate multiple proportions across groups, provide multiple
values for the n_succ and n_tot parameters. Optionally, name the groups.
smokers <- c(83, 90, 129, 70)
names(smokers) <- c("Group1","Group2","Group3","Group4")
patients <- c(86, 93, 136, 82)
Prop_test(n_succ=smokers, n_tot=patients)
##
## <<< 4-sample test for equality of proportions without continuity correction
##
##
## --- Description
##
## Group1 Group2 Group3 Group4
## ----------- ------- ------- ------- -------
## n_ 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## --- Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
The result of the test is \(p\)-value \(=0.006 < \alpha=0.05\), so reject the null hypothesis of equal probabilities across the corresponding four populations. Conclude that at least one of the population proportions of smokers differs from the others.
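The same counts can be passed to base R prop.test(), the function the text identifies as underlying this analysis. The following is a minimal sketch with an illustrative object name, fit.

```r
# Successes (smokers) and trials (patients) for the four groups
smokers  <- c(Group1 = 83, Group2 = 90, Group3 = 129, Group4 = 70)
patients <- c(86, 93, 136, 82)

# With more than two groups, prop.test() applies no continuity correction
fit <- prop.test(x = smokers, n = patients)

round(fit$estimate, 3)   # sample proportion of smokers in each group
fit$statistic            # chi-square statistic
fit$parameter            # degrees of freedom
fit$p.value
```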
The following example duplicates the previous results, but this time from data. To illustrate, create the data frame d according to the frequencies of smokers and non-smokers in each group, with respective values “smoke” and “nosmoke”. Of course, in actual data analysis the data would already be available.
sm1 <- c(rep("smoke", 83), rep("nosmoke", 3))
sm2 <- c(rep("smoke", 90), rep("nosmoke", 3))
sm3 <- c(rep("smoke", 129), rep("nosmoke", 7))
sm4 <- c(rep("smoke", 70), rep("nosmoke", 12))
sm <- c(sm1, sm2, sm3, sm4)
grp <- c(rep("A",86), rep("B",93), rep("C",136), rep("D",82))
d <- data.frame(sm, grp)
To test if the different groups have the same population proportion
of success, retain the syntax for a single proportion for the
categorical variable of interest. Define success by the value of this
variable, here “smoke”. However, an additional parameter by indicates
the variable that defines the groups, a variable that contains a label
that identifies the corresponding group for each row of data. The
grouping variable in this example is grp, with values the first four
uppercase letters of the alphabet. The first six rows of data are shown
below.
head(d)
## sm grp
## 1 smoke A
## 2 smoke A
## 3 smoke A
## 4 smoke A
## 5 smoke A
## 6 smoke A
The relevant parameters variable, success, and by are listed in their
given order in this example, so the parameter names are unnecessary.
List the names for clarity.
Prop_test(variable=sm, success="smoke", by=grp)
##
## <<< 4-sample test for equality of proportions without continuity correction
##
## variable: sm
## success: smoke
## by: grp
##
## --- Description
##
## A B C D
## ----------- ------ ------ ------ ------
## n_smoke 83 90 129 70
## n_total 86 93 136 82
## proportion 0.965 0.968 0.949 0.854
##
## --- Inference
##
## Chi-square statistic: 12.600
## Degrees of freedom: 3
## Hypothesis test of equal population proportions: p-value = 0.006
The analysis of data that matches the previously input proportions, of course, provides the same results as providing the proportions directly.
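To see why the two routes agree, the raw data can be tabulated back into the counts entered earlier. The following base R sketch rebuilds the same data under the hypothetical name d_smoke, so as not to overwrite d, and recovers the number of successes and trials per group.

```r
# Rebuild the illustrative data: smokers then non-smokers within each group
sm  <- rep(rep(c("smoke", "nosmoke"), 4),
           times = c(83, 3, 90, 3, 129, 7, 70, 12))
grp <- rep(c("A", "B", "C", "D"), times = c(86, 93, 136, 82))
d_smoke <- data.frame(sm, grp)   # hypothetical name for this sketch

# Cross-tabulate group by smoking status
counts <- table(d_smoke$grp, d_smoke$sm)

n_succ <- counts[, "smoke"]   # successes per group
n_tot  <- rowSums(counts)     # trials per group

# The recovered counts feed the same test as before
prop.test(x = n_succ, n = n_tot)
```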
For the previously discussed test of homogeneity of the values of a single categorical variable, the proportion of occurrences for a specific value across different samples is of interest. Here, instead calculate the proportion of occurrence for each value from the total number of occurrences, as one sample from a single population. In addition to the inference test, the following are also reported:
- The observed and expected frequencies
- The residual of expected from observed
- The standardized version of the residual
For the goodness-of-fit test to a uniform distribution, provide the
frequencies for each group for the parameter n_tot. The default null
hypothesis is that the proportions of the different categories of a
categorical variable are equal.
In this example, enter three frequencies as a vector for the n_tot
parameter value. Optionally, make the vector a named vector to label the
output accordingly.
x = c(372, 342, 311)
names(x) = c("Lite", "Med", "Thick")
Prop_test(n_tot=x)
##
## <<< Chi-squared test for given probabilities
##
##
## --- Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## --- Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
This example does not quite attain significance at the customary 5% level, with \(p\)-value \(= 0.066 > \alpha = 0.05\). A difference of the corresponding population proportions was not detected.
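The quantities in this output can be reproduced with base R chisq.test(), which by default tests a vector of counts against equal probabilities. This is a sketch with illustrative object names, not the lessR implementation itself.

```r
# Observed frequencies for the three jacket types
x <- c(Lite = 372, Med = 342, Thick = 311)

# Default null hypothesis: each category has probability 1/3
fit <- chisq.test(x)

fit$expected              # n/3 for each category
round(fit$residuals, 3)   # Pearson residuals: (observed - expected) / sqrt(expected)
round(fit$stdres, 3)      # standardized residuals
fit$statistic             # sum of the squared Pearson residuals
fit$p.value
```

The residual row of the output above corresponds to the Pearson residuals, which is why their squares sum to the chi-square statistic.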
The same analysis follows from the data. Just specify the name of the
categorical variable of interest.
d <- Read("Jackets", quiet=TRUE)
Prop_test(Jacket)
##
## <<< Chi-squared test for given probabilities
##
## variable: Jacket
##
## --- Description
##
## Lite Med Thick
## --------- -------- -------- --------
## observed 372 342 311
## expected 341.667 341.667 341.667
## residual 1.641 0.018 -1.659
## stdn res 2.010 0.022 -2.032
##
## --- Inference
##
## Chi-square statistic: 5.446
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.066
Tests of independence evaluated here rely upon a two-dimensional contingency table, also called a cross-tabulation table or joint frequency table. Enter the joint frequencies directly or compute them from the data. The corresponding analysis provides the chi-square test for the null hypothesis of independence.
Also provided is Cramer’s V to indicate the extent of the relationship of the two categorical variables. For each cell frequency, the expected value given the independence assumption is provided, along with the corresponding residual from the observed frequency and the corresponding standardized residual.
To enter the joint frequency table directly, store the frequencies in
a file accessible from your computer system. One possibility is to enter
the numbers into a text file with file type .csv or .txt. Enter the
numbers with a text editor, or with a word processor saving the file as
a text file. This file format separates the adjacent values in each row
with a comma, as indicated below. Or, enter the numbers into an MS Excel
formatted file with file type .xlsx. Enter only the numeric frequencies,
no labels.
For example, consider the following joint frequency table with four
levels of the column variable and four levels of the row variable, here
in csv format.
3,58,6,105
41,79,9,207
86,179,27,484
143,214,31,824
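Before passing such a file to an analysis, it can help to confirm that it parses as a purely numeric table. The following base R sketch writes the four rows above to a temporary file and reads them back. The temporary path and the object names are assumptions for illustration, and this check is separate from how Prop_test() itself reads the file.

```r
# The four rows of joint frequencies, exactly as listed above
csv_lines <- c("3,58,6,105",
               "41,79,9,207",
               "86,179,27,484",
               "143,214,31,824")

# Hypothetical location: a temporary file stands in for your own path
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# No header row, because the file holds only frequencies, no labels
freq <- as.matrix(read.csv(path, header = FALSE))

dim(freq)          # rows and columns read
is.numeric(freq)   # TRUE when every entry parsed as a number
sum(freq)          # total number of observations in the table
```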
After saving the file, call Prop_test() using the parameter n_table to
indicate the path name to the file, enclosed in quotes. Or, leave the
quotes empty to browse for the joint frequency table.
This table is included in a file downloaded with lessR with the name FreqTable99. That name triggers an internal process that locates the file within the lessR installation without needing to construct a rather complicated path name as part of this example. That also means that the name becomes a reserved key word with its use always triggering the following example.
In general, replace FreqTable99 in this example with your own path name to your file of joint frequencies, or just delete the name leaving only the two quotes to indicate to browse for the file.
Prop_test(n_table="FreqTable99")
##
## <<< Pearson's Chi-squared test
##
## --- Description
##
## Cell Frequencies
## 3 58 6 105
## 41 79 9 207
## 86 179 27 484
## 143 214 31 824
##
## Cramer's V: 0.075
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 3 18.812 -15.812 -4.003
## 1 2 58 36.522 21.478 4.150
## 1 3 6 5.030 0.970 0.455
## 1 4 105 111.635 -6.635 -1.098
## 2 1 41 36.750 4.250 0.799
## 2 2 79 71.346 7.654 1.098
## 2 3 9 9.827 -0.827 -0.288
## 2 4 207 218.077 -11.077 -1.361
## 3 1 86 84.875 1.125 0.156
## 3 2 179 164.776 14.224 1.504
## 3 3 27 22.696 4.304 1.105
## 3 4 484 503.654 -19.654 -1.781
## 4 1 143 132.562 10.438 1.339
## 4 2 214 257.356 -43.356 -4.246
## 4 3 31 35.447 -4.447 -1.057
## 4 4 824 786.635 37.365 3.135
##
## --- Inference
##
## Chi-square statistic: 41.732
## Degrees of freedom: 9
## Hypothesis test of equal population proportions: p-value = 0.000
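As a rough cross-check of this output, the following base R sketch applies chisq.test() to the same frequencies and computes Cramer's V by its usual formula, the square root of the chi-square statistic divided by n times one less than the smaller table dimension. That lessR uses exactly this formula is an assumption here, though the result matches the reported value when rounded.

```r
# Joint frequencies from the table above, entered row by row
freq <- matrix(c(  3,  58,  6, 105,
                  41,  79,  9, 207,
                  86, 179, 27, 484,
                 143, 214, 31, 824),
               nrow = 4, byrow = TRUE)

fit <- chisq.test(freq)   # Pearson chi-square test of independence

# Expected frequency of a cell: row total * column total / grand total
round(fit$expected, 3)

# Cramer's V (assumed standard formula)
n <- sum(freq)
k <- min(dim(freq))
cramers_v <- sqrt(unname(fit$statistic) / (n * (k - 1)))
round(cramers_v, 3)
```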
Do not have the path name to your file readily available? Then browse for the file. The following example is not run as it cannot run in this vignette.
Prop_test(n_table="")
The full path name for the file is provided as part of the output.
The \(\chi^2\) test of independence evaluated here applies to two
categorical variables. The first categorical variable listed in this
example is the value of the parameter variable, the first parameter in
the function definition, so does not need the parameter name. The second
categorical variable listed must include the parameter name by.
The question for the analysis is if the observed frequencies of Jacket thickness and Bike ownership sufficiently differ from the frequencies expected by the null hypothesis that we conclude the variables are related.
Prop_test(Jacket, by=Bike)
## variable: Jacket
## by: Bike
##
## <<< Pearson's Chi-squared test
##
## --- Description
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Row Col Observed Expected Residual Stnd Res
## 1 1 89 151.703 -62.703 -8.288
## 1 2 135 139.469 -4.469 -0.602
## 1 3 194 126.827 67.173 9.287
## 2 1 283 220.297 62.703 8.288
## 2 2 207 202.531 4.469 0.602
## 2 3 117 184.173 -67.173 -9.287
##
## --- Inference
##
## Chi-square statistic: 104.083
## Degrees of freedom: 2
## Hypothesis test of equal population proportions: p-value = 0.000
The result of this test is that the \(p\)-value = 0.000 \(< \alpha=0.05\), so reject the null hypothesis of independence. Conclude that the type of Bike a person rides and the thickness of their Jacket are related.
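The standardized residuals indicate which cells drive the rejection. A minimal base R sketch, with illustrative object names and the frequencies copied from the output above, reproduces them with chisq.test().

```r
# Bike (rows) by Jacket (columns) frequencies from the output above
tab <- matrix(c( 89, 135, 194,
                283, 207, 117),
              nrow = 2, byrow = TRUE,
              dimnames = list(Bike = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

fit <- chisq.test(tab)

# Standardized residuals: large absolute values mark the cells
# that depart most from independence
round(fit$stdres, 3)

# Cells with an absolute standardized residual larger than 2
which(abs(fit$stdres) > 2, arr.ind = TRUE)
```

Only the Lite and Thick columns show large standardized residuals, while the Med column is close to what independence predicts.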
To visualize the relationship of the two variables, use the same
function call syntax, but now with BarChart() instead of Prop_test().
The visualization is accompanied by the same \(\chi^2\) test of
independence.
BarChart(Jacket, by=Bike)
## >>> Suggestions
## Plot(Jacket, Bike) # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE) # horizontal bar chart
## BarChart(Jacket, fill="steelblue") # steelblue bars
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Chi-square Test: Chisq = 104.083, df = 2, p-value = 0.000
The visualization depicts the relationship between motorcycle and jacket: Honda riders prefer thinner jackets, and BMW riders prefer thicker jackets. To speculate, perhaps because the BMW bikes are sportier, their riders are more concerned with going down on the pavement.
This relationship becomes even clearer with the corresponding 100% stacked bar graph. Each bar in this visualization represents a jacket choice and shows the percentage of riders with each type of motorcycle for that jacket.
BarChart(Jacket, by=Bike, stack100=TRUE)
## >>> Suggestions
## Plot(Jacket, Bike) # bubble plot
## BarChart(Jacket, by=Bike, horiz=TRUE) # horizontal bar chart
## BarChart(Jacket, fill="steelblue") # steelblue bars
##
## Joint and Marginal Frequencies
## ------------------------------
##
## Jacket
## Bike Lite Med Thick Sum
## BMW 89 135 194 418
## Honda 283 207 117 607
## Sum 372 342 311 1025
##
## Cramer's V: 0.319
##
## Chi-square Test: Chisq = 104.083, df = 2, p-value = 0.000
##
## Cell Proportions within Each Column
## -----------------------------------
##
## Jacket
## Bike Lite Med Thick
## BMW 0.239 0.395 0.624
## Honda 0.761 0.605 0.376
## Sum 1.000 1.000 1.000
From this visualization we see that 24% of Lite jacket owners are BMW riders, and, in contrast, 62% of the owners of Thick jackets are BMW riders.
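The column proportions reported above follow from dividing each cell by its column total. The following base R sketch, with illustrative object names, reproduces them with prop.table().

```r
# Bike (rows) by Jacket (columns) frequencies from the output above
tab <- matrix(c( 89, 135, 194,
                283, 207, 117),
              nrow = 2, byrow = TRUE,
              dimnames = list(Bike = c("BMW", "Honda"),
                              Jacket = c("Lite", "Med", "Thick")))

# margin = 2 divides each cell by its column (Jacket) total
col_props <- prop.table(tab, margin = 2)

round(col_props, 3)
colSums(col_props)   # each column sums to 1
```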