Anatomy of a Significance Test
The Goal
We want to test whether an attribute differs between two sub-populations, relative to what we would see for randomly mixed sub-populations, and to provide numerical evidence.
The Null Hypothesis
The following equivalent statements form the null hypothesis, H0, that we are testing.
- H0: The sub-populations P1 and P2 were randomly drawn from the same population
- H0: The sub-populations P1 and P2 were created by randomly assigning units of the same population to each of P1 and P2
- H0: The sub-populations P1 and P2 were randomly generated.
Note that H0 is weaker when stated in the form $a(\mathcal P_1) = a(\mathcal P_2)$, although that form is still correct. That is why we should not state H0 in terms of equality of attribute values.
The alternative hypothesis, HA, is the complement of H0.
The Discrepancy Measure
A discrepancy measure, $\mathcal D(\mathcal P_1,\mathcal P_2)$, is
- also called a test statistic
- defined to be a specific population attribute of the combined population $\mathcal P = \{\mathcal P_1, \mathcal P_2\}$, chosen to measure properties such as equivariance and invariance
- used to quantify the inconsistency of our data with the null hypothesis; larger discrepancy measures indicate stronger evidence against the null hypothesis
- typically based on differences for measures of location, and on ratios for measures of spread. For example:
- A discrepancy measure for hypothesizing that the averages of the two sub-populations are the same might be $\mathcal D(\mathcal P_1,\mathcal P_2) = \left\lvert \bar{y}_1 - \bar{y}_2 \right\rvert$
- A discrepancy measure for hypothesizing that the standard deviations of the two sub-populations are the same might be $\mathcal D(\mathcal P_1,\mathcal P_2) = \left\lvert 1- \frac{SD(\mathcal P_1)}{SD(\mathcal P_2)} \right \rvert$
- A discrepancy measure for hypothesizing that the average of the first sub-population is larger than the average of the second sub-population might be $\mathcal D(\mathcal P_1,\mathcal P_2) = \bar{y}_2 - \bar{y}_1$
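To make these concrete, here is a minimal sketch in R that computes the two kinds of example measure on small invented sub-populations (the values are purely illustrative; note that `sd()` computes the sample standard deviation):

```r
# Toy sub-populations (values invented for illustration)
P1 <- c(70, 75, 80, 85)
P2 <- c(60, 65, 70, 75)

# Location-based measure: the absolute difference of averages
D_loc <- abs(mean(P1) - mean(P2))

# Ratio-based measure of spread: |1 - SD(P1)/SD(P2)|
D_spread <- abs(1 - sd(P1) / sd(P2))

D_loc     # 10: the averages differ by 10 marks
D_spread  # 0: the two sub-populations have identical spread
```

Notice that a location-based measure and a spread-based measure can tell very different stories about the same pair of sub-populations.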
The Observed Discrepancy
The discrepancy measure, $\mathcal D(\mathcal P_1,\mathcal P_2)$, calculated on the two unshuffled (observed) sub-populations is called the observed discrepancy, $d_{obs} = \mathcal D(\mathcal P_1,\mathcal P_2)$.
- a discrepancy measure captures only one type of discrepancy between the sub-populations at a time for a null hypothesis; all other differences are ignored
- for instance, P1 and P2 could have similar standard deviations but different skewness
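As a quick illustration of this point, the following sketch builds two invented sub-populations with matching standard deviations but visibly different skewness (the `skew` helper is an ad hoc standardized third moment, not from any package):

```r
# Toy illustration: equal spread, different shape (values invented)
P1 <- c(-2, -1, 0, 1, 2)        # symmetric
P2 <- c(0, 0, 0, 1, 4)          # right-skewed
P2 <- P2 * sd(P1) / sd(P2)      # rescale so the standard deviations match

# Ad hoc skewness: standardized third moment
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)

sd(P1) - sd(P2)   # 0: a spread-based discrepancy measure sees nothing
skew(P1)          # 0: symmetric
skew(P2)          # positive: right-skewed
```

A spread-based discrepancy measure is completely blind to the difference in shape here; detecting it would require a different choice of $\mathcal D$.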
The Observed p-value
The probability that a randomly chosen pair of sub-populations has a discrepancy measure at least as large as the observed discrepancy is called the observed p-value: $\text{p-value} = P(\mathcal D \geq d_{obs} \mid H_0 \text{ is true})$
To approximate the p-value, we generate a large number, M, of shuffled pairs.
- the M shuffled pairs are generated as $(\mathcal P_{1,1}^{*}, \mathcal P_{2,1}^{*}), (\mathcal P_{1,2}^{*}, \mathcal P_{2,2}^{*}), \ldots, (\mathcal P_{1,M}^{*}, \mathcal P_{2,M}^{*})$
The approximated p-value is $\widehat{\text{p-value}} = \frac{1}{M} \sum_{i=1}^{M} I\left(\mathcal D \left (\mathcal P_{1,i}^{*}, \mathcal P_{2,i}^{*} \right) \geq d_{obs} \right)$, where $d_{obs} = \mathcal D(\mathcal P_1,\mathcal P_2)$.
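This estimator can be sketched directly in R on tiny invented sub-populations, using the simple location discrepancy $\lvert \bar y_1 - \bar y_2 \rvert$ and `replicate` for the M shuffles:

```r
# Approximate permutation p-value with M shuffles (toy values invented)
set.seed(1)
P1 <- c(5, 7, 9); P2 <- c(1, 2, 3)
mix <- c(P1, P2); n1 <- length(P1)
d_obs <- abs(mean(P1) - mean(P2))

M <- 1000
d_star <- replicate(M, {
  keep <- sample(seq_along(mix), n1)   # one random shuffle
  abs(mean(mix[keep]) - mean(mix[-keep]))
})

p_hat <- mean(d_star >= d_obs)   # proportion at least as extreme
p_hat
```

The `mean` of the logical vector is exactly the average of the indicator $I(\mathcal D \geq d_{obs})$ over the M shuffles.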
To calculate the exact p-value instead of an approximation, use $M = {N_1 +N_2 \choose N_1}= {N_1 +N_2 \choose N_2}$, so that all possible shuffles are considered. The exact p-value is rarely calculated in practice because the number of possible shuffles grows combinatorially, making full enumeration computationally infeasible for all but the smallest populations.
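When the sub-populations are tiny, however, full enumeration is feasible. Here is a sketch using `combn` to list all $\binom{N_1+N_2}{N_1}$ assignments, on invented toy values:

```r
# Exact permutation p-value by enumerating all shuffles (toy values)
P1 <- c(5, 7, 9); P2 <- c(1, 2, 3)
mix <- c(P1, P2)
d_obs <- abs(mean(P1) - mean(P2))

# All choose(6, 3) = 20 ways to pick units for the first sub-population
idx <- combn(length(mix), length(P1))
d_all <- apply(idx, 2, function(i) abs(mean(mix[i]) - mean(mix[-i])))

p_exact <- mean(d_all >= d_obs)
p_exact   # 0.1: only the observed split and its mirror image are as extreme
```

With only 20 possible shuffles this is instant, but for realistic sizes (say $N_1 = N_2 = 50$, giving about $10^{29}$ shuffles) the approximation with M random shuffles is the only practical option.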
Smaller p-values indicate more evidence against the null hypothesis.
- p-value < 0.001 shows very strong evidence against H0
- 0.001 < p-value < 0.01 shows that there is strong evidence against H0
- 0.01 < p-value < 0.05 shows that there is evidence against H0
- 0.05 < p-value < 0.1 shows that there is weak evidence against H0
- p-value > 0.1 shows that there is no evidence against H0
- If p-value = 0, we effectively have a proof by contradiction: something impossible under H0 has been observed, so the null hypothesis must be false.
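The scale above can be wrapped in a small hypothetical helper (the function name and wording are our own, not standard R):

```r
# Hypothetical helper mapping a p-value to the evidence wording above
evidence <- function(p) {
  if (p < 0.001)      "very strong evidence against H0"
  else if (p < 0.01)  "strong evidence against H0"
  else if (p < 0.05)  "evidence against H0"
  else if (p < 0.1)   "weak evidence against H0"
  else                "no evidence against H0"
}

evidence(0.0054)  # "strong evidence against H0"
evidence(0.2)     # "no evidence against H0"
```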
Example
This example compares the final exam grades of male and female students. First, let's set things up!
Marks = read.csv("Marks.csv", header=TRUE)
Final.Female = Marks$Final[Marks$Gender=="Female"]
Final.Male = Marks$Final[Marks$Gender=="Male"]
pop = list(pop1 = Final.Female , pop2 = Final.Male)
We will consider the following two discrepancy measures: $\mathcal D_1(\mathcal P_1,\mathcal P_2)=\frac{SD(\mathcal P_1)}{SD(\mathcal P_2)}-1$ and $\mathcal D_2(\mathcal P_1,\mathcal P_2)=\frac{\bar{Y}_1 - \bar{Y}_2}{\tilde{\sigma} \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$, where $\tilde{\sigma}$ is the pooled standard deviation. Note that $\mathcal D_1$ compares a measure of spread and $\mathcal D_2$ compares a measure of central tendency.
First, let's define a function to generate random shuffles in R.
mixRandomly <- function(pop) {
  pop1 <- pop$pop1; n_pop1 <- length(pop1)
  pop2 <- pop$pop2; n_pop2 <- length(pop2)
  ## Pool all units, then randomly select n_pop1 of them for the new pop1
  mix <- c(pop1, pop2)
  select4pop1 <- sample(1:(n_pop1 + n_pop2), n_pop1, replace = FALSE)
  new_pop1 <- mix[select4pop1]; new_pop2 <- mix[-select4pop1]
  list(pop1 = new_pop1, pop2 = new_pop2)
}
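A quick sanity check of the shuffling idea, restated inline so it runs on its own (toy values invented): a shuffle preserves the group sizes and the combined set of values; only the assignment of units changes.

```r
# One random shuffle, same idea as mixRandomly (toy values invented)
set.seed(123)
pop1 <- c(1, 2, 3); pop2 <- c(10, 20, 30, 40)
mix <- c(pop1, pop2)
keep <- sample(seq_along(mix), length(pop1))  # indices for the new pop1
new_pop1 <- mix[keep]; new_pop2 <- mix[-keep]

length(new_pop1)              # 3: group sizes are preserved
sort(c(new_pop1, new_pop2))   # 1 2 3 10 20 30 40: values unchanged
```

This is exactly why shuffling simulates H0: under the null, the labels carry no information, so any reassignment of units is as plausible as the observed one.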
Now, we make functions that calculate our discrepancy measures.
# D1 as a function
D1Fn <- function(pop) {
  ## First sub-population: population (not sample) standard deviation
  pop1 <- pop$pop1; n1 <- length(pop1); SD1 <- sqrt(var(pop1) * (n1 - 1)/n1)
  ## Second sub-population
  pop2 <- pop$pop2; n2 <- length(pop2); SD2 <- sqrt(var(pop2) * (n2 - 1)/n2)
  ## Determine the test statistic
  temp <- SD1/SD2 - 1
  temp
}
# D2 as a function
D2Fn <- function(pop) {
  ## First sub-population
  pop1 <- pop$pop1; n1 <- length(pop1); m1 <- mean(pop1); v1 <- var(pop1)
  ## Second sub-population
  pop2 <- pop$pop2; n2 <- length(pop2); m2 <- mean(pop2); v2 <- var(pop2)
  ## Pool the variances
  v <- ((n1 - 1) * v1 + (n2 - 1) * v2)/(n1 + n2 - 2)
  ## Determine the t-statistic
  temp <- (m1 - m2) / sqrt(v * ((1/n1) + (1/n2)))
  temp
}
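Because $\mathcal D_2$ is exactly the classic pooled two-sample t-statistic, a useful check is that the hand computation matches `t.test()` with `var.equal = TRUE` (toy values invented here):

```r
# Toy check: the pooled formula in D2 matches t.test(var.equal = TRUE)
x <- c(5, 7, 9, 11); y <- c(1, 2, 3)
n1 <- length(x); n2 <- length(y)
v  <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
d2 <- (mean(x) - mean(y)) / sqrt(v * (1/n1 + 1/n2))

t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)
d2 - t_stat   # 0 (up to floating point)
```

The difference here is that the permutation test gets its reference distribution from shuffling rather than from the t distribution, so it does not need the normality assumptions of the classical t-test.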
Here, we generate the histograms of our two discrepancy measures based on 5000 shuffles of the two sub-populations.
par(mfrow = c(1, 2))
# Plot for D1
D1Vals <- sapply(1:5000, FUN = function(...) { D1Fn(mixRandomly(pop)) })
hist(D1Vals, breaks = 40, probability = TRUE,
     main = "Permuted populations", xlab = "D1 statistic")
abline(v = D1Fn(pop), col = "red", lwd = 2)
# Plot for D2
D2Vals <- sapply(1:5000, FUN = function(...) { D2Fn(mixRandomly(pop)) })
hist(D2Vals, breaks = 40, probability = TRUE,
     main = "Permuted populations", xlab = "D2 statistic")
abline(v = D2Fn(pop), col = "red", lwd = 2)
From these histograms, we can see that the two sub-populations of male vs. female final exam grades seem to be a random mix of the parent population when considering the measure of spread, $\mathcal D_1$. On the other hand, they do not seem to be a random mix of the parent population when considering the measure of central tendency, $\mathcal D_2$.
Now, let's test the null hypothesis and check whether our two sub-populations were indeed generated by random mixing, using $\mathcal D_2(\mathcal P_1,\mathcal P_2)$.
D2obs <- D2Fn(pop)
## Two-sided p-value: proportion of shuffles at least as extreme as observed
mean(abs(D2Vals) >= abs(D2obs))
## [1] 0.0054
Since our estimated p-value, 0.0054, lies between 0.001 and 0.01, there is strong evidence against the null hypothesis. In the context of our situation, there is strong evidence that the final marks of male vs. female students are not a random mix of the parent population.