Anatomy of a Significance Test
The Goal
We want to test whether an attribute differs between two sub-populations, relative to what we would see for randomly mixed sub-populations, and to provide numerical evidence.
The Null Hypothesis
The following equivalent statements form the null hypothesis, H0, that we are testing.
- H0: The sub-populations P1 and P2 were randomly drawn from the same population
- H0: The sub-populations P1 and P2 were created by randomly assigning units of the same population to each of P1 and P2
- H0: The sub-populations P1 and P2 were randomly generated.
Note that H0 is weaker when stated in the form $a(\mathcal P_1) = a(\mathcal P_2)$, although that form is still correct. That is why we should not state H0 in terms of equality of attribute values.
The alternative hypothesis, HA, is the complement of H0.
The Discrepancy Measure
A discrepancy measure, $\mathcal D(\mathcal P_1,\mathcal P_2)$, is
- also called a test statistic
- defined to be a specific population attribute of the combined population $\mathcal P = \{\mathcal P_1, \mathcal P_2\}$, chosen to measure properties such as equivariance and invariance
- used to quantify the inconsistency of our data with the null hypothesis; larger discrepancy measures indicate stronger evidence against the null hypothesis
- typically based on differences for measures of location, and on ratios for measures of spread. For example:
- A discrepancy measure for hypothesizing that the averages of the two sub-populations are the same might be $\mathcal D(\mathcal P_1,\mathcal P_2) = \left\lvert \bar{y}_1 - \bar{y}_2 \right\rvert$
- A discrepancy measure for hypothesizing that the standard deviations of the two sub-populations are the same might be $\mathcal D(\mathcal P_1,\mathcal P_2) = \left\lvert 1- \frac{SD(\mathcal P_1)}{SD(\mathcal P_2)} \right \rvert$
- A discrepancy measure for hypothesizing that the average of the first sub-population is larger than the average of the second sub-population might be $\mathcal D(\mathcal P_1,\mathcal P_2) = \bar{y}_2 - \bar{y}_1$
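To make these concrete, here is a minimal sketch in R that computes the two kinds of example measure on small invented sub-populations (the values are purely illustrative; note that `sd()` computes the sample standard deviation):

```r
# Toy sub-populations (values invented for illustration)
P1 <- c(70, 75, 80, 85)
P2 <- c(60, 65, 70, 75)

# Location-based measure: the absolute difference of averages
D_loc <- abs(mean(P1) - mean(P2))

# Ratio-based measure of spread: |1 - SD(P1)/SD(P2)|
D_spread <- abs(1 - sd(P1) / sd(P2))

D_loc     # 10: the averages differ by 10 marks
D_spread  # 0: the two sub-populations have identical spread
```

Notice that a location-based measure and a spread-based measure can tell very different stories about the same pair of sub-populations.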
The Observed Discrepancy
The discrepancy measure, $\mathcal D(\mathcal P_1,\mathcal P_2)$, calculated on the two unshuffled (observed) sub-populations is called the observed discrepancy, $d_{obs} = \mathcal D(\mathcal P_1,\mathcal P_2)$.
- a discrepancy measure captures only one type of discrepancy between the sub-populations at a time for a null hypothesis; all other differences are ignored
- for instance, P1 and P2 could have similar standard deviations but different skewness
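As a quick illustration of this point, the following sketch builds two invented sub-populations with matching standard deviations but visibly different skewness (the `skew` helper is an ad hoc standardized third moment, not from any package):

```r
# Toy illustration: equal spread, different shape (values invented)
P1 <- c(-2, -1, 0, 1, 2)        # symmetric
P2 <- c(0, 0, 0, 1, 4)          # right-skewed
P2 <- P2 * sd(P1) / sd(P2)      # rescale so the standard deviations match

# Ad hoc skewness: standardized third moment
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)

sd(P1) - sd(P2)   # 0: a spread-based discrepancy measure sees nothing
skew(P1)          # 0: symmetric
skew(P2)          # positive: right-skewed
```

A spread-based discrepancy measure is completely blind to the difference in shape here; detecting it would require a different choice of $\mathcal D$.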
The Observed p-value
The probability that a randomly chosen pair of sub-populations has a discrepancy measure at least as large as the observed discrepancy is called the observed p-value: $\text{p-value} = P(\mathcal D \geq d_{obs} \mid H_0 \text{ is true})$
To approximate the p-value, we generate a large number, M, of shuffled pairs.
- the M shuffled pairs are generated as $(\mathcal P_{1,1}^{*}, \mathcal P_{2,1}^{*}), (\mathcal P_{1,2}^{*}, \mathcal P_{2,2}^{*}), \ldots, (\mathcal P_{1,M}^{*}, \mathcal P_{2,M}^{*})$
The approximated p-value is $\widehat{\text{p-value}} = \frac{1}{M} \sum_{i=1}^{M} I\left(\mathcal D \left (\mathcal P_{1,i}^{*}, \mathcal P_{2,i}^{*} \right) \geq d_{obs} \right)$, where $d_{obs} = \mathcal D(\mathcal P_1,\mathcal P_2)$.
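This estimator can be sketched directly in R on tiny invented sub-populations, using the simple location discrepancy $\lvert \bar y_1 - \bar y_2 \rvert$ and `replicate` for the M shuffles:

```r
# Approximate permutation p-value with M shuffles (toy values invented)
set.seed(1)
P1 <- c(5, 7, 9); P2 <- c(1, 2, 3)
mix <- c(P1, P2); n1 <- length(P1)
d_obs <- abs(mean(P1) - mean(P2))

M <- 1000
d_star <- replicate(M, {
  keep <- sample(seq_along(mix), n1)   # one random shuffle
  abs(mean(mix[keep]) - mean(mix[-keep]))
})

p_hat <- mean(d_star >= d_obs)   # proportion at least as extreme
p_hat
```

The `mean` of the logical vector is exactly the average of the indicator $I(\mathcal D \geq d_{obs})$ over the M shuffles.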
To calculate the exact p-value instead of an approximation, use $M = {N_1 +N_2 \choose N_1}= {N_1 +N_2 \choose N_2}$, so that all possible shuffles are considered. The exact p-value is rarely calculated in practice because the number of possible shuffles grows combinatorially, making full enumeration computationally infeasible for all but the smallest populations.
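When the sub-populations are tiny, however, full enumeration is feasible. Here is a sketch using `combn` to list all $\binom{N_1+N_2}{N_1}$ assignments, on invented toy values:

```r
# Exact permutation p-value by enumerating all shuffles (toy values)
P1 <- c(5, 7, 9); P2 <- c(1, 2, 3)
mix <- c(P1, P2)
d_obs <- abs(mean(P1) - mean(P2))

# All choose(6, 3) = 20 ways to pick units for the first sub-population
idx <- combn(length(mix), length(P1))
d_all <- apply(idx, 2, function(i) abs(mean(mix[i]) - mean(mix[-i])))

p_exact <- mean(d_all >= d_obs)
p_exact   # 0.1: only the observed split and its mirror image are as extreme
```

With only 20 possible shuffles this is instant, but for realistic sizes (say $N_1 = N_2 = 50$, giving about $10^{29}$ shuffles) the approximation with M random shuffles is the only practical option.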
Smaller p-values indicate more evidence against the null hypothesis.
- p-value < 0.001 shows very strong evidence against H0
- 0.001 < p-value < 0.01 shows that there is strong evidence against H0
- 0.01 < p-value < 0.05 shows that there is evidence against H0
- 0.05 < p-value < 0.1 shows that there is weak evidence against H0
- p-value > 0.1 shows that there is no evidence against H0
- If p-value = 0, we effectively have a proof by contradiction: something impossible under H0 has been observed, so the null hypothesis must be false.
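The scale above can be wrapped in a small hypothetical helper (the function name and wording are our own, not standard R):

```r
# Hypothetical helper mapping a p-value to the evidence wording above
evidence <- function(p) {
  if (p < 0.001)      "very strong evidence against H0"
  else if (p < 0.01)  "strong evidence against H0"
  else if (p < 0.05)  "evidence against H0"
  else if (p < 0.1)   "weak evidence against H0"
  else                "no evidence against H0"
}

evidence(0.0054)  # "strong evidence against H0"
evidence(0.2)     # "no evidence against H0"
```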
Example
This example compares the final exam grades of male and female students. First, let's set things up!
Marks = read.csv("Marks.csv", header=TRUE)
Final.Female = Marks$Final[Marks$Gender=="Female"]
Final.Male = Marks$Final[Marks$Gender=="Male"]
pop = list(pop1 = Final.Female , pop2 = Final.Male)
We will consider the following two discrepancy measures: $\mathcal D_1(\mathcal P_1,\mathcal P_2)=\frac{SD(\mathcal P_1)}{SD(\mathcal P_2)}-1$ and $\mathcal D_2(\mathcal P_1,\mathcal P_2)=\frac{\bar{Y}_1 - \bar{Y}_2}{\tilde{\sigma} \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}$, where $\tilde{\sigma}$ is the pooled standard deviation. Note that $\mathcal D_1$ compares a measure of spread and $\mathcal D_2$ compares a measure of central tendency.
First, let's define a function to generate random shuffles in R.
mixRandomly <- function(pop) {
  pop1 <- pop$pop1; n_pop1 <- length(pop1)
  pop2 <- pop$pop2; n_pop2 <- length(pop2)
  ## Pool all units, then randomly select n_pop1 of them for the new pop1
  mix <- c(pop1, pop2)
  select4pop1 <- sample(1:(n_pop1 + n_pop2), n_pop1, replace = FALSE)
  new_pop1 <- mix[select4pop1]; new_pop2 <- mix[-select4pop1]
  list(pop1 = new_pop1, pop2 = new_pop2)
}
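A quick sanity check of the shuffling idea, restated inline so it runs on its own (toy values invented): a shuffle preserves the group sizes and the combined set of values; only the assignment of units changes.

```r
# One random shuffle, same idea as mixRandomly (toy values invented)
set.seed(123)
pop1 <- c(1, 2, 3); pop2 <- c(10, 20, 30, 40)
mix <- c(pop1, pop2)
keep <- sample(seq_along(mix), length(pop1))  # indices for the new pop1
new_pop1 <- mix[keep]; new_pop2 <- mix[-keep]

length(new_pop1)              # 3: group sizes are preserved
sort(c(new_pop1, new_pop2))   # 1 2 3 10 20 30 40: values unchanged
```

This is exactly why shuffling simulates H0: under the null, the labels carry no information, so any reassignment of units is as plausible as the observed one.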
Now, we make functions that calculate our discrepancy measures.
# D1 as a function
D1Fn <- function(pop) {
  ## First sub-population: population (not sample) standard deviation
  pop1 <- pop$pop1; n1 <- length(pop1); SD1 <- sqrt(var(pop1) * (n1 - 1)/n1)
  ## Second sub-population
  pop2 <- pop$pop2; n2 <- length(pop2); SD2 <- sqrt(var(pop2) * (n2 - 1)/n2)
  ## Determine the test statistic
  temp <- SD1/SD2 - 1
  temp
}
# D2 as a function
D2Fn <- function(pop) {
  ## First sub-population
  pop1 <- pop$pop1; n1 <- length(pop1); m1 <- mean(pop1); v1 <- var(pop1)
  ## Second sub-population
  pop2 <- pop$pop2; n2 <- length(pop2); m2 <- mean(pop2); v2 <- var(pop2)
  ## Pool the variances
  v <- ((n1 - 1) * v1 + (n2 - 1) * v2)/(n1 + n2 - 2)
  ## Determine the t-statistic
  temp <- (m1 - m2) / sqrt(v * ((1/n1) + (1/n2)))
  temp
}
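Because $\mathcal D_2$ is exactly the classic pooled two-sample t-statistic, a useful check is that the hand computation matches `t.test()` with `var.equal = TRUE` (toy values invented here):

```r
# Toy check: the pooled formula in D2 matches t.test(var.equal = TRUE)
x <- c(5, 7, 9, 11); y <- c(1, 2, 3)
n1 <- length(x); n2 <- length(y)
v  <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
d2 <- (mean(x) - mean(y)) / sqrt(v * (1/n1 + 1/n2))

t_stat <- unname(t.test(x, y, var.equal = TRUE)$statistic)
d2 - t_stat   # 0 (up to floating point)
```

The difference here is that the permutation test gets its reference distribution from shuffling rather than from the t distribution, so it does not need the normality assumptions of the classical t-test.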
Here, we generate the histograms of our two discrepancy measures based on 5000 shuffles of the two sub-populations.
par(mfrow = c(1, 2))
# Plot for D1
D1Vals <- sapply(1:5000, FUN = function(...) { D1Fn(mixRandomly(pop)) })
hist(D1Vals, breaks = 40, probability = TRUE,
     main = "Permuted populations", xlab = "D1 statistic")
abline(v = D1Fn(pop), col = "red", lwd = 2)
# Plot for D2
D2Vals <- sapply(1:5000, FUN = function(...) { D2Fn(mixRandomly(pop)) })
hist(D2Vals, breaks = 40, probability = TRUE,
     main = "Permuted populations", xlab = "D2 statistic")
abline(v = D2Fn(pop), col = "red", lwd = 2)
From these histograms, we can see that the two sub-populations of male vs. female final exam grades seem to be a random mix of the parent population when considering the measure of spread, $\mathcal D_1$. On the other hand, they do not seem to be a random mix of the parent population when considering the measure of central tendency, $\mathcal D_2$.
Now, let's test the null hypothesis and check whether our two sub-populations were indeed generated by random mixing, using $\mathcal D_2(\mathcal P_1,\mathcal P_2)$.
D2obs <- D2Fn(pop)
## Two-sided p-value: proportion of shuffles at least as extreme as observed
mean(abs(D2Vals) >= abs(D2obs))
## [1] 0.0054
Since our estimated p-value, 0.0054, lies between 0.001 and 0.01, there is strong evidence against the null hypothesis. In the context of our situation, there is strong evidence that the final marks of male vs. female students are not a random mix of the parent population.