Multiple testing example

Simulate a population with some phenotype

population <- rnorm(100000, mean=20, sd=1)

hist(population)

From our true population, we can draw samples. Imagine you are collecting samples from the wild. You only collect some small subset of all individuals. Imagine we collect two sets of samples of 100 and want to know if they’re different from one another. So we run a t-test.

x <- sample(population, 100, replace = FALSE)
y <- sample(population, 100, replace = FALSE)
t.test(x=x, y=y)$p.value

## [1] 0.7803207

hist(population, freq=F, ylim=c(0,0.6), col="white")
hist(x, col=alpha(alpha=0.3,"red"), add=T, freq=F)
hist(y, col=alpha(alpha=0.3,"blue"), add=T, freq=F)

boxplot(x, population, y, names=c("sample X", "Total population", "sample y"))

As expected, our samples largely overlap and our pvalue is not significant.

But what if we do this a lot of times? Draw 10,000 samples from this population and run 10,000 t-tests?

pval <- rep(NA, 1e4)
sigout <- list()
for(i in 1:1e4)
{
  x <- sample(population, 100, replace = FALSE)
  y <- sample(population, 100, replace = FALSE)
  pval[i] <- t.test(x=x, y=y)$p.value
  if(pval[i] < 0.05){
    sigout[[length(sigout)+1]] <- data.frame(x=x, y=y)
  }
}

sum(pval < 0.05)

## [1] 523

length(pval)

## [1] 10000

sum(pval < 0.05)/10000

## [1] 0.0523

We just pulled random samples from the same population. They should be the same. But 5% of our tests are significant!

Here is one example of a significant set of samples:

# plot histograms
hist(population, freq=F, ylim=c(0,0.6), col="white")
hist(sigout[[1]]$x, col=alpha(alpha=0.3,"red"), add=T, freq=F)
hist(sigout[[1]]$y, col=alpha(alpha=0.3,"blue"), add=T, freq=F)

boxplot(sigout[[1]]$x, population, sigout[[1]]$y, names=c("sample X", "Total population", "sample y"))

Thinking about p-values, this is exactly what we should see. But it means that if we run many tests, we will find positives just by chance. This is why we need to correct for multiple testing.