Short lecture on methods to detect population structure.


Estimate diversity

It is usually interesting to understand the genetic diversity of your population. This can provide insight into bottlenecks, selection, and so on.

We will calculate pi for our populations (remember from lecture). There are numerous ways we could do this, but we will use vcftools. One important thing to note about this method is that it assumes all sites not in our vcf file are invariant. That is, it assumes we have sequenced the whole genome, which is not true in our case. For any method that estimates pi, it is very important to know how your vcf is being handled and how invariant sites are treated.

We will use 50kb windows to calculate pi. You could alter this window size, and it would change your results. We’re using a windowed approach because (1) per-site estimates are noisy; (2) we are generally interested in genome-wide pi, which a large window estimates well; and (3) it produces less output, so it is easier to deal with.
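
To make the invariant-sites point concrete, here is a toy calculation (all numbers made up) showing how the choice of denominator changes the estimate:

# toy example: why the invariant-sites assumption matters
# suppose the per-site pi values in one 50kb window sum to 5,
# spread across 100 SNPs
sum_site_pi <- 5
window_size <- 50000
n_snps      <- 100

# vcftools divides by the full window size, i.e. it treats every site
# not in the vcf as invariant
sum_site_pi / window_size   # 1e-04

# dividing by only the variant sites would inflate pi enormously
sum_site_pi / n_snps        # 0.05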

We can first make a new directory where we will run our diversity analyses:


VCF=~/shared_materials/tutorial_files/variants.vcf

# under ~/my_materials
mkdir diversity

cd diversity

In the loop below, we will estimate pi for each population.

for pop in TR PC BC HP GA;
do

vcftools --vcf $VCF \
--keep ~/shared_materials/tutorial_files/${pop}.pop \
--window-pi 50000 \
--out ${pop}_pi_50kb;

done


This loop is saying: for each of our population IDs (TR PC BC HP GA), do the following. Where we have ${pop} in the loop, the population ID of that step is inserted: first TR, then PC, and so on. The --keep option tells vcftools which individuals to keep (specified in a text file), --window-pi sets the window size and tells vcftools to calculate pi, and --out sets our output filename. vcftools will write each population’s estimates to ${pop}_pi_50kb.windowed.pi.


Run the loop above.

Look at your results and we will discuss together.


processing output

We next need to process the output from vcftools. We can do this in R by reading all of our files into a list, then plotting the results.

library(tidyverse)

# First, get the file locations and names.
# set the directory where the files are located. 
dir <- "~/my_materials/diversity"
# generate the paths to the files
files <- file.path(dir, list.files(dir))
# drop the summary files, we only want the actual pi estimates
files <- files[grep("windowed.pi", files)]
# look at your file locations- are they correct?
files

# read in files
# lapply lets us apply the same function to all of our files. It will output a list where each item is one of our files.
d <- lapply(files,
            FUN = function(f) {
              read.table(f, header = TRUE)
            })
# but the list has no names, so we need to fix this:
names(d)
# name each element by its population id (the part of the filename before the first "_")
d <- setNames(d, str_split_fixed(basename(files), "_", 2)[, 1])
names(d)

# you can now access each item in the list in two ways. 
## 1: by index
head(d[[1]])

## 2: by name
head(d[['BC']])

# However, having our dataframes in a list isn't ideal (especially if we want to use ggplot). So we need to merge them together.
## we can use the bind_rows function, which pastes the dataframes together by row and adds an identifier column (named by the .id argument; here, id) based on the names of the list elements.
meltdat <- bind_rows(d, .id="id")

# we now have a big dataframe of all of our data.


How would you check that your bind_rows function worked correctly and your final dataframe is complete?
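
One option, as a minimal sketch: check that the row counts add up and that every population made it into the merged dataframe.

# the merged dataframe should have exactly as many rows as the
# individual dataframes combined
sum(sapply(d, nrow)) == nrow(meltdat)

# and each population id should be present, with a sensible number of windows
table(meltdat$id)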


Pi for each population

library(ggplot2)
library(ggridges)

# we can first plot density plots of the estimates

ggplot( meltdat, aes(x=PI, y=id, fill=id)) + 
  geom_density_ridges(quantile_lines=TRUE, quantiles=2) +
  theme_classic()


# and we can grab means for each
meltdat %>%
  group_by(id) %>%
  summarize(mean = mean(PI), sd = sd(PI), median = median(PI))


Do you think there are differences in pi between the populations?

How might you add more confidence to your conclusion?

What effect do you think window size has on your estimates? Go back and change this value to see what happens.
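
If you want to go beyond eyeballing the density plots, here is a rough sketch of a formal comparison (neighboring windows are not independent, so treat these p-values as optimistic):

# compare the pi distributions across populations
kruskal.test(PI ~ as.factor(id), data = meltdat)

# or pairwise comparisons with a multiple-testing correction
pairwise.wilcox.test(meltdat$PI, meltdat$id, p.adjust.method = "BH")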


PCA

One of the most basic and fundamental tools to understand population structure is principal component analysis, also known as PCA.

The goal of PCA is to represent the main axes of variation in a dataset. You can think of PCA as giving us the “big picture” view of our data.


One important assumption of PCA is that there is no linkage disequilibrium (LD) in our data. In other words, our SNPs need to be independent. We know that in most datasets, SNPs that are close together, or linked due to other causes, are correlated. We therefore need to “prune” for LD. We can use plink to do this.

Plink is a powerful program that was designed for human genomics. This means it assumes our data are fancy, clean, and in very specific formats. Because of this, we need to add additional parameters to our commands, which you’ll see below.

Prune for LD

Run the code below


# first make a new directory called population_structure
## run all calculations below in this directory
cd ~/my_materials
mkdir population_structure
cd population_structure

plink --vcf $VCF \
--indep-pairwise 50 5 0.2 --allow-extra-chr --double-id \
--out variants_pruned

Here, we specify our input file with --vcf and our output file name with --out, while --allow-extra-chr and --double-id help plink deal with our non-human data.

The real action is happening with the --indep-pairwise option. The first value is the window size, in this case 50 SNPs. The second number tells plink how far to move the window each step (5 SNPs), and the third value is the LD threshold (an r-squared value) above which one SNP of a correlated pair is removed. The values we use here are among the most commonly used. See the plink documentation for more information.
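
To make the r-squared threshold concrete, here is a toy sketch (made-up genotypes) of roughly the quantity plink thresholds on: the squared correlation between the genotype dosages (0/1/2) of a pair of SNPs.

# two SNPs coded as 0/1/2 alternate-allele counts across 8 individuals
snp1 <- c(0, 1, 2, 1, 0, 2, 1, 0)
snp2 <- c(0, 1, 2, 2, 0, 2, 1, 0)

# genotype r^2; if this exceeded 0.2, --indep-pairwise would drop one of the pair
cor(snp1, snp2)^2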

This command will output two lists:

  • variants_pruned.prune.in: the set of LD pruned SNPs
  • variants_pruned.prune.out: the SNPs removed from the dataset.

Now that we have a list of the LD-pruned SNPs, we need to make the pruned input files that plink needs.

plink --vcf $VCF \
--extract variants_pruned.prune.in \
--make-bed --out variants_NoLD \
--allow-extra-chr --double-id

Here we tell plink to extract from our VCF file only the LD-pruned variants listed in our variants_pruned.prune.in file. We tell it to make a bed file and the other associated plink-specific files, which gives us the following:

  • variants_NoLD.bed: a binary file that records the genotypes
  • variants_NoLD.bim: a map file with the details of each variant (chromosome, id, position, alleles)
  • variants_NoLD.fam: a fam file with details of the individuals in the dataset.
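
If you are curious what survived pruning, here is a quick sketch in R (the .bim file is plain text, one row per variant; this assumes your R working directory is the population_structure folder):

# peek at the pruned variant map
bim <- read.table("variants_NoLD.bim",
                  col.names = c("chr", "id", "cM", "pos", "allele1", "allele2"))
head(bim)
nrow(bim)   # how many SNPs remain after pruning?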


run the pca


Now that we have our LD pruned plink files, we can run our actual PCA, again in plink.


plink --bfile variants_NoLD \
  --pca --out variants_NoLD_PCA --allow-extra-chr --double-id

Here --bfile says to read in the binary plink files with our chosen prefix, and --pca is hopefully self-explanatory.

This will output two files:

  • variants_NoLD_PCA.eigenval: the variation that is explained by each principal component (PC)
  • variants_NoLD_PCA.eigenvec: the principal component values for each sample for each PC


plot the pca

library(ggplot2)

dat <- read.table("variants_NoLD_PCA.eigenvec", header=F)
eigenval <- read.table("variants_NoLD_PCA.eigenval", header=F)

# first convert to percentage variance explained
# plink outputs 20 PCs by default; nrow() makes this robust to other counts
pve <- data.frame(PC=1:nrow(eigenval), pve=round(eigenval$V1/sum(eigenval$V1)*100,1))

# calculate the cumulative sum of the percentage variance explained
cumsum(pve$pve)

# plot the PC's
a <- ggplot(pve, aes(PC, pve)) + geom_bar(stat = "identity") + 
  ylab("Percentage variance explained") + 
  theme_classic()
a


How many PC’s do you think we should use with our data?


####################
# plot the PCA
####################

# rename our columns, just for ease
colnames(dat) <- c("ID", "ID2", "PC1", "PC2", "PC3", "PC4", colnames(dat)[7:ncol(dat)])

# add a population label:
dat$population <- substr(dat$ID, 1,2)

# plot the PCA

d <- ggplot(dat, aes(PC1, PC2, fill=population)) +
  geom_point(size=4.5, shape=21, color="black") +
  xlab(paste0("PC1: ",pve$pve[1],"% variance")) +
  ylab(paste0("PC2: ",pve$pve[2],"% variance")) +
  theme_bw() +
  scale_fill_manual(values=c("#68228B",'#B22222',"#CD6090","#87CEFA", "#1874CD"))
d

# optional, output your pca to a pdf
#ggsave("pca.pdf",d, w=5, h=3.7)


  • Change the code above to plot PC1 and PC3 (a sketch follows below if you get stuck). What do you find, and do you think this makes sense?
  • What might this tell us about the population structure?
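
If you get stuck on the first bullet, only the y aesthetic and the axis label need to change. A sketch:

# the same plot, but with PC3 on the y-axis
ggplot(dat, aes(PC1, PC3, fill=population)) +
  geom_point(size=4.5, shape=21, color="black") +
  xlab(paste0("PC1: ", pve$pve[1], "% variance")) +
  ylab(paste0("PC3: ", pve$pve[3], "% variance")) +
  theme_bw() +
  scale_fill_manual(values=c("#68228B", '#B22222', "#CD6090", "#87CEFA", "#1874CD"))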


Admixture/STRUCTURE

Another classic and powerful tool, and one of the most popular analyses in population genetics, is STRUCTURE, which was published in 2000 and has since accumulated >36,000 citations. STRUCTURE is a model-based method that infers population structure. The goal is to cluster individuals into hypothetical ancestral populations such that individuals within a population are in Hardy-Weinberg equilibrium and linkage equilibrium. The program iteratively assigns individuals to populations until it converges on the population assignments with the most likely allele frequency compositions.
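
Concretely, a sketch of the model being fit (from Pritchard et al. 2000): each individual $i$ has ancestry proportions $q_{ik}$, the fraction of its genome drawn from population $k$ (with $\sum_k q_{ik} = 1$), and each allele copy is modeled as

$$P(\text{allele } a \text{ at locus } \ell \text{ in individual } i) = \sum_{k=1}^{K} q_{ik} \, p_{k\ell a},$$

where $p_{k\ell a}$ is the frequency of allele $a$ at locus $\ell$ in population $k$. The matrix of $q_{ik}$ values (the Q matrix) is what we will plot below.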

However, STRUCTURE is slow, and numerous other methods have been developed that do the same thing much faster. One is ADMIXTURE, which we will use here; see the ADMIXTURE manual for more details.

Running ADMIXTURE

ADMIXTURE requires LD-pruned variants in plink format, which we already generated above for our PCA.

Running admixture is relatively easy and short:


admixture --cv variants_NoLD.bed 2 | tee log2.out

Here, we call the program with admixture, we use --cv to say we want to calculate the cross-validation score (we’ll get to this later), we specify our bed file, and we tell the program how many populations (K) to assume. The | tee prints the output to the screen while also saving it to a file. ADMIXTURE will also write its ancestry estimates to variants_NoLD.2.Q (and the corresponding allele frequencies to variants_NoLD.2.P); we will read the .Q files into R below.

I’m sure you remember from lecture, but we need to run this for multiple values of K in order to determine the most likely number of populations.


What are sensible values of K to run?


We could just copy the code above for multiple K’s, but this can be done more elegantly below:


for K in 1 2 3 4 5;
do
admixture --cv variants_NoLD.bed $K | tee log${K}.out;
done

and we need to grab the cross validation scores, to determine the most likely K:


grep -h CV log*.out | cut -f 3- -d " " > cv.txt

grep finds the lines containing CV in our log files and prints them (-h suppresses the file names). We then use cut to drop the first two space-delimited columns so the result is easier to plot; each line of cv.txt should look something like (K=1): 0.5. We will come back to these shortly.


### plot the ADMIXTURE results

library(tidyverse)

samplelist <- read_tsv("~/shared_materials/tutorial_files/indivs.txt",
                       col_names = "individual")

# we could read in one data frame at a time:
read_delim("~/my_materials/population_structure/variants_NoLD.2.Q",
           col_names = paste0("Q", 1:2),
           delim=" ")


# read in all date, in a loop
## first create an empty dataframe
all_data <- tibble(sample=character(),
                   k=numeric(),
                   Q=character(),
                   value=numeric())

# then add all results to this
for (k in 1:5){
  data <- read_delim(paste0("~/my_materials/population_structure/variants_NoLD.",k,".Q"),
                     col_names = paste0("Q", 1:k),
                     delim=" ")
  data$sample <- samplelist$individual
  data$k <- k
  #This step converts from wide to long.
  data %>% gather(Q, value, -sample,-k) -> data
  all_data <- rbind(all_data,data)
}
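
A quick sanity check on the loop, as a sketch: each value of k contributes one row per sample per ancestry component.

# total rows should be n_samples * (1 + 2 + 3 + 4 + 5)
nrow(all_data) == nrow(samplelist) * sum(1:5)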

# add the population label
all_data$population <- substr(all_data$sample, 1, 2)
all_data$population <- factor(all_data$population, 
                              levels=c("GA", "HP", "BC", "PC", "TR"))

# the sample order in our vcf doesn't match geography. let's re-order the samples from south to north. 
orderlist <- read_tsv("~/shared_materials/tutorial_files/population_order.txt",
                      col_names = "sample")
all_data$sample<-factor(all_data$sample,levels=orderlist$sample)

# first, only plot k=2

all_data %>%
  filter(k == 2) %>%
  ggplot(.,aes(x=sample,y=value,fill=factor(Q))) + 
  geom_rug(aes(x=sample, y=value, color=population)) +
  geom_bar(stat="identity",position="stack") +
  xlab("Sample") + ylab("Ancestry") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_fill_brewer(palette="Set1",name="K",                    
                    labels=c("1","2"))


Stop, and we will discuss this result


# plot all k values.
p <-  ggplot(all_data,aes(x=sample,y=value,fill=factor(Q))) + 
  geom_bar(stat="identity",position="stack") +
  geom_rug(aes(x=sample, color=population), inherit.aes=F) +
  xlab("Sample") + ylab("Ancestry") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_fill_brewer(palette="Set1",name="K",
                    labels=1:5) +
  facet_wrap(~k,ncol=1)
p
# ggsave("Admixture_plot.pdf", p, width = 7, height = 15, units="in")

### what is our most likely k?

# read in CV scores:
cvin <- read.csv("~/my_materials/population_structure/cv.txt", sep=":", header=F)
colnames(cvin) <- c("id", "cv")
# fix the formatting to get K into numeric format
cvin$K <- as.numeric(substr(cvin$id, 4, 4))

# plot the results; the K with the lowest CV error is the best supported
ggplot(cvin,aes(x=K,y=cv)) +
  geom_point(size=3)  + geom_line(group=1)


  • What is our most likely K?
  • Now we can talk biology… does any of this make sense?


Fst and isolation by distance

We can also calculate genome-wide Fst to understand how our populations may be structured. We could do this with vcftools, but we’ll use the R package snpR instead. Note that if you have a large vcf file, you’ll need a lot of memory to run this in R; in that case, you’d be better off with vcftools.

library(snpR)

# read in data file:
dat <- read_vcf("~/shared_materials/tutorial_files/variants.vcf")

# add meta data information:
## population
sample_meta <- data.frame(pop = substr(colnames(dat), 1, 2))
## order the population
sample_meta$pop <- factor(sample_meta$pop, levels=c("GA", "HP", "BC", "PC", "TR")) 

# assign meta data to dat
sample.meta(dat) <- sample_meta

# what does the actual data look like?
genotypes(dat)[1:6, 1:6]

# calculate fst between the populations
my.dat <- calc_pairwise_fst(dat, facets="pop", method = "WC")

Now that we’ve estimated Fst, we can look at our results:

# first look at the results, just the head
head(get.snpR.stats(my.dat, facets = "pop", stats = "fst"))

# heatmap of the fst estimates:
plot_pairwise_fst_heatmap(my.dat, facets="pop")


Again, does this make sense biologically?


We can also ask if our populations follow isolation by distance.

# pull out the fst estimates first:
fst_out <- get.snpR.stats(my.dat, facets = "pop", stats = "fst")$weighted.means

# I have pre-calculated the distances between the populations, found in the following file.
distances <- read.csv("~/shared_materials/tutorial_files/distances.csv", header=T)
## the distances are in km

# merge our dataframes:
dist.df <- merge(distances, fst_out, by="subfacet")
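
Because merge() silently drops rows whose subfacet labels don't match, it is worth a quick check (a sketch):

# with 5 populations we expect choose(5, 2) = 10 pairwise comparisons
choose(5, 2)
nrow(dist.df)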


Plot your data to see if there is isolation by distance.


p <-  ggplot(dist.df,aes(x=distance,y=weighted_mean_fst)) + 
  geom_point() +
  theme_classic()
p

# and we can add the population labels
p <- p + geom_text(label=dist.df$subfacet)
p


How would you add statistical strength to your conclusion?


# then add a regression line:
p + geom_smooth(method=lm)
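
One simple answer to the statistics question above, as a sketch (with the caveat that pairwise Fst values share populations and so are not independent; a Mantel test is the more standard choice for isolation by distance):

# correlation between geographic distance and Fst
cor.test(dist.df$distance, dist.df$weighted_mean_fst)

# or inspect the regression fit directly
summary(lm(weighted_mean_fst ~ distance, data = dist.df))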


I hope you’ll see that these results look a little funny. Can you explain what is going on here?