Short lecture on methods to detect population structure.
It is usually interesting to understand the genetic diversity of your population. This can provide insight into bottlenecks, selection, and so on.
We will calculate pi for our populations (remember from lecture). There are numerous ways we could do this, but we will use vcftools. One important thing to note about this method is that it assumes all sites not in our VCF file are invariant. That is, there is an assumption that we have sequenced the whole genome, which is not true in our case. For any method that estimates pi, it is very important to know how your VCF is being handled and how invariant sites are treated.
We will use 50kb windows to calculate pi. You could alter this window size and it would change your results. We're using a windowed approach because: 1. per-site estimates are noisy; 2. we are generally interested in genome-wide pi, which a large window estimates well; 3. it produces less output, so it is easier to work with.
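To see concretely what the invariant-sites assumption means for a windowed estimate, here is a rough sketch in R (not how vcftools is implemented, just the idea, with made-up numbers): the per-site diversity values at the variant sites are summed and divided by the full window length, so any position missing from the VCF effectively contributes zero.

# rough sketch only: windowed pi = (sum of per-site pi at variant sites) / (window length)
per_site_pi <- c(0.31, 0.12, 0.45)  # hypothetical per-site pi at three variant sites in one window
window_size <- 50000                # the remaining sites are treated as invariant (pi = 0)
sum(per_site_pi) / window_size      # the windowed estimate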
We can first make a new directory where we will analyze our diversity data:
VCF=~/shared_materials/tutorial_files/variants.vcf
# under ~/my_materials
mkdir diversity
cd diversity
In the loop below, we will estimate pi for each population.
for pop in TR PC BC HP GA;
do
    vcftools --vcf $VCF \
        --keep ~/shared_materials/tutorial_files/${pop}.pop \
        --window-pi 50000 \
        --out ${pop}_pi_50kb;
done
This loop is saying: for each of our population IDs (TR PC BC HP PL GA), do the following. Wherever we have ${pop} in the loop, the population ID of that iteration is inserted: first TR, then PC, and so on. The --keep option tells vcftools which individuals to keep (specified in a text file), --window-pi sets the window size and tells vcftools to calculate pi, and the --out option sets our output filename.
Run the loop above.
Look at your results and we will discuss together.
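If you want a quick look at the raw output before moving to R, something like the sketch below should work (this assumes the output files are named with your --out prefix plus a .windowed.pi suffix, which is what the R code below also expects):

# peek at the first few windows for one population
head TR_pi_50kb.windowed.pi
# count how many windows were estimated for each population
wc -l *_pi_50kb.windowed.pi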
We next need to process the output from vcftools. We can do this in R by reading all of our files into R in a list, then plotting the results.
library(tidyverse)

# First, get the file locations and names.
# set the directory where the files are located.
dir <- "~/my_materials/diversity"
# generate the paths to the files
files <- file.path(dir, list.files(dir))
# drop the summary files, we only want the actual pi estimates
files <- files[grep("windowed.pi", files)]
# look at your file locations - are they correct?
files

# read in files
# lapply lets us apply the same function to all of our files. It will output a list where each item is one of our files.
d <- lapply(files,
            FUN = function(files) {
              read.table(files, header = TRUE)
            })
# but the list has no names, so we need to fix this:
names(d)
d <- setNames(d, (str_split_fixed(files, "/", 5)[,4] %>% str_split_fixed("[_]", 3))[,1])
names(d)
# you can now access each item in the list in two ways.
## 1: by index
head(d[[1]])
## 2: by name
head(d[['BC']])
# However, if you know R at all, having dataframes in separate list elements isn't super convenient (especially if we want to use ggplot). So we need to merge them together.
## we can use the bind_rows function, which pastes the dataframes together by row and adds an id column (named by .id) based on the names of the dataframes.
meltdat <- bind_rows(d, .id="id")
# we now have a big dataframe of all of our data.
How would you check that your bind_rows function worked correctly and your final dataframe is complete?
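One possible set of checks (a sketch, not the only way): the merged dataframe should have as many rows as all of the list elements combined, and every population id should be present.

# one possible sanity check on the merged dataframe
nrow(meltdat) == sum(sapply(d, nrow))  # should be TRUE
table(meltdat$id)                      # number of windows per population
summary(meltdat$PI)                    # any unexpected NAs?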
library(ggplot2)
library(ggridges)
# we can first plot density plots of the estimates
ggplot( meltdat, aes(x=PI, y=id, fill=id)) +
geom_density_ridges(quantile_lines=TRUE, quantiles=2) +
theme_classic()
# and we can grab means for each
meltdat %>%
  group_by(id) %>%
  summarize(mean = mean(PI), sd = sd(PI), median = median(PI))
Do you think there are differences in pi between the populations?
How might you add more confidence to your conclusion?
What effect do you think window size has on your estimates? Go back and change this value to see what happens.
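If you want one way to add confidence to your comparison of pi between populations (one of the questions above), here is a sketch of one option among many: treat the windowed estimates as replicates and run pairwise non-parametric tests. Keep in mind that windows are not fully independent, so take the p-values with a grain of salt.

# sketch: pairwise Wilcoxon tests on windowed pi, with multiple-testing correction
pairwise.wilcox.test(meltdat$PI, meltdat$id, p.adjust.method = "BH")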
One of the most basic and fundamental tools to understand population structure is principal component analysis, also known as PCA.
The goal of PCA is to represent the main axes of variation in a dataset. You can think of PCA as giving us the "big picture" view of our data.
One important assumption of a PCA is that there is no linkage in our data. In other words, our SNPs need to be independent. We know that in most datasets, SNPs that are close together, or linked for other reasons, are correlated, so we need to "prune" for LD. We can use plink to do this.
Plink is a powerful program that was designed for human genomics. This means that it assumes our data are fancy and clean and in very specific formats. Because of this, we need to add additional parameters to our commands, which you'll see below.
Run the code below
# first make a new directory called population_structure
## run all calculations below in this directory
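# assuming you are working under ~/my_materials (this matches the file paths used later in the tutorial):
mkdir ~/my_materials/population_structure
cd ~/my_materials/population_structure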
plink --vcf $VCF \
--allow-extra-chr --double-id \
--indep-pairwise 50 5 0.2 --out variants_pruned
Here, we specify our input file with --vcf, our output file name with --out, and --allow-extra-chr and --double-id help deal with our non-human data.
The real action is happening with the --indep-pairwise command. The first value is the window size, in this case 50 SNPs. The second number tells us how far to move the window each step (5 SNPs), and the third value is the LD threshold (r-squared) above which a SNP is removed. The values we use here are the most commonly used. See here for more information.
This command will output two lists:
- variants_pruned.prune.in: the set of LD-pruned SNPs
- variants_pruned.prune.out: the SNPs removed from the dataset

Now that we have the list of LD-pruned SNPs, we need to make the pruned file that plink needs.
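As an optional quick check, you can count how many SNPs ended up in each list (wc -l simply counts lines):

wc -l variants_pruned.prune.in variants_pruned.prune.out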
plink --vcf $VCF \
    --extract variants_pruned.prune.in --out variants_NoLD \
    --make-bed --double-id --allow-extra-chr
Here we tell plink to extract from our VCF file only the LD-pruned variants listed in our variants_pruned.prune.in file. We tell it to make a bed file and the other associated files that are specific to plink, which gives us the following:
- variants_NoLD.bed: a binary file that records genotypes in 0s and 1s
- variants_NoLD.bim: a map file that gives the details of the variants
- variants_NoLD.fam: a fam file that gives details of the individuals in the dataset

Now that we have our LD-pruned plink files, we can run our actual PCA, again in plink.
plink --bfile variants_NoLD \
--pca --out variants_NoLD_PCA --allow-extra-chr --double-id
Here, --bfile says to read in the plink files in binary format with our specific name prefix, and --pca is hopefully self-explanatory.
This will output two files:
- variants_NoLD_PCA.eigenval: the variation that is explained by each principal component (PC)
- variants_NoLD_PCA.eigenvec: the principal component values for each sample for each PC

library(ggplot2)
dat <- read.table("variants_NoLD_PCA.eigenvec", header=F)
eigenval <- read.table("variants_NoLD_PCA.eigenval", header=F)

# first convert to percentage variance explained
pve <- data.frame(PC=1:20, pve=round(eigenval$V1/sum(eigenval$V1)*100, 1))
# calculate the cumulative sum of the percentage variance explained
cumsum(pve$pve)
# plot the PC's
a <- ggplot(pve, aes(PC, pve)) + geom_bar(stat = "identity") +
  ylab("Percentage variance explained") +
theme_classic()
a
How many PCs do you think we should use with our data?
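If you want a rough numeric heuristic to go with the scree plot (a sketch only; the threshold below is arbitrary), you could ask how many PCs are needed to reach a given share of the variance:

# smallest number of PCs explaining at least 50% of the total variance (arbitrary threshold)
which(cumsum(pve$pve) >= 50)[1]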
####################
# plot the PCA
####################
# rename our columns, just for ease
colnames(dat) <- c("ID", "ID2", "PC1", "PC2", "PC3", "PC4", colnames(dat)[7:ncol(dat)])
# add a population label:
dat$population <- substr(dat$ID, 1, 2)

# plot the PCA
d <- ggplot(dat, aes(PC1, PC2, fill=population)) +
  geom_point(size=4.5, shape=21, color="black") +
xlab(paste0("PC1: ",pve$pve[1],"% variance")) +
ylab(paste0("PC2: ",pve$pve[2],"% variance")) +
theme_bw() +
scale_fill_manual(values=c("#68228B",'#B22222',"#CD6090","#87CEFA", "#1874CD"))
d
# optional, output your pca to a pdf
#ggsave("pca.pdf",d, w=5, h=3.7)
Another classic, powerful, and extremely popular analysis in population genetics is STRUCTURE, which was published in 2000 and has since accumulated >36,000 citations. STRUCTURE is a model-based method that infers population structure. The goal is to cluster individuals into hypothetical ancestral populations such that individuals within a population are in Hardy-Weinberg equilibrium and linkage equilibrium. The program iteratively assigns individuals to populations until it converges on the population assignments with the most likely allele frequency compositions.
However, STRUCTURE is slow, and numerous other methods have been developed that do the same thing much faster. One is ADMIXTURE, which we will use here. See the manual here.
ADMIXTURE requires LD-pruned variants in plink format, which we already generated above for our PCA.
Running admixture is relatively easy and short:
admixture --cv variants_NoLD.bed 2 | tee log_2.out
Here, we call the program with admixture, we use --cv to say we want to calculate the cross-validation score (we'll get to this later), we specify our bed file, and we tell the program how many populations (K) to assume. The | and tee just send the stdout to a file that we can save.
I’m sure you remember from lecture, but we need to run this for multiple values of K in order to determine the most likely number of populations.
What are sensible values of K to run?
We could just copy the code above for multiple K’s, but this can be done more elegantly below:
for K in 1 2 3 4 5; \
do admixture --cv variants_NoLD.bed $K | tee log${K}.out; done
and we need to grab the cross validation scores, to determine the most likely K:
grep -h CV log*.out | cut -f 3- -d " " > cv.txt
Here, grep finds the lines containing CV in our log files and prints them. We then drop the first few columns (with cut) so it is easier to plot next. We will come back to these values shortly.
### plot the ADMIXTURE results
library(tidyverse)
samplelist <- read_tsv("~/shared_materials/tutorial_files/indivs.txt",
                       col_names = "individual")
# we could read in one data frame at a time:
read_delim("~/my_materials/population_structure/variants_NoLD.2.Q",
col_names = paste0("Q",seq(1:2)),
delim=" ")
# read in all date, in a loop
## first create an empty dataframe
all_data <- tibble(sample=character(),
                   k=numeric(),
                   Q=character(),
                   value=numeric())
# then add all results to this
for (k in 1:5){
  data <- read_delim(paste0("~/my_materials/population_structure/variants_NoLD.", k, ".Q"),
                     col_names = paste0("Q", seq(1:k)),
                     delim=" ")
  data$sample <- samplelist$individual
  data$k <- k
  # This step converts from wide to long.
  data %>% gather(Q, value, -sample, -k) -> data
  all_data <- rbind(all_data, data)
}
# add the population label
all_data$population <- substr(all_data$sample, 1, 2)
all_data$population <- factor(all_data$population,
                              levels=c("GA", "PL", "HP", "BC", "PC", "TR"))

# our orders are off in our vcf. let's re-order these from south to north.
orderlist <- read_tsv("~/shared_materials/tutorial_files/population_order.txt",
                      col_names = "sample")
all_data$sample <- factor(all_data$sample, levels=orderlist$sample)
# first, only plot k=2
all_data %>%
  filter(k == 2) %>%
ggplot(.,aes(x=sample,y=value,fill=factor(Q))) +
geom_rug(aes(x=sample, y=value, color=population)) +
geom_bar(stat="identity",position="stack") +
xlab("Sample") + ylab("Ancestry") +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_fill_brewer(palette="Set1",name="K",
labels=c("1","2"))
Stop here, and we will discuss this result.
# plot all k values.
p <- ggplot(all_data, aes(x=sample, y=value, fill=factor(Q))) +
  geom_bar(stat="identity", position="stack") +
geom_rug(aes(x=sample, color=population), inherit.aes=F) +
xlab("Sample") + ylab("Ancestry") +
theme_bw() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_fill_brewer(palette="Set1",name="K",
labels=seq(1:5)) +
facet_wrap(~k,ncol=1)
p
# ggsave("Admixture_plot.pdf", p, width = 7, height = 15, units="in")
### what is our most likely k?
# read in CV scores:
cvin <- read.csv("~/my_materials/population_structure/cv.txt", sep=":", header=F)
colnames(cvin) <- c("id", "cv")
# fix the formatting to get K into numeric format
cvin$K <- as.numeric(substr(cvin$id, 4, 4))
# plot the results
ggplot(cvin,aes(x=K,y=cv)) +
geom_point(size=3) + geom_line(group=1)
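The K with the lowest cross-validation error is commonly taken as the best-supported value, so a quick way to pull it out of the table is:

# K with the lowest CV error (interpret alongside the admixture barplots, not in isolation)
cvin$K[which.min(cvin$cv)]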
We can also calculate genome-wide Fst to understand how our populations may be structured. We could do this with vcftools, but we'll use the R package snpR instead. Note that if you have a large VCF file, you'll need a lot of memory to run this in R; in that case you'd be better off with vcftools.
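For reference, a single pairwise comparison in vcftools would look roughly like the sketch below (this assumes the same per-population .pop sample lists we used for pi; it is not part of the tutorial workflow, just shown for comparison):

# sketch: genome-wide Weir & Cockerham Fst between two populations with vcftools
vcftools --vcf $VCF \
    --weir-fst-pop ~/shared_materials/tutorial_files/BC.pop \
    --weir-fst-pop ~/shared_materials/tutorial_files/PC.pop \
    --out BC_PC_fst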
library(snpR)
# read in data file:
dat <- read_vcf("~/shared_materials/tutorial_files/variants.vcf")
# add meta data information:
## population
sample_meta <- data.frame(pop = substr(colnames(dat), 1, 2))
## order the population
sample_meta$pop <- factor(sample_meta$pop, levels=c("GA", "HP", "BC", "PC", "TR"))
# assign meta data to dat
sample.meta(dat) <- sample_meta
# what does the actual data look like?
genotypes(dat)[1:6, 1:6]
# calculate fst between the populations
my.dat <- calc_pairwise_fst(dat, facets="pop", method = "WC")
Now that we’ve estimated Fst, we can look at our results:
# first look at the results, just the head
head(get.snpR.stats(my.dat, facets = "pop", stats = "fst"))
# heatmap of the fst estimates:
plot_pairwise_fst_heatmap(my.dat, facets="pop")
Again, does this make sense biologically?
We can also ask if our populations follow isolation by distance.
# pull out the fst estimates first:
fst_out <- get.snpR.stats(my.dat, facets = "pop", stats = "fst")$weighted.means

# I have pre-calculated the distances between the populations, found in the following file.
distances <- read.csv("~/shared_materials/tutorial_files/distances.csv", header=T)
## the distances are in km
# merge our dataframes:
dist.df <- merge(distances, fst_out, by="subfacet")
Plot your data to see if there is isolation by distance.
p <- ggplot(dist.df, aes(x=distance, y=weighted_mean_fst)) +
  geom_point() +
theme_classic()
p
# and we can add the population labels
p <- p + geom_text(label=dist.df$subfacet)
p
How would you add statistical strength to your conclusion?
# then add a regression line:
p + geom_smooth(method=lm)
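To attach a statistic to the relationship rather than just a fitted line, one option (a sketch only; pairwise Fst values are not independent, so a Mantel test on the full distance matrices would be more rigorous) is a simple correlation test:

# simple correlation between distance and Fst; interpret cautiously (non-independent pairs)
cor.test(dist.df$distance, dist.df$weighted_mean_fst, method = "spearman")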
I hope you’ll see that these results look a little funny. Can you explain what is going on here?