Once we have aligned reads that we’re happy with, we need to identify variants. There are lots of ways to do this. However, we’re not actually going to run any of them here: variant calling is fairly slow, and you don’t gain much from waiting for things to run. Instead, I already called variants with FreeBayes. We could have also used bcftools, GATK, ANGSD, etc.
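For context only, a minimal FreeBayes command looks roughly like the sketch below. The file names are placeholders, not the exact command used to produce our vcf.

```bash
# hypothetical example: call variants jointly across samples with FreeBayes
# reference.fa and the *.bam file names are placeholders
freebayes -f reference.fa \
    BC-045-001.bam BC-046-002.bam BC-047-003.bam \
    > variants.vcf
```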
What all of these methods have in common is that they look for locations in the genome where there is variation. For example, one individual has a genotype of A/G, the other A/A.
Short SNP calling lecture
For most variant calling software, the output is a VCF file. In our case, we’re only looking at SNPs, but we could also have insertions or deletions (indels). There are a number of tools we can use to work with VCF files. The most common are VCFtools and BCFtools. BCFtools is a bit faster, but the two tools do basically the same things. We’ll be using vcftools. You can run it by simply typing vcftools in the terminal.
Our vcf file is ~/shared_materials/tutorial_files/variants.vcf.

First look at the head of this file.

Look at the entire header with grep. How might you do this?

Now use grep to look at the head of the vcf without the header. Hint, -v finds the inverse of your match.
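If you want to check your approach, one possibility is sketched below. It assumes a standard VCF layout, where header lines start with ## and the column-header line starts with #CHROM.

```bash
# head of the file, header included
head ~/shared_materials/tutorial_files/variants.vcf

# the entire header: every line starting with #
grep "^#" ~/shared_materials/tutorial_files/variants.vcf

# head of the vcf without the ## header lines (-v inverts the match)
grep -v "^##" ~/shared_materials/tutorial_files/variants.vcf | head
```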
The header contains a ton of information. Mostly, these are the definitions of the abbreviations used in the rest of the file, generally how the file was generated, and various other information. The real core of the vcf can be found in the last line of the header.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BC-045-001 BC-046-002 BC-047-003 …
chr1 11504 chr1:11504 C G 95103.7 . . GT:DP:RO:QR:AO:QA:GL .:.:.:.:.:.:. 0/0:12:12:492:0:0:0,-3.61236,-44.6308 0/0:16:16:629:0:0:0,-4.81648,-56.9351
- CHROM: The name of the chromosome (or scaffold) of the reference genome where the variant is located.
- POS: The 1-based position of the variation on the given sequence.
- ID: The SNP id, many times just the CHR and POS with a : delimiter. Human genomes have more formal naming.
- REF: The base in the reference genome (or bases in the case of an indel), i.e., A, C, G, T.
- ALT: The SNP base. The A, C, G, T that doesn’t match the reference.
- QUAL: A quality score associated with the variant. How confident are we that this is a real variant?
- FILTER: A flag indicating if the variant has failed a filter, or PASS if all the filters were passed successfully.
- INFO: A list of information about each site. These are defined in the header.
- FORMAT: The format of the genotype field for each sample. Also defined in the header.

Each following column is an individual genotype.
Our genotype format is:
GT:DP:RO:QR:AO:QA:GL
Look up these definitions in the header and we will discuss.
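One way to pull those definitions out of the header (assuming they are stored as standard ##FORMAT lines):

```bash
# print the ##FORMAT definitions (GT, DP, RO, QR, AO, QA, GL)
grep "^##FORMAT" ~/shared_materials/tutorial_files/variants.vcf
```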
See here for more info about the file format
I have given you a pre-filtered vcf file. This was mostly done for space, time, and computational reasons.
Filtering is hard and arbitrary. Generally, you need to consider things like sequencing depth, missing data, and variant quality; we will look at each of these below.
Let’s now look at a few quality controls in our vcf file.
Note that I relied on a few existing tutorials to help build this exercise:

- https://speciationgenomics.github.io/filtering_vcfs/
- https://www.ddocent.com/filtering/
Our vcf file is in the shared folder, which you cannot write to. We can do a few things to let you type less.

First, make a filtering directory inside my_materials. Then set two variables:

VCF=~/shared_materials/tutorial_files/variants.vcf
OUT=~/my_materials/filtering/variants

You can then call these variables by entering $VCF or $OUT.
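As a single runnable block (the mkdir step is an assumption based on where the output files are written later):

```bash
# working directory for the filtering output (path assumed from $OUT)
mkdir -p ~/my_materials/filtering

# shortcuts for the input vcf and the output prefix
VCF=~/shared_materials/tutorial_files/variants.vcf
OUT=~/my_materials/filtering/variants

# quick check that the variables point where you expect
echo $VCF
echo $OUT
```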
Your vcf file is relatively small. You can see the file size using ls -lh. It is not uncommon to have a vcf file that is many GB in size.

You can count up the actual variants to get a better idea of your data quality. You should have a pretty good idea how you might do this. Hint, you’ll need to use grep -v to ignore the header, and wc -l to count. Do this now.
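If you need a nudge, something along these lines should work, assuming every header line starts with #:

```bash
# file size, human readable
ls -lh $VCF

# count the variant records: drop all header lines, then count what's left
grep -v "^#" $VCF | wc -l
```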
Mean depth per individual:
vcftools --vcf $VCF --depth --out $OUT
This outputs the file: variants.idepth

Mean depth per site (averaged across individuals):
vcftools --vcf $VCF --site-mean-depth --out $OUT
This outputs the file: variants.ldepth.mean

Proportion of missing data per individual:
vcftools --vcf $VCF --missing-indv --out $OUT
This outputs the file: variants.imiss

Proportion of missing data per site:
vcftools --vcf $VCF --missing-site --out $OUT
This outputs the file: variants.lmiss
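The file list below also includes variants.lqual (per-site quality), which the commands above don’t produce. In vcftools that file comes from the --site-quality option, so it was presumably generated with something like:

```bash
# per-site quality scores; writes variants.lqual
vcftools --vcf $VCF --site-quality --out $OUT
```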
We will plot these results in R using ggplot2. One lesson in genomics is to make a million plots and look at your data in many ways. We want to make sure nothing weird is going on, so we plot everything and think about whether these plots make sense.
As a reminder, we have the following files to look at:
variants.idepth
variants.ldepth.mean
variants.imiss
variants.lmiss
variants.lqual
For all of the following, you need to use tidyverse.
# load tidyverse package
library(tidyverse)
Here we’re looking at the number of reads that have mapped to each position. The output is the mean and variance of this depth across all individuals.
# read in the data
variant_depth <- read_delim("~/my_materials/filtering/variants.ldepth.mean", delim = "\t",
                            col_names = c("chr", "pos", "mean_depth", "variance_depth"), skip = 1)

# plot with ggplot
p <- ggplot(variant_depth, aes(mean_depth)) +
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
# use summary to summarize the output
summary(variant_depth$mean_depth)
Next, we look at the proportion of missing data for each variant.
# read in the per-site missing data
variant_miss <- read_delim("~/my_materials/filtering/variants.lmiss", delim = "\t",
                           col_names = c("chr", "pos", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1)

# plot the proportion of missing genotypes per site
p <- ggplot(variant_miss, aes(fmiss)) +
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
summary(variant_miss$fmiss)
# read in the mean depth per individual
indiv_depth <- read_delim("~/my_materials/filtering/variants.idepth", delim = "\t",
                          col_names = c("ind", "nsites", "depth"), skip = 1)

# plot mean depth per individual
p <- ggplot(indiv_depth, aes(depth)) +
  geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
summary(indiv_depth$depth)
# read in the proportion of missing data per individual
indiv_miss <- read_delim("~/my_materials/filtering/variants.imiss", delim = "\t",
                         col_names = c("ind", "ndata", "nfiltered", "nmiss", "fmiss"), skip = 1)

# plot the proportion of missing genotypes per individual
p <- ggplot(indiv_miss, aes(fmiss)) +
  geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
summary(indiv_miss$fmiss)
If we saw issues above, we could add additional filters to our data. You can do this in vcftools. It would look something like this, where you replace $NUM with your threshold. You also need to add --recode when writing a new vcf file, and you choose the name of your filtered vcf with the --out option.

See the vcftools manual for more info.
If we have time, do some additional filtering and check how your results above change.
vcftools --vcf $VCF \
--max-missing $NUM \
--min-meanDP $NUM \
--max-meanDP $NUM \
--minDP $NUM \
--maxDP $NUM \
--recode \
--out filtered_vcf
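As a purely illustrative example, a filtering run might look like the sketch below; the thresholds are arbitrary placeholders, so choose your own from the QC plots above. Note that with --recode, vcftools appends .recode.vcf to the --out prefix, so this command writes filtered_vcf.recode.vcf.

```bash
# thresholds are illustrative only -- pick values based on your QC plots
vcftools --vcf $VCF \
    --max-missing 0.75 \
    --min-meanDP 10 \
    --max-meanDP 100 \
    --minDP 5 \
    --maxDP 100 \
    --recode \
    --out filtered_vcf

# check how many variants survived the filters
grep -v "^#" filtered_vcf.recode.vcf | wc -l
```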