Once we have aligned reads that we’re happy with, we need to identify variants. There are lots of ways to do this. However, we’re not actually going to run any of them here: variant calling is fairly slow, and you don’t gain much from waiting for things to run. Instead, I already called variants with FreeBayes. We could have also used bcftools, GATK, ANGSD, etc.
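For context only, a minimal FreeBayes command looks roughly like the sketch below. The file names are placeholders, not the exact command used to produce our vcf.

```bash
# hypothetical example: call variants jointly across samples with FreeBayes
# reference.fa and the *.bam file names are placeholders
freebayes -f reference.fa \
    BC-045-001.bam BC-046-002.bam BC-047-003.bam \
    > variants.vcf
```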
What all of these methods have in common is that they look for locations in the genome where there is variation. For example, one individual has a genotype of A/G, the other A/A.
Short SNP calling lecture
For most variant calling software, the output is a VCF file. In our case, we’re only looking at SNPs, but we could also have insertions or deletions (indels). There are a number of tools we can use to work with VCF files. The most common are VCFtools and BCFtools. BCFtools is a bit faster, but the two tools do basically the same things. We’ll be using vcftools. You can run it by simply typing vcftools in the terminal.
Our vcf file is ~/shared_materials/tutorial_files/variants.vcf.

First look at the head of this file.

Look at the entire header with grep. How might you do this?

Now use grep to look at the head of the vcf without the header. Hint, -v finds the inverse of your match.
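If you want to check your approach, one possibility is sketched below. It assumes a standard VCF layout, where header lines start with ## and the column-header line starts with #CHROM.

```bash
# head of the file, header included
head ~/shared_materials/tutorial_files/variants.vcf

# the entire header: every line starting with #
grep "^#" ~/shared_materials/tutorial_files/variants.vcf

# head of the vcf without the ## header lines (-v inverts the match)
grep -v "^##" ~/shared_materials/tutorial_files/variants.vcf | head
```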
The header contains a ton of information. Mostly, these are the definitions of the abbreviations used in the rest of the file, generally how the file was generated, and various other information. The real core of the vcf can be found in the last line of the header.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT BC-045-001 BC-046-002 BC-047-003 …
chr1 11504 chr1:11504 C G 95103.7 . . GT:DP:RO:QR:AO:QA:GL .:.:.:.:.:.:. 0/0:12:12:492:0:0:0,-3.61236,-44.6308 0/0:16:16:629:0:0:0,-4.81648,-56.9351
- CHROM: The name of the chromosome (or scaffold) of the reference genome where the variant is located.
- POS: The 1-based position of the variation on the given sequence.
- ID: The SNP id, many times just the CHR and POS with a : delimiter. Human genomes have more formal naming.
- REF: The base in the reference genome (or bases in the case of an indel), i.e., A, C, G, T.
- ALT: The SNP base. The A, C, G, T that doesn’t match the reference.
- QUAL: A quality score associated with the variant. How confident are we that this is a real variant?
- FILTER: A flag indicating if the variant has failed a filter, or PASS if all the filters were passed successfully.
- INFO: A list of information about each site. These are defined in the header.
- FORMAT: The format of the genotype field for each sample. Also defined in the header.

Each following column is an individual genotype.
Our genotype format is:
GT:DP:RO:QR:AO:QA:GL
Look up these definitions in the header and we will discuss.
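One way to pull those definitions out of the header (assuming they are stored as standard ##FORMAT lines):

```bash
# print the ##FORMAT definitions (GT, DP, RO, QR, AO, QA, GL)
grep "^##FORMAT" ~/shared_materials/tutorial_files/variants.vcf
```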
See here for more info about the file format
I have given you a pre-filtered vcf file. This was mostly done for space, time, and computational reasons.
Filtering is hard and arbitrary. Generally, you need to consider things like sequencing depth, missing data, and variant quality; we will look at each of these below.
Let’s now look at a few quality controls in our vcf file.
Note that I relied on a few existing tutorials to help build this exercise:

- https://speciationgenomics.github.io/filtering_vcfs/
- https://www.ddocent.com/filtering/
Our vcf file is in the shared folder, which you cannot write to. We can do a few things to let you type less.

First, make a filtering directory inside my_materials. Then set two variables:

VCF=~/shared_materials/tutorial_files/variants.vcf
OUT=~/my_materials/filtering/variants

You can then call these variables by entering $VCF or $OUT.
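As a single runnable block (the mkdir step is an assumption based on where the output files are written later):

```bash
# working directory for the filtering output (path assumed from $OUT)
mkdir -p ~/my_materials/filtering

# shortcuts for the input vcf and the output prefix
VCF=~/shared_materials/tutorial_files/variants.vcf
OUT=~/my_materials/filtering/variants

# quick check that the variables point where you expect
echo $VCF
echo $OUT
```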
Your vcf file is relatively small. You can see the file size using ls -lh. It is not uncommon to have a vcf file that is many GB in size.

You can count up the actual variants to get a better idea of your data quality. You should have a pretty good idea how you might do this. Hint, you’ll need to use grep -v to ignore the header, and wc -l to count. Do this now.
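If you need a nudge, something along these lines should work, assuming every header line starts with #:

```bash
# file size, human readable
ls -lh $VCF

# count the variant records: drop all header lines, then count what's left
grep -v "^#" $VCF | wc -l
```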
Mean depth per individual:
vcftools --vcf $VCF --depth --out $OUT
This outputs the file: variants.idepth

Mean depth per site (averaged across individuals):
vcftools --vcf $VCF --site-mean-depth --out $OUT
This outputs the file: variants.ldepth.mean

Proportion of missing data per individual:
vcftools --vcf $VCF --missing-indv --out $OUT
This outputs the file: variants.imiss

Proportion of missing data per site:
vcftools --vcf $VCF --missing-site --out $OUT
This outputs the file: variants.lmiss
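The file list below also includes variants.lqual (per-site quality), which the commands above don’t produce. In vcftools that file comes from the --site-quality option, so it was presumably generated with something like:

```bash
# per-site quality scores; writes variants.lqual
vcftools --vcf $VCF --site-quality --out $OUT
```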
We will plot these results in R using ggplot2. One lesson in genomics is to make a million plots and look at your data in many ways. We want to make sure nothing weird is going on, so we plot everything and think about whether these plots make sense.
As a reminder, we have the following files to look at:
variants.idepth
variants.ldepth.mean
variants.imiss
variants.lmiss
variants.lqual
For all of the following, you need to use tidyverse.
# load tidyverse package
library(tidyverse)
Here we’re looking at the number of reads that have mapped to each position. The output is the mean and variance of this depth across all individuals.
# read in the data
variant_depth <- read_delim("~/my_materials/filtering/variants.ldepth.mean", delim = "\t",
                            col_names = c("chr", "pos", "mean_depth", "variance_depth"), skip = 1)

# plot with ggplot
p <- ggplot(variant_depth, aes(mean_depth)) +
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
# use summary to summarize the output
summary(variant_depth$mean_depth)
Next, we look at the proportion of missing data for each variant.
# read in the per-site missing data
variant_miss <- read_delim("~/my_materials/filtering/variants.lmiss", delim = "\t",
                           col_names = c("chr", "pos", "nchr", "nfiltered", "nmiss", "fmiss"), skip = 1)

# plot the proportion of missing genotypes per site
p <- ggplot(variant_miss, aes(fmiss)) +
  geom_density(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
summary(variant_miss$fmiss)
# read in the mean depth per individual
indiv_depth <- read_delim("~/my_materials/filtering/variants.idepth", delim = "\t",
                          col_names = c("ind", "nsites", "depth"), skip = 1)

# plot mean depth per individual
p <- ggplot(indiv_depth, aes(depth)) +
  geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
summary(indiv_depth$depth)
# read in the proportion of missing data per individual
indiv_miss <- read_delim("~/my_materials/filtering/variants.imiss", delim = "\t",
                         col_names = c("ind", "ndata", "nfiltered", "nmiss", "fmiss"), skip = 1)

# plot the proportion of missing genotypes per individual
p <- ggplot(indiv_miss, aes(fmiss)) +
  geom_histogram(fill = "dodgerblue1", colour = "black", alpha = 0.3) +
  theme_classic()
p
summary(indiv_miss$fmiss)
If we saw issues above, we could add additional filters to our data. You can do this in vcftools. It would look something like this, where you replace $NUM with your threshold. You also need to add --recode when writing a new vcf file, and you choose the name of your filtered vcf with the --out option.

See the vcftools manual for more info.
If we have time, do some additional filtering and check how your results above change.
vcftools --vcf $VCF \
--max-missing $NUM \
--min-meanDP $NUM \
--max-meanDP $NUM \
--minDP $NUM \
--maxDP $NUM \
--recode \
--out filtered_vcf
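As a purely illustrative example, a filtering run might look like the sketch below; the thresholds are arbitrary placeholders, so choose your own from the QC plots above. Note that with --recode, vcftools appends .recode.vcf to the --out prefix, so this command writes filtered_vcf.recode.vcf.

```bash
# thresholds are illustrative only -- pick values based on your QC plots
vcftools --vcf $VCF \
    --max-missing 0.75 \
    --min-meanDP 10 \
    --max-meanDP 100 \
    --minDP 5 \
    --maxDP 100 \
    --recode \
    --out filtered_vcf

# check how many variants survived the filters
grep -v "^#" filtered_vcf.recode.vcf | wc -l
```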