When the data comes back from the sequencer, it is one giant text file in fastq format that includes all data from your lane of sequencing. Remember about barcodes- we can use these to demultiplex our data (or really the sequencing center will do it for you, depending on your data type).
In general, you’ll take some fairly consistent steps, with a lot of options, to process your data so you can get to the point of calling snps. These are:
You can find the data for the tutorial in the directory
~/shared_materials/tutorial_files
For our purposes will use just one sample to learn about data
processing. This is sample, TR-023
, which is randomly
chosen. After demultiplexing, we have two files associated with this
sample: TR-023_RA.fastq.gz
and
TR-023_RB.fastq.gz
Why do we have two files for one sample?
These files are in fastq format, shown below:
@J00113:190:HFV3LBBXX:7:1101:3742:1297 1:N:0:NAGCTT
TGCAGGCTCAGTCCTGATGAGGGAGCCATCTTTATAGAAAGCAGCTGGGAGGTTGGAGGGAGCGGTCTTTGTGTGACAGCTCAGAGTGACGGCAGCTCCCTCCATCACATGGAGTACAGGATTCTGCATGATTTCTGATC
+
FJJJJFJJJFJFJJJJAJJ-FJJFFJJJAAAAF-<AJFFFFJJJFAFFF<A7-<FFAJJJJ<JAJ7A<AJFJJJJJJJJJA-FAJA-AAFF<AF-F--AF7FAJJ77-7JJJA-7--AJJA-AFFF-----<--7--777
@J00113:190:HFV3LBBXX:7:1101:13738:1314 1:N:0:NAGCTT
TGCAGGACTCGTTGTTGATGAGGAAATGGTCACCACATGCTGACATTGTGGGTAAGCCAGTATTTCAAATAGTGGTTCCCAGAAAACTCCGTTCTGCAGTGTGAAAAGTTGCTCTTGATGAGTCTGGTCATTCTGGTGTA
+
JJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJ<FJ7FJJJFJJJJJJJFJJJJJJ-<FFFJJJ-A-F)7<J-7<-<F-AFJ
@J00113:190:HFV3LBBXX:7:1101:19015:1314 1:N:0:NAGCTT
TGCAGGTTTAAGCTCAGAAGACGACGCCCCTGCACTGAAATACAGGAAGAAGATCTGCATGAACCAGCCTCACCTGCGGTTTGAAATCACCAGCGAGGACGGCTTCAGCGTTAAGGCTAACAGCATAGAGGGTAGGAGTG
In the fastq file, each read consists of 4 lines:
You can look at your fastq file yourself. However, these are gz files.
Try to look at your fastq file with head
. What do you
see?
You need to use zcat your_fastqFile
. Warning! This will
print your whole file! You can do this and see what happens.
control + c
will cancel the printing.
Instead, we need to use |
, or a pipe. So we can pipe the
output of zcat
to another command, in this case
head
.
zcat TR-023_RA.fastq.gz | head
We can also count how many lines are in our file. you do this with
wc -l
.
Use zcat
, |
, and wc -l
to
count the number of reads in your fastq files.
We first need to make a directory to hold the output from these quality control steps.
my_materials
, make a directory called
quality_control
.We will use fastqc to check the quality of our sequencing. Briefly, this program scans our fastq files and checks for quality, contamination from adapters, and sequence bias, among others.
Look at the help page of fastqc using fastqc --help
We can then run fastqc for our forward read:
# move to your quality_control directory before running this code
fastqc ~/shared_materials/tutorial_files/TR-023_RA.fastq.gz -o .
You can look at the results of this program by opening (double clicking) the html file in the browswer in cloudlab.
We don’t have much evidence for adapter contamination with this sample, but we do see evidence for some quality issues, which are normal. We can deal with these issues by trimming. Note, this isn’t totally necessary in all cases as many alignment software can deal with low quality bases.
We will use Trim Galore to clean up our data. There are many programs you can use for this- we’re using Trim Galore because it is relatively simple and because it auto detects adapters.
Below is the code to run this program. Again, you can get help using
trim_galore --help
The options that are relevant here are:
--paired
: Telling Trim Galore our data is paired-q 30
: quality threshold for trimming. Commonly 20 or
30 where Q = -10log10(e). so Q20 is 99% probabilty of a correct call and
Q30 is 99.9%.--length 30
:Length at which to toss out short
reads--stringency 4
: the overlap required for an adapter to
be considered a match.Run trim galore on your data.
trim_galore --paired \
\
~/shared_materials/tutorial_files/TR-023_RA.fastq.gz \
~/shared_materials/tutorial_files/TR-023_RB.fastq.gz -q 30 --length 30 --stringency 4
After trim_galore finishes, run fastqc on your newly trimmed data.
We’re not actually doing this!
Short lecture on alignment, reference genomes, and Stacks.
Output files are in sam or bam format where bam files are compressed versions of sam files.
Use head
to look at your bam output. What do you
see?
We need to use another program, samtools
to look at our
bam file. Again, use samtools --help
or just
samtools
to get an overview of this program.
The most basic tool in samtools is samtools view
, which
prints the bam file in plain text. You can see the help page with
samtools view --help
.
the basic syntax is: samtools view NAME.bam
…. remeber
head
and |
.
Look at your bam file using samtools.
You should see something like this:
J00113:190:HFV3LBBXX:7:1223:20912:25914 163 NW_012224401.1 27218 60 150M = 27487 409 CACACAGCTGCTGAGTAGCCAGAATATGAAAAGCAGTGTCAAGCTGTTCATATCTGGGTAAAGTTTTGAGTTTGGAAACGTATCTACAGCTGATATTACATGGTTATAATACACTGATTGATGAGCGTCTTTTCTCTGACTCATCCTGGT AAFFFJJJJJJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJFJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAFJJJJJFFJF<FFJJJJJJJJJJJJJJJJJ NM:i:1 MD:Z:60G89 AS:i:145 XS:i:115 RG:Z:AC-3-TR-023-023
So what’s going on in this file? Its a lot to digest.
Col | Field | Brief_description | our_read |
---|---|---|---|
1 | QNAME | Query template NAME | J00113:190:HFV3LBBXX:7:1223:20912:25914 |
2 | FLAG | bitwise FLAG | 163 |
3 | RNAME | References sequence NAME | NW_012224401.1 |
4 | POS | 1-based leftmost mapping POSition | 27218 |
5 | MAPQ | MAPping Quality | 60 |
6 | CIGAR | CIGAR string | 150M |
7 | RNEXT | Ref. name of the mate/next read | = |
8 | PNEXT | Position of the mate/next read | 27487 |
9 | TLEN | observed Template LENgth | 409 |
10 | SEQ | segment SEQuence | CACACAG… |
11 | QUAL | ASCII of Phred-scaled base QUALity plus 33 | AAFFFFJ… |
12 | NM:i:count | Number of differences between the sequence and reference | NM:i:1 |
13 | MD:Z: | String encoding mismatched and deleted reference bases | MD:Z:60G89 |
14 | AS:i:score | Alignment score generated by aligner | AS:i:145 |
15 | XS:i:score | Seconday alignment score | XS:i:115 |
16 | RG:Z:readgroup | Read group identification | RG:Z:AC-3-TR-023-023 |
Walk through the table above together
The easiest way to look at how well our mapping worked is to use
samtools flagstat TR-023.bam
, which gives us this
output:
1147073 + 0 in total (QC-passed reads + QC-failed reads)
1124360 + 0 primary
0 + 0 secondary
22713 + 0 supplementary
238412 + 0 duplicates
234206 + 0 primary duplicates
1127675 + 0 mapped (98.31% : N/A)
1104962 + 0 primary mapped (98.27% : N/A)
1124360 + 0 paired in sequencing
562180 + 0 read1
562180 + 0 read2
1003448 + 0 properly paired (89.25% : N/A)
1097766 + 0 with itself and mate mapped
7196 + 0 singletons (0.64% : N/A)
86662 + 0 with mate mapped to a different chr
53567 + 0 with mate mapped to a different chr (mapQ>=5)
Short lecture/discussion on these results
Remember the flags in your bam file? You can filter your reads based
on this (which is what samtools flagstat
is doing).
However, we can be more specific. The syntax to do this is:
samtools view -f # TR-023.bam
. Where -f
can
also be -F
and #
is the flag you want to
use.
-f
to find the reads that agree with the flag
statement-F
to find the reads that do not agree with the flag
statementAt this website you can generate any flag definition you want (or get a flag explained).
Go to http://broadinstitute.github.io/picard/explain-flags.html
How would you find the number of mapped reads in your bam file? Do this now.
samtools flagstat
?You can filter by quality by adding a -q ##
option. A
score of 20 is a common threshold.
We can look at the mapping quality overall:
samtools view TR-023.bam | cut -f 5 > ~/my_materials/quality_control/mapq.txt
hist()
).