BecA Bioinformatics

Quick Links

Useful Resources

Quality control of illumina reads

This module will guide you on how to go about quality control of data generated by the MiSeq. The objectives of this section are to

Visualise the quality of your reads
remove any adapter sequences present in the data,
remove all bases with a Phred quality score of less than 20,
graph the distributions of quality scores and nucleotides, and
calculate the fractions of duplicate and singleton reads in the data.

The main tool we will use for assessing the quality of Illumina reads is FastQC. FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

LINK: FASTQC Webpage

What is a FASTQ file?

Once the sequencing is finished, the data becomes available for download as "fastq" text files, in which each short read takes up four lines:

The first line (starting with an @) is a read identifier,

the second is the DNA sequence,

the third another identifier (same as line 1, but starting with a +(or sometimes only consisting of a +))

the fourth is a Phred quality score symbol for each base in the read.

The quality score of a base is known as a Phred or Q score. It is an integer representing the estimated probability of an error, based on the ASCII character code used by computer keyboards (http://www.ascii-code.com/).

Illumina's current sequencing pipeline (as of January 2012) uses an offset of 64, so that an @ (ASCII code 64) is 0, and h (ASCII code 104) is 40 (104-64)

Other versions of the pipeline might use different offsets, however. If you have data with a different offset value, you will need to modify your commands accordingly to inform programs that this is the case

The quality score for each base ranges from -5 to 40 and is defined as Qphred =-10 log10(p), where p is the estimated probability of a base call being wrong. So a Qphred of 20 corresponds to a 99 % probability of a correctly identified base.

Checking the Quality of the Sequencing reads using FASTQC

Using Graphical User Interface (GUI)
Using the GUI is easy and intuitive. However it is time consuming, and cannot handle multiple files at a time (Batch processing). However, it is ideal when you want a quick look at few fastq files.
To use the GUI, you need:
- Have a suitable Java Runtime environmnet installed in your computer
- Install an X11 client for your machine, such as XQuartz. If you are using MobaXTerm, this is not necessary, as X11 is part of the software.
- Enable X11 forwarding during login, by logging in using the -X parameter
  ssh -X username@hpc.ilri.cgiar.org
After logging in, type: module load fastqc
fastqc After a second or two, a fastqc splash screen will appear.
To do a QC on your file, click on File | Open | Browse to where you have saved your sequences and select them. The analysis will start immediately, and you will be able to browse through the results.

Using command line
As previously illustrated, using command line interface (CLI) is much more powerful, fast and convenient when handling and processing multiple files at a go, as compared to the Graphical User Interface (GUI).

After logging in, cd to the directory where your sequences are, then type: module load fastqc
fastqc sequence1 sequence2 sequence3 ... sequenceN where sequence1 sequence2 sequence3 are the fastq sequence files.

Alternatively, if you have many fastq files and would like to do analysis for all of them you could use an asterix ‘*’ as shown below to select all sequences with the extension ‘.fastq’. fastqc *.fastq The sequences are saved in the current directory as a HTML, and as a zip file. Download the Zip file to your local computer, Unzip it, then open the html file to visualise your result.

To have a good understanding of the good data vs bad data, click on the following links:
Good quality sequences
Bad quality sequence

Removal of adapter sequences

Trimming your data would be the next step after looking at its quality and making judgment on what data you need to trim. From the QC reports generated by FastQC it is possible to see if you have adapter contamination. It is necessary to cleave off these adapters.

Adapter trimming using Scythe

adapter.fasta

vim

>N02
ATCGTGG
>N05
AFCTCTA

module load scythe
scythe –a adapter.fasta read_R1.fastq -o read_R1.ad.fastq
scythe –a adapter.fasta read_R2.fastq -o read_R2.ad.fastq

scythe –a adapter.fasta read_R1.fastq -o read_R1.ad.fastq 2> read_R1.as.stdErr
scythe –a adapter.fasta read_R2.fastq -o read_R2.ad.fastq 2> read_R2.as.stdErr

Adapter trimming using Fastx-Toolkit

module load fastx_toolkit
fastx_clipper -a ACTTGA –i sample.fastq –o sample.ad.fastq –v –Q 33

(-a) states the sequence of the adapter,
(-i) allows you to define your input fastq file
(-o) states what your ouput file should be named. we add the extension; ad.fastq to distinguish the input and output.
(-v) verbose allows reports to be printed to the STDERR.

Read Trimming

Trimming your data would be the next step after looking at its quality and making judgment on what data you need to trim
We will look at how to do adapter trimming using:

SolexaQA

DynamicTrim

module load SolexaQA++
module load R
mkdir dynamictrim
SolexaQA++ dynamictrim read_R1.ad.fastq -h 22 -b -d dynamictrim/ --sanger

SolexaQA

(SolexaQA++ dynamictrim)

(read_R1.ad.fastq)

sbatch dynamicTrim.sbatch

#!/bin/env bash
#SBATCH -p batch
#SBATCH -n 8
#SBATCH -J dynamictrim
#SBATCH -w taurus

#load the programs needed Solexa and R
module load SolexaQA++/3.1.3
module load R

# create a directory to direct your results
rm -rf dynamictrim
mkdir dynamictrim

#Run fastqc on all the “.ad.fastq” files in your current directory
for i in *.ad.fastq
do
SolexaQA++ dynamictrim $i -h 22 -b -d dynamictrim/ --sanger
done

.trimmed

FASTX-Toolkit

fastq_quality_filter

fastx_trimmer

module load fastx_toolkit
fastq_quality_filter -q 20 -p 80 -Q 30 -i read_R1.ad.fastq | fastx_trimmer -f 15 -Q 30 -o read_R1.ad.trim.fastq

call the tool fastq_quality_filter,
(-i) states the input / read to be process
(-q) minimum quality score to keep
(-p) Minimum percent of bases that must have
[-q] quality
(|) the pipe passes the results of the fastq_quality_filer to fastx_trimmer.
its an easy way of passing results of one program to the next. The options
(-f) states the first base to keep “any base before 15 in this case is trimmed off”
(-l) last base to keep can also be stated if you want to trim the 3’ end of your data.
(-o) states the output file to which results should be written.

Run FastQC on the adapter-trimmed reads, and then on the adapter- and quality-trimmed reads, and examine the output.

for i in *.fastq;
do
echo working with "$i"
newfile="$(basename $i .fastq)"
fastq_quality_trimmer -Q33 -v -t 20 - l 20 -i $i |fastx_clipper -Q33 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG - l 20 -n -v > ${newfile}.trimmed.clipped.fastq
done;

Optional: Additional steps

Duplicate reads and singletons

YOURFILE_collapsed.txt

(fastqduplicatecounter.py)

for i in *.clipped.fastq;
do
echo working with "$i"
newfile="$(basename $i .trimmed.clipped.fastq)"
fastx_collapser -Q33 -v -i $i -o temp.collapsed.txt
python ../scripts/fastqduplicatecounter.py temp.collapsed.txt temp.collapsed.headers.txt>> ${newfile}.duplicateCount.txt
rm temp.collapsed.headers.txt
rm temp.collapsed.txt
done;

YOURFILE_duplicatecount.txt

Quality stats

for i in *.clipped.fastq; do echo working with "$i" newfile="$(basename $i .trimmed.clipped.fastq)" fastx_quality_stats -Q33 -i $i -o ${newfile}.qualstats.txt done;

Quality score boxplot of 50-bp Illumina reads (after quality trimming, Q>20), summarized by read position. Lower scores in the beginning of the reads is an artifact of the software used to calculate base quality scores.

Nucleotide distribution chart of 50-bp Illumina reads, summarized by read position. A non-random distribution in the first 12 bases is common, and is thought to be an artifact of the random hexamer priming during sample preparation.

Combine Paired Read

PEcombiner.sh

PE combiner

fastxcombinepairedend.py

How do I know the SEQHEADER and DELIMITER

head –n 1 YOURFILE_trimmed_clipped.fastq

SEQHEADER will be the first 4 characters of this line (including the @ symbol).
DELIMITER will be the character separating the last part of the identifier (that tells the software if the read is a forward or reverse read) from the rest of the identifier. It is usually either a "/" or a " " (space) character.

./PECombiner.sh

_stillpaired

_singles

fastxcombinepairedend.py

I: YOURFILE_trimmed_clipped#1_1.fastq (forward)
YOURFILE_trimmed_clipped#1_2.fastq (reverse)
O: YOURFILE_1_trimmed_clipped_stillpaired.fastq
YOURFILE_2_trimmed_clipped_stillpaired.fastq
YOURFILE_trimmed_clipped_singles.fastq

Summary

We have now cleaned up the raw data, so that they can be used for creating a de novo assembly or for mapping against a reference.
Low quality bases and adapter sequences have been removed. We have also verified that the reads are not all identical, which would suggest an error somewhere in the sample preparation pipeline. We have also examined the trimmed dataset to make sure that quality scores are high and that nucleotides are evenly distributed.