Whether we choose to use a de-novo assembly, or a reference guided approach, regardless of whole genome- or RNA seq data, it is CRUCIAL to remove adapter sequences, and low quality bases from your reads.
If you havent already done the quality control (QC) step, on your data,
Whole genome vs Whole exome sequencing
Next-generation DNA and RNA sequencing have made it possible not only to look at individual genomes, but also to rapidly compare genetic sequences among multiple genomes. These approaches can be used, for example, to determine differences in genomes and gene transcripts between individual organisms, between populations, and between different cells. eg normal vs pathologic cells in cancer.
Next-generation technologies can quickly generate a sequence of a whole genome, or can be more targeted using an approach called exome sequencing; which focuses specifically on generating reads from known coding regions.
In contrast to whole genome sequencing, which sequences the entire genome, exome sequencing is a cost-effective approach that can detect single nucleotide or short indel variants in coding regions, and provides sufficient information for many research needs.
RNA-seq is a technique that measures the abundance of RNA transcripts in a sample. It is a powerful tool for understanding dynamics in the transcriptome, including gene expression level difference between different physiologic conditions, or changes that occur during development or over the course of disease progression. Specifically, this application can be used to study phenomena such as:
- gene expression changes,
- alternative splicing events,
- allele-specific gene expression,
- chimeric transcripts, including gene fusion events,
- novel transcripts,
- RNA editing
De novo assembly vs Read mapping
The choice of the approach to use depends on whether there is a reference genome / sequence that you can rely on to guide your alignment, or not.
If are working with an organism for which there is a reference genome available that you can use, download the reference in the FASTA format and do a reference mapping as will be described.
However, if you lack a suitable reference that can be used, you need to create your own catalog of contigs by performing a de novo assembly. A de novo assembly joins reads that overlap into contigs, while allowing a certain, user-defined, number of mismatches (variation at nucleotide positions that can be due to sequencing error or biological variation).
READ MAPPING: Different types of Aligners
The approach on reference mapping, depends on the source of reads (genomic vs mRNA); and the type of reference (genomic vs transcriptomic).
There are two types of aligners:
- Spliced aligners (ideal for RNA Seq)
These aligners are ideal for transcriptome assembly. These assembly programs can assemble reads spanning splice junctions (Exon-Exon junctions).
Some Spliced aligners employ Short aligners to align firstly unspliced/continuous reads (exon-first approach), and after follow a different strategy to align the rest containing spliced regions - normally the reads are split into smaller segments and mapped independently.
- Unspliced aligners (ideal for genomic sequences)
These short-read aligners align continuous reads (not containing gaps result of splicing) to a genome of reference. Basically, there are two types:
1) based on the Burrows-Wheeler transform method such as Bowtie and BWA, and
2) based on Seed-extend methods, Needleman-Wunsch or Smith-Waterman algorithms (used in alignment of DNA and protein sequences).
De novo assembly: WGS and RNA-Seq
Building a de novo assembly is a very memory-intensive process. There are many programs for this, both commandline (Trinity), and those with a GUI (CLC genomics workbench, Geneious)
Trinity is a represents a program to perform efficient and robust de novo reconstruction of transcriptomes from RNA-seq data.
Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping on to a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is disadvantaged by its inability to account for incidents of structural alterations of mRNA transcripts, such as
alternative splicing.
Unlike genome sequence coverage levels – which can vary randomly as a result of repeat content in non-coding intron regions of DNA – transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced isoforms, or minor variation among members of a gene family.
A number of assembly programs are available. Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges.
- Whereas high sequence coverage for a genome may indicate the presence of repetitive sequences (and thus be masked), for a transcriptome, they may indicate abundance.
- In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts.
- Finally, it can be difficult to reconstruct and tease apart all splicing isoforms.
Short read assemblers generally use one of two basic algorithms:
overlap graphs and
de Bruijn graphs.
- Overlap graphs are utilized for most assemblers designed for Sanger sequenced reads. The overlaps between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and most effective in assembling fewer reads with a high degree of overlap.
- De Bruijn graphs align k-mers (usually 25-50 bp) based on k-1 sequence conservation to create contigs. The use of k-mers – which are shorter than the read lengths – in de Bruijn graphs reduces the computational intensity of this method.
When comparing the lengths and numbers of contigs acquired from de novo assemblies to the predicted number of transcripts from genome projects, the de novo contigs typically are shorter and more numerous. This is because the assembler cannot join contigs together unless there is enough overlap and coverage in the reads, so that several different contigs will match one mRNA transcript.
Abyss, velvet oases; soap denovo, trinity
Whether we choose to use a de-novo assembly, or a reference guided approach, it is CRUCIAL to remove adapter sequences, and low quality bases.
Select the below links for the appropriate approach for your data
Performing a Reference-Guided mapping
Performing a de-novo assembly
Popular Genome Browsers:
- UCSC Genome Browser
This is an awesome genome browser that puts lots of different information at your finger tips, including lots of published studies and ENCODE data. Big pluses: data integration.
Negatives: slower (web based), a little more difficult to upload large custom data sets.
- IGV (Integrative Genome Viewer)
This browser runs locally on your own computer (the more memory you have the better). It is Java based, and is easy to use on almost any computer. It doesn't have the same degree of shared information available as UCSC, but it is much faster for browsing across the genome. Also, it is better for looking at individual reads/looking for variants.
- Visualising using SAMTOOLS
As seen before, Samtools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. SAM (Sequence Alignment/Map) format is a generic format. for storing large nucleotide sequence alignments.
- Indexing the genome
samtools faidx campy.fa
- Convert bowtie mappings to sam
bowtie2sam.pl xxx-pre-1m.map > xxx-pre-1m.map.sam
- Convert SAM -> BAM samtools import xxx.fa.fai xxx-pre-1m.map.sam xxx-pre-1m.map.bam
- Sort the BAM file samtools sort xxx-pre-1m.map.bam xxx-pre-1m.map.sorted
- Index BAM samtools index xxx-pre-1m.map.sorted.bam
- Browse the mappings: samtools view xxx-pre-1m.map.sorted.bam
- Others
There are tons of genome browsers out there that serve many different needs. For a list, check out this link.
Types of custom data files
A general list of common file formats can be found here. Popular formats are shown below:
bed - (*.bed) - BED files are very basic as they simply describe a simple region in the genome. They are usually used to describe ChIP-Seq peaks and things of that nature. Nearly every genome browser supports visualization of BED files.
wiggle - (*.wig) - Wiggle files are used to display quantitative information across genomic regions. Usually used to display read depth from ChIP- or RNA-seq experiments. Wiggle format is compact and displays data at regular intervals.
bedGraph - (*.bedGraph) - Similar to Wiggle files, these are used to display quantitative data across genomic regions. They use variable length intervals instead of constant intervals found in wiggle files, and are usually a little bigger in size.
bam - (*.bam) - Display individual reads. Bam files need to be sorted, and they need to have an index file along with the bam file to help the genome browser efficiently find reads in the bam file. It's best to use a local browser like IGV when visualizing bam files.
GFF/GTF - (*.gff *.gtf) - Extensible file formats for specifying spliced transcripts and genes. Transcript assembly programs like cufflinks will generate GTF files that you can then upload to a genome browser.
Server Resident Files:
In the case of web-based genome browsers such as UCSC, it can be difficult to upload large data files. To get around this issue, UCSC set up protocols to allow you to post your files on a webserver and then create a track that "points" to the location of your files. This requires a working webserver, but can be a powerful way to visualize bigwig files.
Click for More Info
Creating genome browser files
There are several specialized programs for creating genome browser files. Often, the output of a program is already suited to be loaded into a genome browser. For example, cufflinks generates a
"transcripts.gtf" file. Macs creates a peak bed file. Other common examples:
Visualize bam alignment files using samtools
Visualize read coverage files using BEDtools
Loading custom data into the UCSC Genome Browser
To look at your own data using the UCSC Genome Browser, click on the "Genomes" at the top and look for the "Add Custom Tracks" or "Manage Custom Tracks" button:
UCSC custom track:
After uploading your track, the data should appear in the genome browser. At the bottom of the browser image you'll find a variety of track settings. The section at the top controls the settings for custom tracks:
UCSC settings:
You can change how the track is visualized by clicking on the drop down menu, shown above. You can also click on the 'blue' link name and change other custom settings.
Visual Quality Control
There are several things to look out for when viewing your data in the browser. Below is a checklist to help guide you.
- Look for spikes in the data. These may be caused by contaminants, and may cause problems with data analysis.
ChIP-Seq
- Are there nice, defined peaks in the data? Or are there regions of continuous coverage (histone marks)?
- Are there reads on all expected chromsomes?
- Does the pattern match the experiment?
- TFs: Spikes of enrichment near the TSS and distal regulatory elements
- H3K4me3 - enriched near TSS
- H3K4me1/2, H3/H4ac, DNase - enriched near TSS and near distal regulatory elements
- H3K36me3 - enriched across gene bodies
- H3K27me3 - enriched near CpG Islands of inactive genes
-
- H3K9me3 - enriched across broad domains and repeat elements
- Is the background low, or almost as high as the expected peaks?
RNA-Seq
- Are most reads found on exons? Or is there a lot of reads in introns/other regions?
- Do you have even read coverage across exons, or is it full of strong spikes?
- Is there a 3' or 5' bias in the data?
- If strand specific, is it the correct strand?