BecA Bioinformatics

Quick Links

Useful Resources

Welcome to BecA bioinformatics page

De novo assembly
Reference Mapping
Understanding SAM/BAM alignment files
Visualization using genome browsers

Whether we choose to use a de-novo assembly, or a reference guided approach, regardless of whole genome- or RNA seq data, it is CRUCIAL to remove adapter sequences, and low quality bases from your reads.
If you havent already done the quality control (QC) step, on your data,

click here for NGS Quality Control

Whole genome vs Whole exome sequencing

Next-generation DNA and RNA sequencing have made it possible not only to look at individual genomes, but also to rapidly compare genetic sequences among multiple genomes. These approaches can be used, for example, to determine differences in genomes and gene transcripts between individual organisms, between populations, and between different cells. eg normal vs pathologic cells in cancer.
Next-generation technologies can quickly generate a sequence of a whole genome, or can be more targeted using an approach called exome sequencing; which focuses specifically on generating reads from known coding regions.
In contrast to whole genome sequencing, which sequences the entire genome, exome sequencing is a cost-effective approach that can detect single nucleotide or short indel variants in coding regions, and provides sufficient information for many research needs.

RNA-seq is a technique that measures the abundance of RNA transcripts in a sample. It is a powerful tool for understanding dynamics in the transcriptome, including gene expression level difference between different physiologic conditions, or changes that occur during development or over the course of disease progression. Specifically, this application can be used to study phenomena such as:

- gene expression changes,
- alternative splicing events,
- allele-specific gene expression,
- chimeric transcripts, including gene fusion events,
- novel transcripts,
- RNA editing

De novo assembly vs Read mapping

The choice of the approach to use depends on whether there is a reference genome / sequence that you can rely on to guide your alignment, or not. If are working with an organism for which there is a reference genome available that you can use, download the reference in the FASTA format and do a reference mapping as will be described.
However, if you lack a suitable reference that can be used, you need to create your own catalog of contigs by performing a de novo assembly. A de novo assembly joins reads that overlap into contigs, while allowing a certain, user-defined, number of mismatches (variation at nucleotide positions that can be due to sequencing error or biological variation).

READ MAPPING: Different types of Aligners

The approach on reference mapping, depends on the source of reads (genomic vs mRNA); and the type of reference (genomic vs transcriptomic).
There are two types of aligners:

Spliced aligners (ideal for RNA Seq)

Unspliced aligners (ideal for genomic sequences)

De novo assembly: WGS and RNA-Seq

Building a de novo assembly is a very memory-intensive process. There are many programs for this, both commandline (Trinity), and those with a GUI (CLC genomics workbench, Geneious)
Trinity is a represents a program to perform efficient and robust de novo reconstruction of transcriptomes from RNA-seq data. Prior to the development of transcriptome assembly computer programs, transcriptome data were analyzed primarily by mapping on to a reference genome. Though genome alignment is a robust way of characterizing transcript sequences, this method is disadvantaged by its inability to account for incidents of structural alterations of mRNA transcripts, such as alternative splicing.
Unlike genome sequence coverage levels – which can vary randomly as a result of repeat content in non-coding intron regions of DNA – transcriptome sequence coverage levels can be directly indicative of gene expression levels. These repeated sequences also create ambiguities in the formation of contigs in genome assembly, while ambiguities in transcriptome assembly contigs usually correspond to spliced isoforms, or minor variation among members of a gene family.
A number of assembly programs are available. Although these programs have been generally successful in assembling genomes, transcriptome assembly presents some unique challenges.

- Whereas high sequence coverage for a genome may indicate the presence of repetitive sequences (and thus be masked), for a transcriptome, they may indicate abundance.
- In addition, unlike genome sequencing, transcriptome sequencing can be strand-specific, due to the possibility of both sense and antisense transcripts.
- Finally, it can be difficult to reconstruct and tease apart all splicing isoforms.

Short read assemblers generally use one of two basic algorithms: overlap graphs and de Bruijn graphs.

Overlap graphs are utilized for most assemblers designed for Sanger sequenced reads. The overlaps between each pair of reads is computed and compiled into a graph, in which each node represents a single sequence read. This algorithm is more computationally intensive than de Bruijn graphs, and most effective in assembling fewer reads with a high degree of overlap.
De Bruijn graphs align k-mers (usually 25-50 bp) based on k-1 sequence conservation to create contigs. The use of k-mers – which are shorter than the read lengths – in de Bruijn graphs reduces the computational intensity of this method.

When comparing the lengths and numbers of contigs acquired from de novo assemblies to the predicted number of transcripts from genome projects, the de novo contigs typically are shorter and more numerous. This is because the assembler cannot join contigs together unless there is enough overlap and coverage in the reads, so that several different contigs will match one mRNA transcript.

Abyss, velvet oases; soap denovo, trinity

Whether we choose to use a de-novo assembly, or a reference guided approach, it is CRUCIAL to remove adapter sequences, and low quality bases.

Select the below links for the appropriate approach for your data

Performing a Reference-Guided mapping

Performing a de-novo assembly

Popular Genome Browsers:

UCSC Genome Browser

Big pluses: data integration.
Negatives: slower (web based), a little more difficult to upload large custom data sets.

IGV (Integrative Genome Viewer)

Visualising using SAMTOOLS

samtools faidx campy.fa

bowtie2sam.pl xxx-pre-1m.map > xxx-pre-1m.map.sam

samtools import xxx.fa.fai xxx-pre-1m.map.sam xxx-pre-1m.map.bam

samtools sort xxx-pre-1m.map.bam xxx-pre-1m.map.sorted

samtools index xxx-pre-1m.map.sorted.bam

samtools view xxx-pre-1m.map.sorted.bam

Others

check out this link.

Types of custom data files

A general list of common file formats can be found here. Popular formats are shown below:

bed - (*.bed) - BED files are very basic as they simply describe a simple region in the genome. They are usually used to describe ChIP-Seq peaks and things of that nature. Nearly every genome browser supports visualization of BED files.
wiggle - (*.wig) - Wiggle files are used to display quantitative information across genomic regions. Usually used to display read depth from ChIP- or RNA-seq experiments. Wiggle format is compact and displays data at regular intervals.
bedGraph - (*.bedGraph) - Similar to Wiggle files, these are used to display quantitative data across genomic regions. They use variable length intervals instead of constant intervals found in wiggle files, and are usually a little bigger in size.
bam - (*.bam) - Display individual reads. Bam files need to be sorted, and they need to have an index file along with the bam file to help the genome browser efficiently find reads in the bam file. It's best to use a local browser like IGV when visualizing bam files.
GFF/GTF - (*.gff *.gtf) - Extensible file formats for specifying spliced transcripts and genes. Transcript assembly programs like cufflinks will generate GTF files that you can then upload to a genome browser.

Server Resident Files: In the case of web-based genome browsers such as UCSC, it can be difficult to upload large data files. To get around this issue, UCSC set up protocols to allow you to post your files on a webserver and then create a track that "points" to the location of your files. This requires a working webserver, but can be a powerful way to visualize bigwig files. Click for More Info

Creating genome browser files

There are several specialized programs for creating genome browser files. Often, the output of a program is already suited to be loaded into a genome browser. For example, cufflinks generates a "transcripts.gtf" file. Macs creates a peak bed file. Other common examples:

Visualize bam alignment files using samtools Visualize read coverage files using BEDtools

Loading custom data into the UCSC Genome Browser

To look at your own data using the UCSC Genome Browser, click on the "Genomes" at the top and look for the "Add Custom Tracks" or "Manage Custom Tracks" button:

UCSC custom track: After uploading your track, the data should appear in the genome browser. At the bottom of the browser image you'll find a variety of track settings. The section at the top controls the settings for custom tracks:

UCSC settings: You can change how the track is visualized by clicking on the drop down menu, shown above. You can also click on the 'blue' link name and change other custom settings.
Visual Quality Control
There are several things to look out for when viewing your data in the browser. Below is a checklist to help guide you.

Look for spikes in the data. These may be caused by contaminants, and may cause problems with data analysis. ChIP-Seq
Are there nice, defined peaks in the data? Or are there regions of continuous coverage (histone marks)?
Are there reads on all expected chromsomes?
Does the pattern match the experiment?
TFs: Spikes of enrichment near the TSS and distal regulatory elements
H3K4me3 - enriched near TSS
H3K4me1/2, H3/H4ac, DNase - enriched near TSS and near distal regulatory elements
H3K36me3 - enriched across gene bodies
H3K27me3 - enriched near CpG Islands of inactive genes
H3K9me3 - enriched across broad domains and repeat elements
Is the background low, or almost as high as the expected peaks? RNA-Seq
Are most reads found on exons? Or is there a lot of reads in introns/other regions?
Do you have even read coverage across exons, or is it full of strong spikes?
Is there a 3' or 5' bias in the data?
If strand specific, is it the correct strand?