BecA Bioinformatics

Quick Links

Useful Resources

Transcriptome mapping

Mapping refers to the process of aligning short reads to a reference sequence, whether the reference is a complete genome, transcriptome, or de novo assembly. There are numerous programs that have been developed to map reads to a reference sequence that vary in their algorithms and therefore speed (see Flicek and Birney 2009 and references therein).
In this section, we will map the reads from each of your cleaned and trimmed FASTQ files to the de novo reference assembly that you previously created (or the predicted genes if your study species has a genome sequence). Specifically, we will
1) create an index for the reference assembly (just once),
2) for each sample, map reads to the reference assembly, and
3) convert the resulting file into the SAM file format and append "read group" names to the SAM file for each sample.
Below are guideliens for reference mapping:

Using Star
Using Tophat
Using CLC Genomics workbench

Overview

The goal of mapping is to create an alignment file also known as a Sequence/Alignment Map (SAM) file for each of your samples. This SAM file will contain one line for each of the reads in your sample denoting the reference sequence (genes, contigs, or gene regions) to which it maps, the position in the reference sequence, and a Phred-scaled quality score of the mapping, among other details (Li et al. 2009).
You will use the SAM files for your samples to extract gene expression information (the number of reads that map to each reference sequence) and to identify polymorphisms across your data.

Mapping to a transcriptome reference

If are working with an organism for which there is a reference transcriptome available, you can use the gene annotations to pull out sequences coding for mRNA and use those as the reference for further processing. Download these in the FASTA format and skip to Mapping to reference. This read mapping is similar to genomic reference mapping, since the introns are non-existent in the reference transcriptome.

See reference mapping guide here

Spliced aligning (to Genomic reference) for RNA-Seq

RNA-Seq reads represent short pieces of all the mRNA present in the tissue at the time of sampling. In order to be useful, the reads need to be combined (assembled) into larger fragments, each representing an mRNA transcript. These combined sequences are called "contigs", which is short for "contiguous sequences".
Spliced assemblers are used to align sequence reads to a reference genome.

Note: Since a genome contains the sum of all introns and exons that may be present in a transcript, spliced variants that do not align continuously along the genome may be discounted as actual protein isoforms. Usually, any kind of RNA-seq method will benefit from looking for splicing junctions in addition to genomic mapping:

Transcriptome mapping using STAR

STAR is one of a growing number of short read aligners that takes advantage of advances in computational power to optimize the short read mapping process (original publication: Dobin et al.) The key limitation with STAR is computer RAM - STAR requires at least 30 Gb to align to the human or mouse genomes.

Build a genome index

STAR --runMode genomeGenerate --runThreadN [# cpus] --genomeDir [genome output directory] --genomeFastaFiles [input Genome FASTA file]

Note:

"--genomeSAindexNbases 6"

Align RNA-Seq Reads to the genome with STAR

STAR --genomeDir [Directory with the Genome Index] --runThreadN [# cpus] --readFilesIn [FASTQ file] --outFileNamePrefix [OutputPrefix]

# paried-end data:

STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn read1.fastq read2.fastq --outFileNamePrefix Experiment1Star

Notes on STAR

Transcriptome mapping using Tophat

Tophat is basically a specialized wrapper for bowtie2 - it manipulates your reads and aligns them with bowtie2 in order to identify novel splice junctions. It can also use given splice junctions/gene models to map your data across known splice junctions.

Build Index (takes a while, but only do this once):
- Align your RNA-seq data to the genome using Tophat

tophat -o [output directory] -p [# cpu] [/path-to-genome-index/prefix] [fastq file]

Paired-end Example:

In the example above, we use 8 processors/threads. The tophat2 program contains a mix of serial and parallel routines, so more processors doesn't necessarily mean it will be finished much faster. In particular, if you are performing coverage based searchers, it may take a long time (multiple processors will not help). In general, if you have multiple samples to map, it's better to map them at the same time with separate commands rather than mapping them one at a time with mapping processors.
Tophat places several output files in an output directory. The most important is the "accepted_hits.bam" file - this is the alignment file that will be used for future analysis (more info here). There are additional files that can be useful, including a "junctions.bed" file with records all of the splice junctions discovered in the data.

Important Tophat Parameters:

--library-type [fr-unstranded | fr-firststrand | fr-secondstrand]
-G [GTF file]

Gene Transfer Files

Tophat Mapping Strategy

If your goal is to identify novel transcripts and you have several separate experiments, it is recommended pooling all of your data together into a single expeirment/FASTQ file and mapping your data in one run. This is because one of the ways Tophat tries to identify novel junctions is by first identifying "exons" by mapping segments of reads to the genome using bowtie2. The more segments it's able to map, the more confident it is about putative exons and the greater the chance it will identify unannotated splice sites.
You can then go back with your novel splice sites and remap the original experiments (not pooled) to get reads for each individual experiment using the "-j [raw junction file]". This is most useful for short reads. This general strategy is also useful if running cufflinks...