Understanding and manipulating SAM/BAM alignment files

Once you have mapped your FASTQ file to the reference genome you will normally end up with a SAM or BAM alignment file.
SAM stands for Sequence Alignment/Map format, and BAM is the binary version of a SAM file.
These formats have become industry standards for reporting alignment/mapping information, so it's good to understand them to some degree.
We will also discuss samtools, which is a free software package for manipulating SAM/BAM files. Understanding how to use samtools is important since BAM files are often the input files needed for many different analysis programs.
** Sometimes these files need to be sorted, or have an index, and samtools is an easy way to create one (illustrated later on this page). BEDtools will also be covered at the end to demonstrate some nifty things you can do once you have a BAM file.

SAM file format

For the most part, you don't really need to understand the SAM format since most analysis software will handle it for you. A full description of the SAM format can be found here. SAM files are tab-delimited text files, and can be opened using a text editor or viewed using the UNIX "more" command. Most files will start with a header section, which has lines that start with "@" and looks like this:
@HD VN:1.0 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
...
The SAM alignment has several columns:
  • Read Name
  • SAM flag
  • chromosome (if read is has no alignment, there will be a "*" here)
  • position (1-based index, "left end of read")
  • MAPQ (mapping quality - describes the uniqueness of the alignment, 0=non-unique, >10 probably unique)
  • CIGAR string (describes the position of insertions/deletions/matches in the alignment, encodes splice junctions, for example)
  • Name of mate (mate pair information for paired-end sequencing, often "=")
  • Position of mate (mate pair information)
  • Template length (always zero for me)
  • Read Sequence
  • Read Quality
  • Program specific Flags (i.e. HI:i:0, MD:Z:66G6, etc.)
Obviously, the chromosome and position are important.
The CIGAR string is also important to know where insertions (i.e. introns) might exist in your read. The other two important values in a SAM file are the MAPQ value (column 5) and the SAM flag (column 2).
The MAPQ value can be used to figure out how unique an alignment is in the genome (large number, >10 indicate it's likely the alignment is unique).
NOTE: Usually, the process of removing duplicate reads or removing non-unique alignments is handled by the downstream analysis program.

Working with SAM files

Even the SAM file isn't very useful unless we can get it into a program that generates more readable output or lets us visualize things in a more intuitive way. We need to get the sam output into a sorted BAM file so we can look at it using Samtools later.
As earlier illustrated, type module load samtools to ensure its available for use.
Like bwa, Samtools also requires us to go through several steps before we have our data in usable form. First, we need to have Samtools generate its own index of the reference genome: samtools faidx [reference.fasta]
Next, we need to convert the SAM file into a BAM file. (A BAM file is just a binary version of a SAM file.): samtools import [reference.fasta.fai] RAL357_bwa.sam RAL357_bwa.bam Now, we need to sort the BAM file:: samtools sort RAL357_bwa.bam RAL357_bwa.sorted And last, we need Samtools to index the BAM file: samtools index RAL357_bwa.sorted.bam All done! Now we can use the sorted BAM file in Samtools to visualize our mappings, generate lists of SNPs, and call consensus sequences.

Other SAMTOOLS utilities


Converting BAM to SAM and vice versa

A common thing you may want to do is view the contents of a BAM file. Since BAM files are a binary format, it is hard to read them with a text editor. To convert a bam file to a sam file:
samtools view -h file.bam > file.sam
To convert back to a bam file:
samtools view -b -S file.sam > file.bam
Sorting a BAM file Many of the downstream analysis programs that use BAM files actually require a sorted BAM file. This allows access to reads to be done more efficiently. To sort a BAM file:
samtools sort -m 1000000000 file.bam outputPrefix
The number of -m specifies the maximum memory to use, and can be changed to fit your system. The output will be placed in file named "outputPrefix.bam".
Generate a FASTQ file from a BAM file
bam2fastq -o aligned.fastq --no-unaligned aln.bam

Creating a BAM index

Some tools require a BAM Index file to more efficiently access reads in a BAM file. An example of this is viewing alignments using the UCSC Genome Browser. To create a BAM index, you must first sort the BAM file. After that, run the following command to make a BAM index:
samtools index sorted.bam
This will create a file named sorted.bam.bai which contains the index.

Using BEDtools to perform basic analysis tasks

BEDtools is a powerful suite of tools that enables you to perform several different types of analysis. In many ways, it's like HOMER in that it is flexible and allows you to analyze your data in many different ways.

Creating a Genome Coverage BedGraph

Experiments such as ChIP-Seq and RNA-Seq are means to measure the density of reads in a given location. To build a density graph of your reads across the genome, use the BEDtools program "genomeCoverageBed" to create a bedGraph file.
genomeCoverageBed -ibam SRR065240.notx.bam -bg -trackline -trackopts 'name="notx" color=250,0,0' > notx.bedGraph
The "-trackline" and "-trackopts" options are used to specify parameters for the bedGraph file so that when you upload it to UCSC or the like it looks good. It's also a good idea to gzip file file after creating it to reduce upload times to UCSC.

Visualizing mapped / aligned reads in a Genome Browser

The ability to visualize your raw sequencing aligned to the genome is enabled through the use of genome browsers. Visualizing your data is incredibly important - it enables you to investigate your data and get a feel for how it "looks". This is incredibly important for quality control - programs such as FASTQC, HOMER, etc. can only do so much. Years of experience and knowledge of biological systems make the human brain a good tool to investigate data quality. Also, visualization of sequencing data may help you come up with new ideas about how to analyze the data.

Click here for viewing the output with TView