Back to Manny's Bioinformatics Workshop Home

Processing files and Running GATK

GATK is a very powerful Variant calling software developed at Broad gatk that is recommended by many researchers. However there are some strict rules that it follows thus resulting in a pipeline that needs several steps. The steps below have been provided by notes from the link above.

The dataset for this example is the same from the SNP analysis. Simply replace the PTC_Human.fasta with your genome fasta sequence and Sherman and Cohen with your samples.

But let's start from the beginning. First we create a bowtie index file so we can align our sample sequences to the reference. We will also create and index and a dictionary which is needed by GATK.

module load bowtie2
module load samtools
module load picard

bowtie2-build PTC_Human.fasta PTC_Human
samtools faidx PTC_Human.fasta
picard CreateSequenceDictionary \
   R=PTC_Human.fasta \
   O=PTC_Human.dict

The following steps have to executes for all samples

Once they are all processed

bowtie2 -x PTC_Human -U Sample1.fastq -S Sample1.sam
samtools view -bS Sample1.sam > Sample1.bam

bowtie2 -x PTC_Human -U Sample2.fastq -S Sample2.sam
samtools view -bS Sample2.sam > Sample2.bam

bowtie2 -x PTC_Human -U Sample3.fastq -S Sample3.sam
samtools view -bS Sample3.sam > Sample3.bam

bowtie2 -x PTC_Human -U Sample4.fastq -S Sample4.sam
samtools view -bS Sample4.sam > Sample4.bam

The picard method to sort is preferred by GATK. In some cases PICARD uses the temp directory to do its sorting. You may run into an error that complains about running out of space. To avoid this problem simply create your own tmp directory and tell java that it should use it. See details here.

module load picard/1.133

picard SortSam INPUT=Sample1.bam OUTPUT=Sample1.sorted.bam \
    SORT_ORDER=coordinate

picard SortSam INPUT=Sample2.bam OUTPUT=Sample2.sorted.bam \
    SORT_ORDER=coordinate

picard SortSam INPUT=Sample3.bam OUTPUT=Sample3.sorted.bam \
    SORT_ORDER=coordinate

picard SortSam INPUT=Sample4.bam OUTPUT=Sample4.sorted.bam \
    SORT_ORDER=coordinate

now we have to actually create a read group that describes where the reads are coming from. This is required for GATK

picard AddOrReplaceReadGroups \
   INPUT=Sample1.sorted.bam \
   OUTPUT=Sample1RG.bam \
   RGLB=Sample1 \
   RGPL=IonTorrent \
   RGPU=None \
   RGSM=Sample1

picard AddOrReplaceReadGroups \
   INPUT=Sample2.sorted.bam \
   OUTPUT=Sample2RG.bam \
   RGLB=Sample2 \
   RGPL=IonTorrent \
   RGPU=None \
   RGSM=Sample2

picard AddOrReplaceReadGroups \
   INPUT=Sample3.sorted.bam \
   OUTPUT=Sample3RG.bam \
   RGLB=Sample3 \
   RGPL=IonTorrent \
   RGPU=None \
   RGSM=Sample3

picard AddOrReplaceReadGroups \
   INPUT=Sample4.sorted.bam \
   OUTPUT=Sample4RG.bam \
   RGLB=Sample4 \
   RGPL=IonTorrent \
   RGPU=None \
   RGSM=Sample4

This will remove any reads that map to the same exact place. It is helpful to get rid of artifacts.

picard MarkDuplicates \
   INPUT=Sample1RG.bam \
   OUTPUT=Sample1.dedup.bam \
   METRICS_FILE=Sample1.dedup.metrics \
   REMOVE_DUPLICATES=TRUE \
   ASSUME_SORTED=TRUE \
   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000


picard MarkDuplicates \
   INPUT=Sample2RG.bam \
   OUTPUT=Sample2.dedup.bam \
   METRICS_FILE=Sample2.dedup.metrics \
   REMOVE_DUPLICATES=TRUE \
   ASSUME_SORTED=TRUE \
   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

picard MarkDuplicates \
   INPUT=Sample3RG.bam \
   OUTPUT=Sample3.dedup.bam \
   METRICS_FILE=Sample3.dedup.metrics \
   REMOVE_DUPLICATES=TRUE \
   ASSUME_SORTED=TRUE \
   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

picard MarkDuplicates \
   INPUT=Sample4RG.bam \
   OUTPUT=Sample4.dedup.bam \
   METRICS_FILE=Sample4.dedup.metrics \
   REMOVE_DUPLICATES=TRUE \
   ASSUME_SORTED=TRUE \
   MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000

Index the files and realign them

samtools index Sample1.dedup.bam 
samtools index Sample2.dedup.bam 
samtools index Sample3.dedup.bam 
samtools index Sample4.dedup.bam 

#identifying indels

module load gatk/3.3.0

GenomeAnalysisTK \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Sample1.dedup.bam \
   -o Sample1forIndelRealigner.intervals
 

GenomeAnalysisTK \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Sample1.dedup.bam \
   -targetIntervals Sample1forIndelRealigner.intervals \
   -o Sample1.dedup.realign.bam

GenomeAnalysisTK \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Sample2.dedup.bam \
   -o Sample2forIndelRealigner.intervals 

GenomeAnalysisTK \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Sample2.dedup.bam \
   -targetIntervals Sample2forIndelRealigner.intervals \
   -o Sample2.dedup.realign.bam


GenomeAnalysisTK \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Sample3.dedup.bam \
   -o Sample3forIndelRealigner.intervals
 
GenomeAnalysisTK \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Sample3.dedup.bam \
   -targetIntervals Sample3forIndelRealigner.intervals \
   -o Sample3.dedup.realign.bam

GenomeAnalysisTK \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Sample4.dedup.bam \
   -o Sample4forIndelRealigner.intervals
 

GenomeAnalysisTK \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Sample4.dedup.bam \
   -targetIntervals Sample4forIndelRealigner.intervals \
   -o Sample4.dedup.realign.bam

In some cases there may be a need to clean the sam/bam file(s) (soft-trimming the coordinates). To do this use CleanSam in Picard tools. You may want to just do it to all to avoid the error in a workflow, but it may not be necessary.

picard CleanSam \
   INPUT=Sample4.dedup.realign.bam \
   OUTPUT=Sample4.clean.dedup.realign.bam

Now we merge the bam files and then sort and index them. If you cleaned the bam file, remember to use the cleaned ones.

picard MergeSamFiles \
   INPUT=Sample1.dedup.realign.bam \
   INPUT=Sample2.dedup.realign.bam \
   INPUT=Sample3.dedup.realign.bam \
   INPUT=Sample4.dedup.realign.bam \
   OUTPUT=AllMerged.bam  

picard SortSam INPUT=AllMerged.bam OUTPUT=AllMerged.sorted.bam SORT_ORDER=coordinate

samtools index AllMerged.sorted.bam

Finally !! run gatk

GenomeAnalysisTK -T UnifiedGenotyper \
   -I AllMerged.sorted.bam \
   -R PTC_Human.fasta \
   --output_mode EMIT_VARIANTS_ONLY \
   -ploidy 2 \
   -glm SNP \
   -o PTC_human.gatk.vcf

If you want to load the vcf file into IGV, remember to index it first.

module load igvtools
igvtools index PTC_human.gatk.vcf

If you would like to generate a table of from the vcf file use the following command

GenomeAnalysisTK \
     -R PTC_Human.fasta \
     -T VariantsToTable \
     -V PTC_human.gatk.vcf \
     -F CHROM -F POS -F ID -F QUAL -F AC \
     -GF GT -GF GQ \
     -o PTC_human.gatk.vcf.table

In order to filter your vcf file based on quality measures, depth, and also statistical significance, you can use variant filter option in the gatk toolkit. Below is an example of suggested filters for data that has low coverage.

GenomeAnalysisTK \
    -R PTC_Human.fasta \
    -T VariantFiltration \
    -o PTC_human.gatk.filter.vcf \
    --variant PTC_human.gatk.vcf \
    --filterExpression "QD<2.0||MQ<40.0||FS>60.0||HaplotypeScore>13.0" \
    --filterName mannyfilter

Good descriptions of the different information on vcf files GATK Docs

Finally to save the SNPs that passed your filter, you simply use the selectvariant tool.

GenomeAnalysisTK \
    -T SelectVariants \
    --variant PTC_human.gatk.filter.vcf \
    -o PTC_human.gatk.filter.only.vcf \
    -ef \
    -R PTC_Human.fasta