Processing files and Running GATK

GATK is a very powerful Variant calling software developed at Broad gatk that is recommended by many researchers. However there are some strict rules that it follows thus resulting in a pipeline that needs several steps. The steps below have been provided by notes from the link above.

The dataset for this example is the same from the SNP analysis. Simply replace the PTC_Human.fasta with your genome fasta sequence and Sherman and Cohen with your samples.

But let's start from the beginning. First we create a bowtie index file so we can align our sample sequences to the reference. We will also create and index and a dictionary which is needed by GATK.

module load bowtie2
module load samtools

bowtie2-build PTC_Human.fasta PTC_Human
samtools faidx PTC_Human.fasta
java -jar /export/apps/picard-tools/1.112/CreateSequenceDictionary.jar \
   R=PTC_Human.fasta \
   O=PTC_Human.dict

The following steps have to executes for all samples

align using Bowtie2
convert to bam
sort the bam
create read groups
remove duplicates
create a bam index for dedup bam files
realign samples

Once they are all processed

merge all samples to one bam file
sort and index this merged bam file
run gatk

bowtie2 -x PTC_Human -U Cohen.fastq -S Cohen.sam
samtools view -bS Cohen.sam > Cohen.bam

The picard method to sort is preferred by GATK

java -jar /export/apps/picard-tools/1.112/SortSam.jar \
   INPUT=Cohen.bam \
   OUTPUT=Cohen.sorted.bam \
   SORT_ORDER=coordinate

now we have to actually create a read group that describes where the reads are coming from. This is required for GATK

java -jar /export/apps/picard-tools/1.112/AddOrReplaceReadGroups.jar \
   INPUT=Cohen.sorted.bam \
   OUTPUT=CohenRG.bam \
   RGLB=Cohen \
   RGPL=IonTorrent \
   RGPU=None \
   RGSM=Cohen

This will remove any reads that map to the same exact place. It is helpful to get rid of artifacts.

java -jar /export/apps/picard-tools/1.112/MarkDuplicates.jar \
   INPUT=CohenRG.bam \
   OUTPUT=Cohen.dedup.bam \
   METRICS_FILE=Cohen.dedup.metrics \
   REMOVE_DUPLICATES=TRUE \
   ASSUME_SORTED=TRUE

Index the files and realign them

samtools index Cohen.dedup.bam 

#identifying indels
java -Xmx2g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Cohen.dedup.bam \
   -o CohenforIndelRealigner.intervals
 

 
 java -Xmx4g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Cohen.dedup.bam \
  -targetIntervals CohenforIndelRealigner.intervals \
   -o Cohen.dedup.realign.bam

In some cases there may be a need to clean the sam/bam file(s) (soft-trimming the coordinates). To do this use CleanSam in Picard tools. You may want to just do it to all to avoid the error in a workflow, but it may not be necessary.

java -jar /export/apps/picard-tools/1.112/CleanSam.jar \
   INPUT=Sherman.dedup.realign.bam \
   OUTPUT=Sherman.clean.dedup.realign.bam

Now we merge the bam files and then sort and index them. If you cleaned the bam file, remember to use the cleaned ones.

java -jar /export/apps/picard-tools/1.112/MergeSamFiles.jar \
   INPUT=Sherman.clean.dedup.realign.bam \
   INPUT=Cohen.dedup.realign.bam \
   OUTPUT=ShermanCohenMerged.bam  

samtools sort ShermanCohenMerged.bam ShermanCohenMerged.sorted

samtools index ShermanCohenMerged.sorted.bam

Finally !! run gatk

java -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T UnifiedGenotyper \
   -I ShermanCohenMerged.sorted.bam \
   -R PTC_Human.fasta \
   --output_mode EMIT_VARIANTS_ONLY \
   -ploidy 2 \
   -glm SNP \
   -o PTC_human.gatk.vcf

If you want to load the vcf file into IGV, remember to index it first.

module load igvtools
igvtools index PTC_human.gatk.vcf

If you would like to generate a table of from the vcf file use the following command

java -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
     -R PTC_Human.fasta
     -T VariantsToTable \
     -V PTC_human.gatk.vcf \
     -F CHROM -F POS -F ID -F QUAL -F AC \
     -GF GT -GF GQ \
     -o PTC_human.gatk.vcf.table