Processing files and Running GATK

GATK is a very powerful Variant calling software developed at Broad gatk that is recommended by many researchers. However there are some strict rules that it follows thus resulting in a pipeline that needs several steps. The steps below have been provided by notes from the link above.

The dataset for this example is the same from the SNP analysis. Simply replace the PTC_Human.fasta with your genome fasta sequence and Sherman and Cohen with your samples.

But let's start from the beginning. First we create a bowtie index file so we can align our sample sequences to the reference. We will also create and index and a dictionary which is needed by GATK.

module load bowtie2
module load samtools

bowtie2-build PTC_Human.fasta PTC_Human
samtools faidx PTC_Human.fasta
java -jar /export/apps/picard-tools/1.112/CreateSequenceDictionary.jar R=PTC_Human.fasta O=PTC_Human.dict

The following steps have to executes for all samples

align using Bowtie2
convert to bam
sort the bam
create read groups
remove duplicates
create a bam index for dedup bam files
realign samples
merge all samples to one bam file
sort and index this merged bam file
run gatk

bowtie2 -x PTC_Human -U Cohen.fastq -S Cohen.sam
samtools view -bS Cohen.sam > Cohen.bam
bowtie2 -x PTC_Human -U Sherman.fastq -S Sherman.sam
samtools view -bS Sherman.sam > Sherman.bam

The picard method to sort is preferred by GATK

java -jar /export/apps/picard-tools/1.112/SortSam.jar INPUT=Cohen.bam OUTPUT=Cohen.sorted.bam SORT_ORDER=coordinate
java -jar /export/apps/picard-tools/1.112/SortSam.jar INPUT=Sherman.bam OUTPUT=Sherman.sorted.bam SORT_ORDER=coordinate

now we have to actually create a read group that describes where the reads are coming from. This is required for GATK

java -jar /export/apps/picard-tools/1.112/AddOrReplaceReadGroups.jar INPUT=Sherman.sorted.bam OUTPUT=ShermanRG.bam RGLB=Sherman RGPL=IonTorrent RGPU=None RGSM=Sherman

java -jar /export/apps/picard-tools/1.112/AddOrReplaceReadGroups.jar INPUT=Cohen.sorted.bam OUTPUT=CohenRG.bam RGLB=Cohen RGPL=IonTorrent RGPU=None RGSM=Cohen

This will remove any reads that map to the same exact place. It is helpful to get rid of artifacts.

java -jar /export/apps/picard-tools/1.112/MarkDuplicates.jar INPUT=CohenRG.bam OUTPUT=Cohen.dedup.bam METRICS_FILE=Cohen.dedup.metrics REMOVE_DUPLICATES=TRUE ASSUME_SORTED=TRUE

java -jar /export/apps/picard-tools/1.112/MarkDuplicates.jar INPUT=ShermanRG.bam OUTPUT=Sherman.dedup.bam METRICS_FILE=Sherman.dedup.metrics REMOVE_DUPLICATES=TRUE ASSUME_SORTED=TRUE

Index the files and realign them

samtools index Cohen.dedup.bam 
samtools index Sherman.dedup.bam 

#identifying indels
java -Xmx2g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Cohen.dedup.bam \
   -o CohenforIndelRealigner.intervals
 
 #identifying indels
java -Xmx2g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T RealignerTargetCreator \
   -R PTC_Human.fasta \
   -I Sherman.dedup.bam \
   -o ShermanforIndelRealigner.intervals

 
 java -Xmx4g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Cohen.dedup.bam \
  -targetIntervals CohenforIndelRealigner.intervals \
   -o Cohen.dedup.realign.bam

 java -Xmx4g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T IndelRealigner \
   -R PTC_Human.fasta \
   -I Sherman.dedup.bam \
  -targetIntervals ShermanforIndelRealigner.intervals \
   -o Sherman.dedup.realign.bam

Now we merge the bam files and then sort and index them

java -jar /export/apps/picard-tools/1.112/MergeSamFiles.jar INPUT=Sherman.dedup.realign.bam INPUT=Cohen.dedup.realign.bam OUTPUT=ShermanCohenMerged.bam  
samtools sort ShermanCohenMerged.bam ShermanCohenMerged.sorted
samtools index ShermanCohenMerged.sorted.bam

Finall !! run gatk

java -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \
   -T UnifiedGenotyper \
   -I ShermanCohenMerged.sorted.bam \
   -R PTC_Human.fasta \
   --output_mode EMIT_VARIANTS_ONLY \
   -ploidy 2 \
   -glm SNP \
   -o PTC_human.gatk.vcf