This is an old revision of the document!
Back to Manny's Bioinformatics Workshop Home
Processing files and Running GATK
GATK is a very powerful Variant calling software developed at Broad gatk that is recommended by many researchers. However there are some strict rules that it follows thus resulting in a pipeline that needs several steps. The steps below have been provided by notes from the link above.
The dataset for this example is the same from the SNP analysis. Simply replace the PTC_Human.fasta with your genome fasta sequence and Sherman and Cohen with your samples.
But let's start from the beginning. First we create a bowtie index file so we can align our sample sequences to the reference. We will also create and index and a dictionary which is needed by GATK.
module load bowtie2 module load samtools bowtie2-build PTC_Human.fasta PTC_Human samtools faidx PTC_Human.fasta java -jar /export/apps/picard-tools/1.112/CreateSequenceDictionary.jar R=PTC_Human.fasta O=PTC_Human.dict
The following steps have to executes for all samples
- align using Bowtie2
- convert to bam
- sort the bam
- create read groups
- remove duplicates
- create a bam index for dedup bam files
- realign samples
- merge all samples to one bam file
- sort and index this merged bam file
- run gatk
bowtie2 -x PTC_Human -U Cohen.fastq -S Cohen.sam samtools view -bS Cohen.sam > Cohen.bam bowtie2 -x PTC_Human -U Sherman.fastq -S Sherman.sam samtools view -bS Sherman.sam > Sherman.bam
The picard method to sort is preferred by GATK
java -jar /export/apps/picard-tools/1.112/SortSam.jar INPUT=Cohen.bam OUTPUT=Cohen.sorted.bam SORT_ORDER=coordinate java -jar /export/apps/picard-tools/1.112/SortSam.jar INPUT=Sherman.bam OUTPUT=Sherman.sorted.bam SORT_ORDER=coordinate
now we have to actually create a read group that describes where the reads are coming from. This is required for GATK
java -jar /export/apps/picard-tools/1.112/AddOrReplaceReadGroups.jar INPUT=Sherman.sorted.bam OUTPUT=ShermanRG.bam RGLB=Sherman RGPL=IonTorrent RGPU=None RGSM=Sherman java -jar /export/apps/picard-tools/1.112/AddOrReplaceReadGroups.jar INPUT=Cohen.sorted.bam OUTPUT=CohenRG.bam RGLB=Cohen RGPL=IonTorrent RGPU=None RGSM=Cohen
This will remove any reads that map to the same exact place. It is helpful to get rid of artifacts.
java -jar /export/apps/picard-tools/1.112/MarkDuplicates.jar INPUT=CohenRG.bam OUTPUT=Cohen.dedup.bam METRICS_FILE=Cohen.dedup.metrics REMOVE_DUPLICATES=TRUE ASSUME_SORTED=TRUE java -jar /export/apps/picard-tools/1.112/MarkDuplicates.jar INPUT=ShermanRG.bam OUTPUT=Sherman.dedup.bam METRICS_FILE=Sherman.dedup.metrics REMOVE_DUPLICATES=TRUE ASSUME_SORTED=TRUE
Index the files and realign them
samtools index Cohen.dedup.bam samtools index Sherman.dedup.bam #identifying indels java -Xmx2g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R PTC_Human.fasta \ -I Cohen.dedup.bam \ -o CohenforIndelRealigner.intervals #identifying indels java -Xmx2g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R PTC_Human.fasta \ -I Sherman.dedup.bam \ -o ShermanforIndelRealigner.intervals java -Xmx4g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R PTC_Human.fasta \ -I Cohen.dedup.bam \ -targetIntervals CohenforIndelRealigner.intervals \ -o Cohen.dedup.realign.bam java -Xmx4g -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \ -T IndelRealigner \ -R PTC_Human.fasta \ -I Sherman.dedup.bam \ -targetIntervals ShermanforIndelRealigner.intervals \ -o Sherman.dedup.realign.bam
Now we merge the bam files and then sort and index them
java -jar /export/apps/picard-tools/1.112/MergeSamFiles.jar INPUT=Sherman.dedup.realign.bam INPUT=Cohen.dedup.realign.bam OUTPUT=ShermanCohenMerged.bam samtools sort ShermanCohenMerged.bam ShermanCohenMerged.sorted samtools index ShermanCohenMerged.sorted.bam
Finall !! run gatk
java -jar /export/apps/GenomeAnalysisTK/GenomeAnalysisTK-2.3-9-ge5ebf34/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -I ShermanCohenMerged.sorted.bam \ -R PTC_Human.fasta \ --output_mode EMIT_VARIANTS_ONLY \ -ploidy 2 \ -glm SNP \ -o PTC_human.gatk.vcf