Back to Manny's Bioinformatics Workshop Home
Processing files and Running GATK
GATK is a very powerful Variant calling software developed at Broad gatk that is recommended by many researchers. However there are some strict rules that it follows thus resulting in a pipeline that needs several steps. The steps below have been provided by notes from the link above.
The dataset for this example is the same from the SNP analysis. Simply replace the PTC_Human.fasta with your genome fasta sequence and Sherman and Cohen with your samples.
But let's start from the beginning. First we create a bowtie index file so we can align our sample sequences to the reference. We will also create and index and a dictionary which is needed by GATK.
module load bowtie2 module load samtools module load picard bowtie2-build PTC_Human.fasta PTC_Human samtools faidx PTC_Human.fasta picard CreateSequenceDictionary \ R=PTC_Human.fasta \ O=PTC_Human.dict
The following steps have to executes for all samples
- align using Bowtie2
- convert to bam
- sort the bam
- create read groups
- remove duplicates
- create a bam index for dedup bam files
- realign samples
Once they are all processed
- merge all samples to one bam file
- sort and index this merged bam file
- run gatk
bowtie2 -x PTC_Human -U Sample1.fastq -S Sample1.sam samtools view -bS Sample1.sam > Sample1.bam bowtie2 -x PTC_Human -U Sample2.fastq -S Sample2.sam samtools view -bS Sample2.sam > Sample2.bam bowtie2 -x PTC_Human -U Sample3.fastq -S Sample3.sam samtools view -bS Sample3.sam > Sample3.bam bowtie2 -x PTC_Human -U Sample4.fastq -S Sample4.sam samtools view -bS Sample4.sam > Sample4.bam
The picard method to sort is preferred by GATK. In some cases PICARD uses the temp directory to do its sorting. You may run into an error that complains about running out of space. To avoid this problem simply create your own tmp directory and tell java that it should use it. See details here.
module load picard/1.133 picard SortSam INPUT=Sample1.bam OUTPUT=Sample1.sorted.bam \ SORT_ORDER=coordinate picard SortSam INPUT=Sample2.bam OUTPUT=Sample2.sorted.bam \ SORT_ORDER=coordinate picard SortSam INPUT=Sample3.bam OUTPUT=Sample3.sorted.bam \ SORT_ORDER=coordinate picard SortSam INPUT=Sample4.bam OUTPUT=Sample4.sorted.bam \ SORT_ORDER=coordinate
now we have to actually create a read group that describes where the reads are coming from. This is required for GATK
picard AddOrReplaceReadGroups \ INPUT=Sample1.sorted.bam \ OUTPUT=Sample1RG.bam \ RGLB=Sample1 \ RGPL=IonTorrent \ RGPU=None \ RGSM=Sample1 picard AddOrReplaceReadGroups \ INPUT=Sample2.sorted.bam \ OUTPUT=Sample2RG.bam \ RGLB=Sample2 \ RGPL=IonTorrent \ RGPU=None \ RGSM=Sample2 picard AddOrReplaceReadGroups \ INPUT=Sample3.sorted.bam \ OUTPUT=Sample3RG.bam \ RGLB=Sample3 \ RGPL=IonTorrent \ RGPU=None \ RGSM=Sample3 picard AddOrReplaceReadGroups \ INPUT=Sample4.sorted.bam \ OUTPUT=Sample4RG.bam \ RGLB=Sample4 \ RGPL=IonTorrent \ RGPU=None \ RGSM=Sample4
This will remove any reads that map to the same exact place. It is helpful to get rid of artifacts.
picard MarkDuplicates \ INPUT=Sample1RG.bam \ OUTPUT=Sample1.dedup.bam \ METRICS_FILE=Sample1.dedup.metrics \ REMOVE_DUPLICATES=TRUE \ ASSUME_SORTED=TRUE \ MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 picard MarkDuplicates \ INPUT=Sample2RG.bam \ OUTPUT=Sample2.dedup.bam \ METRICS_FILE=Sample2.dedup.metrics \ REMOVE_DUPLICATES=TRUE \ ASSUME_SORTED=TRUE \ MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 picard MarkDuplicates \ INPUT=Sample3RG.bam \ OUTPUT=Sample3.dedup.bam \ METRICS_FILE=Sample3.dedup.metrics \ REMOVE_DUPLICATES=TRUE \ ASSUME_SORTED=TRUE \ MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 picard MarkDuplicates \ INPUT=Sample4RG.bam \ OUTPUT=Sample4.dedup.bam \ METRICS_FILE=Sample4.dedup.metrics \ REMOVE_DUPLICATES=TRUE \ ASSUME_SORTED=TRUE \ MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
Index the files and realign them
samtools index Sample1.dedup.bam samtools index Sample2.dedup.bam samtools index Sample3.dedup.bam samtools index Sample4.dedup.bam #identifying indels module load gatk/3.3.0 GenomeAnalysisTK \ -T RealignerTargetCreator \ -R PTC_Human.fasta \ -I Sample1.dedup.bam \ -o Sample1forIndelRealigner.intervals GenomeAnalysisTK \ -T IndelRealigner \ -R PTC_Human.fasta \ -I Sample1.dedup.bam \ -targetIntervals Sample1forIndelRealigner.intervals \ -o Sample1.dedup.realign.bam GenomeAnalysisTK \ -T RealignerTargetCreator \ -R PTC_Human.fasta \ -I Sample2.dedup.bam \ -o Sample2forIndelRealigner.intervals GenomeAnalysisTK \ -T IndelRealigner \ -R PTC_Human.fasta \ -I Sample2.dedup.bam \ -targetIntervals Sample2forIndelRealigner.intervals \ -o Sample2.dedup.realign.bam GenomeAnalysisTK \ -T RealignerTargetCreator \ -R PTC_Human.fasta \ -I Sample3.dedup.bam \ -o Sample3forIndelRealigner.intervals GenomeAnalysisTK \ -T IndelRealigner \ -R PTC_Human.fasta \ -I Sample3.dedup.bam \ -targetIntervals Sample3forIndelRealigner.intervals \ -o Sample3.dedup.realign.bam GenomeAnalysisTK \ -T RealignerTargetCreator \ -R PTC_Human.fasta \ -I Sample4.dedup.bam \ -o Sample4forIndelRealigner.intervals GenomeAnalysisTK \ -T IndelRealigner \ -R PTC_Human.fasta \ -I Sample4.dedup.bam \ -targetIntervals Sample4forIndelRealigner.intervals \ -o Sample4.dedup.realign.bam
In some cases there may be a need to clean the sam/bam file(s) (soft-trimming the coordinates). To do this use CleanSam in Picard tools. You may want to just do it to all to avoid the error in a workflow, but it may not be necessary.
picard CleanSam \ INPUT=Sample4.dedup.realign.bam \ OUTPUT=Sample4.clean.dedup.realign.bam
Now we merge the bam files and then sort and index them. If you cleaned the bam file, remember to use the cleaned ones.
picard MergeSamFiles \ INPUT=Sample1.dedup.realign.bam \ INPUT=Sample2.dedup.realign.bam \ INPUT=Sample3.dedup.realign.bam \ INPUT=Sample4.dedup.realign.bam \ OUTPUT=AllMerged.bam picard SortSam INPUT=AllMerged.bam OUTPUT=AllMerged.sorted.bam SORT_ORDER=coordinate samtools index AllMerged.sorted.bam
Finally !! run gatk
GenomeAnalysisTK -T UnifiedGenotyper \ -I AllMerged.sorted.bam \ -R PTC_Human.fasta \ --output_mode EMIT_VARIANTS_ONLY \ -ploidy 2 \ -glm SNP \ -o PTC_human.gatk.vcf
If you want to load the vcf file into IGV, remember to index it first.
module load igvtools igvtools index PTC_human.gatk.vcf
If you would like to generate a table of from the vcf file use the following command
GenomeAnalysisTK \ -R PTC_Human.fasta \ -T VariantsToTable \ -V PTC_human.gatk.vcf \ -F CHROM -F POS -F ID -F QUAL -F AC \ -GF GT -GF GQ \ -o PTC_human.gatk.vcf.table
In order to filter your vcf file based on quality measures, depth, and also statistical significance, you can use variant filter option in the gatk toolkit. Below is an example of suggested filters for data that has low coverage.
GenomeAnalysisTK \ -R PTC_Human.fasta \ -T VariantFiltration \ -o PTC_human.gatk.filter.vcf \ --variant PTC_human.gatk.vcf \ --filterExpression "QD<2.0||MQ<40.0||FS>60.0||HaplotypeScore>13.0" \ --filterName mannyfilter
Good descriptions of the different information on vcf files GATK Docs
Finally to save the SNPs that passed your filter, you simply use the selectvariant tool.
GenomeAnalysisTK \ -T SelectVariants \ --variant PTC_human.gatk.filter.vcf \ -o PTC_human.gatk.filter.only.vcf \ -ef \ -R PTC_Human.fasta