Hi everyone, This is the first time that i am going to use gatk. So i have some questions related to realignment targets. I have paired end illumina sequenicng data of 8 species from the 3 lanes.
lane 1 in flowcell contains 8 species that generate R1 and R2 for each species those had been loaded on the lane 2, lane 3 as well. Then i merged R1 and R2 of each species from lane 1, 2 and 3 separately. For snp analysis, i am going to use this steps:
#Index reference
bwa index reference.fasta
#Sort reference
samtools faidx reference.fasta
#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict
#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate
#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt
#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam
#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list
#Indel realignment
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam
#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf
The resulting vcf file will contain your variant calls!
Then you can optionally filter the variants:
#Filter variants
~/bin/vcflib/bin/vcffilter -f ‘DP > 9’ -f ‘QUAL > 10’ raw.vcf > filtered.vcf
Or first split the raw.vcf file into SNPs and indels:
#Extract SNPs
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf
#Extract Indels
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf
When i come to RealignerTargetCreator then it show that SAM file doesn't have any read groups defined in the header.Then i found that i should run AddOrReplaceReadGroups but i did not get at which steps i should run this ?? and what does exactly it do. i read about that but did get too much.
So please anyone can tell me about that.
Thanks in Advance
Is there a reason you set ploidy to 1? Is there a reason you changed the
-stand_call_conf
and-stand_emit_conf
options?BTW, you don't need to realign around indels (see the GATK best practices).
Thanks For your reply; Can you please tell me that at which condition we should merge vcf file ?? because in many discussion i have read that ppl merged their bam files and also vcf file but i did not get when it will good. I have 8 different species after mapping, i have different 8 bam files. Now i want to check variants in all species so i will get 8 vcf files after running GATK. So, where will be the logic of merging bam and vcf files apply ??
Merging BAM files was used for the UnifiedGenotyper, which is no longer the best practice. You can merge VCF files whenever you need all of the samples together or when you start with gVCF and want to utilize all samples for the final calls.
Ohk, can you suggest me some post or blog or papers where they explain about these things ?? Thanks
The GATK website and forum should be your go-to resources.