Hi,
I work last few years on bioinformatics, but my knowledge is basically limited to the processing of amplicon and metagenomics NGS data. Recently, I started to work with genome assemblies and call variants but I lack the knowledge and experience.
So, I have a couple of genomes from different strains. I aligned these against the reference genome available with bwa-mem, convert the SAM to BAM and sorted the BAM files for each of the genomes that I have. Then, I want to call variants for each of the strain's genomes and then combine them in order to find the unique variants to each strain. What I want is to have a list of chromosome positions of SNPs that is unique for the strains 1, other that is unique for strain 2 and so on.
With this in mind, from what I read, I decided to use GATK to call variants. So until know, I realize that I need to use Picard to fix read groups, index the bam files (all together with
java -jar picard.jar AddOrReplaceReadGroups I=input.bam O=output.bam CREATE_INDEX=true RGID=$ RGSM=$RGSM RGLB=$RGLB RGPL=$RGPL RGPU=$RGPU #(the read groups parameters represent bash variables)
Then, I index and create a dictionary for the reference genome. After this I would like to call the variants with 'HaplotypeCaller' and I read here - https://software.broadinstitute.org/gatk/blog?id=7847 - that local realignment around indels was removed from the 'HaplotypeCaller' workflow, but still it is recommend to perform this step before 'HaplotypeCaller'.
So, I tried to run the 'RealignerTargetCreator' and 'IndelRealigner', the problem is that I don't have a vcf file with known indels for these strains.
So, now I'm running 'HaplotypeCaller' for each strain's genome:
gatk HaplotypeCaller -R $DB -I output.bam -O output.g.vcf -ERC GVCF # output.bam from the above picard command
Then, my intention is to merge all the variants in order to quantify the unique SNPs for each strain:
gatk GenotypeGVCFs -R $DB --variant ${1}.g.vcf --variant ${2}.g.vcf --variant ${3}.g.vcf --variant ${4}.g.vcf --variant ${5}.g.vcf --variant ${6}.g.vcf --variant ${7}.g.vcf --variant ${8}.g.vcf --variant ${9}.g.vcf --variant ${10}.g.vcf -o ${OUT}/merged.vcf
So, my question is: can I run 'HaplotypeCaller' in order to get the chromosome positions of indels and use these as reference for 'RealignerTargetCreator' and 'IndelRealigner'? Or I don't need them when I'm using the 'HaplotypeCaller' and, then, 'GenotypeGVCFs' and I still can get reliable chromosome positions for unique SNPs for each strain?
Thanks in advance!
Sincerely,
António
Thank you @Marks!
I read that to, but I found at this link - https://software.broadinstitute.org/gatk/blog?id=7847 - that we should still do the local realignment, and that's why I asked.
António
I agree it's confusing. That blog is very detailed but from my reading, the take home message is that HaplotypeCaller is much better at resolving indels than UnifiedGenotyper+IndelRealigner. In some cases though, such as low quality libraries, it might be beneficial to use indelrealigner alongside haplotypecaller.
Have a read of the blog again, it's very well written.