Question

How to perform local realignment around indels without a known vcf file of indels with GATK?

0

Entering edit mode

5.5 years ago

antonioggsousa 3.2k

Hi,

I work last few years on bioinformatics, but my knowledge is basically limited to the processing of amplicon and metagenomics NGS data. Recently, I started to work with genome assemblies and call variants but I lack the knowledge and experience.

So, I have a couple of genomes from different strains. I aligned these against the reference genome available with bwa-mem, convert the SAM to BAM and sorted the BAM files for each of the genomes that I have. Then, I want to call variants for each of the strain's genomes and then combine them in order to find the unique variants to each strain. What I want is to have a list of chromosome positions of SNPs that is unique for the strains 1, other that is unique for strain 2 and so on.

With this in mind, from what I read, I decided to use GATK to call variants. So until know, I realize that I need to use Picard to fix read groups, index the bam files (all together with

java -jar picard.jar AddOrReplaceReadGroups I=input.bam O=output.bam CREATE_INDEX=true RGID=$ RGSM=$RGSM RGLB=$RGLB RGPL=$RGPL RGPU=$RGPU #(the read groups parameters represent bash variables)

Then, I index and create a dictionary for the reference genome. After this I would like to call the variants with 'HaplotypeCaller' and I read here - https://software.broadinstitute.org/gatk/blog?id=7847 - that local realignment around indels was removed from the 'HaplotypeCaller' workflow, but still it is recommend to perform this step before 'HaplotypeCaller'.

So, I tried to run the 'RealignerTargetCreator' and 'IndelRealigner', the problem is that I don't have a vcf file with known indels for these strains.

So, now I'm running 'HaplotypeCaller' for each strain's genome:

gatk HaplotypeCaller -R $DB -I output.bam -O output.g.vcf -ERC GVCF # output.bam from the above picard command

Then, my intention is to merge all the variants in order to quantify the unique SNPs for each strain:

gatk GenotypeGVCFs -R $DB --variant ${1}.g.vcf --variant ${2}.g.vcf --variant ${3}.g.vcf --variant ${4}.g.vcf --variant ${5}.g.vcf --variant ${6}.g.vcf --variant ${7}.g.vcf --variant ${8}.g.vcf --variant ${9}.g.vcf --variant ${10}.g.vcf -o ${OUT}/merged.vcf

So, my question is: can I run 'HaplotypeCaller' in order to get the chromosome positions of indels and use these as reference for 'RealignerTargetCreator' and 'IndelRealigner'? Or I don't need them when I'm using the 'HaplotypeCaller' and, then, 'GenotypeGVCFs' and I still can get reliable chromosome positions for unique SNPs for each strain?

Thanks in advance!

Sincerely,

António

SNP next-gen • 4.2k views

ADD COMMENT • link updated 5.5 years ago by Mark ★ 1.6k • written 5.5 years ago by antonioggsousa 3.2k

score 6 · Accepted Answer · 2019-10-31

6

Entering edit mode

5.5 years ago

Mark ★ 1.6k

GATK introduced the variant caller HaplotypeCaller which handles the preprocessing like indelrealignment: https://software.broadinstitute.org/gatk/blog?id=7847

In other words, no you do not need to use those realignment tools in GATK4. The documentation reflects this:

Note that indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2. However it is still required when using legacy callers such as UnifiedGenotyper or the original MuTect.

P.S This is an overview of all steps required to run the GATK best practices variant calling pipeline: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11145

ADD COMMENT • link 5.5 years ago by Mark ★ 1.6k

0

Entering edit mode

Thank you @Marks!

I read that to, but I found at this link - https://software.broadinstitute.org/gatk/blog?id=7847 - that we should still do the local realignment, and that's why I asked.

António

ADD REPLY • link 5.5 years ago by antonioggsousa 3.2k

0

Entering edit mode

I agree it's confusing. That blog is very detailed but from my reading, the take home message is that HaplotypeCaller is much better at resolving indels than UnifiedGenotyper+IndelRealigner. In some cases though, such as low quality libraries, it might be beneficial to use indelrealigner alongside haplotypecaller.

Have a read of the blog again, it's very well written.

ADD REPLY • link 5.5 years ago by Mark ★ 1.6k