Strategy in WGS variant calling using GATK (CombineGVCF step)
1
0
Entering edit mode
6.6 years ago
prasundutta87 ▴ 670

Hello,

It is a known fact that GATK does not scale or work very well when a draft assembly has a lot of contigs. Mine has 366989 (water buffalo reference genome). I have successfully generated GVCFs of all my 81 samples for the whole genome through Haplotypecaller without any errors. It usually fails or freezes in the CombineGVCF stage. I have thought of a strategy and would like to know if there may be any pitfalls or any important point I need to consider.

Since, I am only interested in my 'genes of interest', and genes are generally not present in all my contigs, I select only those lines from my GVCF files that belong to those contigs and leave the rest. It reduced my total no. of contigs to 4338. (Used this command to find that- cat GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.revised.gtf|cut -f 1|uniq|sort >contigs_with_genes.txt)

I then split the 4338 contigs to sets of 23 contigs (because GATK has been made keeping humans in mind), split each GVCF files into 23 smaller GVCF files and then run CombineGVCF on my 81 samples.

I will end up with 23 combined GVCF files and I can run GenotypeGVCF on them to get final VCF file which I can merge vertically to get my final VCF file.

PS. I cannot run GenomicsDBimport (which is GATK's current recommendation in place of CombineGVCF) because of the caveat that it can run on one contig at a time and even though I have access to a grid engine, my calculation suggested it will take years to get over because of my number of contigs.

snp variant calling GATK WGS • 5.3k views
ADD COMMENT
0
Entering edit mode

Hello,

I also research the buffalo (River buffalo) whole genome sequencing data. I have 50 sample whole genome sequencing and now I'm Recalibration stage from SNP Calling. But, as you know Buffalo do not have a known sites VCF file, Right? So, i have long time remained at this stage and i have not found a satisfactory solution. But today i was happy to see your post. Please help me to solve this problem, if possible.

Best Regard

Mostafa

ADD REPLY
0
Entering edit mode

You said that, generat GVCFs from all samples for the whole genome through Haplotypecaller without any errors, Right? Ok, But did you do the Recalibration before doing Haplotypecaller?

ADD REPLY
1
Entering edit mode

You are right. Water buffalo does not have a known sites VCF file. I have avoided this step.

This document may help:https://software.broadinstitute.org/gatk/best-practices/workflow (especially What is not GATK Best Practices?)

The main point here is you may report that you have 'adapted' the GATK germline variant calling pipeline and have removed certain steps due to lack of truth set.Another way is the bootstrapping procedure if you have a lot of time in hand. It is mentioned in this link: https://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr

The main point here is:

However, all is not lost if you are willing to experiment a bit. You can bootstrap a database of known SNPs. Here's how it works: First do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence.

ADD REPLY
0
Entering edit mode

many thanks for your reply,

But I do not have enough time because I am the last semester student and it's a very important time for me. On the other hand, there have been people who have already gone through this path and have concluded that this method does not work (http://evodify.com/gatk-the-best-practice-for-genotype-calling-in-a-non-model-organism/). So, I think we should remove the Recalibration step.

But my request is for you to support me in the next steps, if possible. I also want to use the Haplotypecaller option like you. So, I want to know what your command line was for this stage?

ADD REPLY
0
Entering edit mode

I used the default:

java -Xmx8g -jar HaplotypeCaller -R <your_reference_genome> --native-pair-hmm-threads 4 -I <your_bam_file> -ERC GVCF -O output.g.vcf

ADD REPLY
0
Entering edit mode

What is the option --native-pair-hmm-threads 4 ??

According to this command line, you've done the HaplotypeCaller for a sample, right? And then you combined the generated GVCFs files, right?

ADD REPLY
0
Entering edit mode

Yes..I did that only..

--native-pair-hmm-threads allows you to define no. of threads that will be utilised to compute the pairHMM step of haplotypecaller...

For more information, check this: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.3.0/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php

ADD REPLY
3
Entering edit mode
6.6 years ago
vdauwera ★ 1.2k

A simpler way of using a list of contigs or intervals to restrict analysis to areas of interest is to provide the list through the -L argument at the HaplotypeCaller stage (which will also take a lot less time to run). If you already have generated GVCFs then you can instead use the list of intervals in your CombineGVCFs command.

When you combine contigs in subset files, you don't have to make exactly 23, just be within the order of magnitude.

ADD COMMENT
0
Entering edit mode

Thanks a lot Geraldine. It does make sense. The whole selectvariants (splitting gvcfs) stage can be avoided then.. I split the GVCFs because I thought smaller gvcfs will make the job faster..

ADD REPLY
0
Entering edit mode

Hi prasundutta,

As i have already said, I also study the genome of Buffalo. i have a problem with the Haplotypecaller stage. my problem is that after running the command line, Long time remains in run stage? (Please see the photo in the attachment).

I want to know that, given the fact that our work is the same, did you have a long time to run for you at this stage??

code i run:

java -jar /home/m.rafiepour222/GenomeAnalysisTK-3.8-1-0-gf15c1c3ef/GenomeAnalysisTK.jar -R /home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa   -T HaplotypeCaller   -nct 30   -I /home/m.rafiepour222/BBKHU01_F/BBKHU01_F.sort.rmdup.bam   --emitRefConfidence GVCF   -o /home/m.rafiepour222/BBKHU01_F/BBKHU01_F.raw.snps.indels.g.vcf

piece of running:

enter image description here

ADD REPLY
0
Entering edit mode

But when i change the GATK version (Latest version, GATK-4.0.4.0), run is done (Please see the photo in the attachment):

enter image description here

But as seen in the photo, there seems to be a problem?? and that's about these phrases (Annotation will not be calculated, genotype is not called or alleleLikelihoodMap is null). Which is seen in a large number at runtime. As you see in the photo.

So, first, I want to know which version of GATK did you use? And the second, Did you encounter these phrases?

code i run:

./gatk HaplotypeCaller -R /home/m.rafiepour222/GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.fa --native-pair-hmm-threads 4 -I /home/m.rafiepour222/2_BBKHU02_M/2_BBKHU02_M.sort.rmdup.bam -O /home/m.rafiepour222/2_BBKHU02_M/2_BBKHU02_M_variants.g.vcf -ERC GVCF
ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Excuse me, but I did not understand, and I do not know whether my command line is working now or not? Because i feel like it's being annotated, not variant calling?? Respectfully, i ask you again to send me an image of your run to create a GVCF file to compare and see if my work is correct or not, please.

ADD REPLY
0
Entering edit mode

They are just warnings..Haplotypecaller did not find any reads in that position and hence genotype likelihhod could not be calculated. vdauwera can correct me here..

ADD REPLY

Login before adding your answer.

Traffic: 2686 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6