BQSR in GATK without known variants
2 days ago
fuwamozu • 0

Hi, to preface:

  • I am new to bioinformatics (not my field, just using bioinformatic tools to process some data)
  • I found some similar topics, but they did not go into enough depth to answer my question, which is why I am posting about this
  • This is my first time posting on this forum, so I am not very sure how things work: let me know if I have done something wrong.

I am trying to follow GATK's pipeline to process FASTQs into VCFs. I have run into a problem in the "Recalibrate Base Quality Scores" step: namely, that I do not have a VCF file of known variants.

I am using corvid data, the reference genome of which can be found here. I do not think the file I need exists, but the study the data comes from is here.

GATK documentation found here states that:

For non-human data it can be a little bit more work since you may need to bootstrap your own set of variants if there are no such resources already available for your organism, but it's worth it.

Here's how you would bootstrap a set of known variants:

First do an initial round of variant calling on your original, unrecalibrated data.

Then take the variants that you have the highest confidence in and use that set as the database of known variants by feeding it as a VCF file to the BaseRecalibrator.

Finally, do a real round of variant calling with the recalibrated data. These steps could be repeated several times until convergence.

Would the correct way to go about this be to

  • Run HaplotypeCaller
  • Run CombineGVCFs (because I have multiple samples)
  • Run GenotypeGVCFs
  • Filter variants using VQSR (or should I use SelectVariants + VariantFiltration instead?)
  • Run BaseRecalibrator
  • Apply BQSR
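
To make the list above concrete, here is a rough sketch of one bootstrap round written as a small Python helper that only builds and prints the GATK4 command lines (it does not run them). It assumes hard filtering with SelectVariants + VariantFiltration rather than VQSR, since VQSR needs training resource sets of its own. The file names, the `bootstrap_round` helper, and the filter thresholds are placeholders I made up for illustration, not values taken from the GATK documentation for this data set.

```python
import shlex

# Placeholder hard-filter expression (an assumption; tune for your own data).
HARD_FILTER = "QD < 2.0 || FS > 60.0 || MQ < 40.0"


def bootstrap_round(ref, bams, round_id):
    """Return GATK4 commands (as argv lists) for one bootstrap round.

    ref      -- reference FASTA (hypothetical path)
    bams     -- unrecalibrated BAMs, one per sample
    round_id -- integer used only to keep output names apart
    """
    cmds = []

    # 1. Per-sample calling in GVCF mode.
    gvcfs = []
    for bam in bams:
        gvcf = f"{bam}.r{round_id}.g.vcf.gz"
        gvcfs.append(gvcf)
        cmds.append(["gatk", "HaplotypeCaller", "-R", ref, "-I", bam,
                     "-O", gvcf, "-ERC", "GVCF"])

    # 2. Merge the per-sample GVCFs, then joint-genotype them.
    combined = f"cohort.r{round_id}.g.vcf.gz"
    combine = ["gatk", "CombineGVCFs", "-R", ref, "-O", combined]
    for gvcf in gvcfs:
        combine += ["-V", gvcf]
    cmds.append(combine)

    raw = f"cohort.r{round_id}.vcf.gz"
    cmds.append(["gatk", "GenotypeGVCFs", "-R", ref, "-V", combined, "-O", raw])

    # 3. Keep SNPs, flag low-confidence sites, then drop the flagged ones.
    snps = f"cohort.r{round_id}.snps.vcf.gz"
    cmds.append(["gatk", "SelectVariants", "-R", ref, "-V", raw,
                 "--select-type-to-include", "SNP", "-O", snps])
    flagged = f"cohort.r{round_id}.snps.flagged.vcf.gz"
    cmds.append(["gatk", "VariantFiltration", "-R", ref, "-V", snps,
                 "--filter-expression", HARD_FILTER,
                 "--filter-name", "hard_filter", "-O", flagged])
    known = f"known.r{round_id}.vcf.gz"
    cmds.append(["gatk", "SelectVariants", "-R", ref, "-V", flagged,
                 "--exclude-filtered", "-O", known])

    # 4. Recalibrate each BAM against the bootstrapped "known" sites.
    for bam in bams:
        table = f"{bam}.r{round_id}.recal.table"
        cmds.append(["gatk", "BaseRecalibrator", "-R", ref, "-I", bam,
                     "--known-sites", known, "-O", table])
        cmds.append(["gatk", "ApplyBQSR", "-R", ref, "-I", bam,
                     "--bqsr-recal-file", table,
                     "-O", f"{bam}.r{round_id}.recal.bam"])
    return cmds


if __name__ == "__main__":
    for cmd in bootstrap_round("ref.fa", ["s1.bam", "s2.bam"], 1):
        print(shlex.join(cmd))
```

The printed lines can be pasted into a shell script or handed to `subprocess.run` one at a time. Indels are left out here only to keep the sketch short.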

After this, would I just repeat from variant calling iteratively until convergence? I assume after this first time I do not need to combine any GVCFs (or maybe I am wrong).
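
I do not know what "convergence" is supposed to mean in practice here. One guess, and it is only my assumption rather than anything the GATK documentation states, is to compare the filtered call sets from two consecutive rounds and stop once they barely change. A minimal sketch of that comparison, using plain-text VCF lines and hypothetical helper names:

```python
def read_sites(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF lines, skipping headers."""
    sites = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        sites.add((chrom, int(pos), ref, alt))
    return sites


def jaccard(a, b):
    """Shared sites divided by all sites seen in either round."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Tiny made-up records, only to show the call pattern.
    round1 = ["#CHROM\tPOS\tID\tREF\tALT",
              "chr1\t100\t.\tA\tG",
              "chr1\t200\t.\tC\tT",
              "chr2\t50\t.\tG\tA"]
    round2 = ["#CHROM\tPOS\tID\tREF\tALT",
              "chr1\t100\t.\tA\tG",
              "chr1\t200\t.\tC\tT",
              "chr2\t75\t.\tT\tC"]
    print(jaccard(read_sites(round1), read_sites(round2)))
```

For real files the lines could come from `gzip.open(path, "rt")`. What overlap counts as "converged" is a judgment call that I have not seen a fixed threshold for.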

Sorry that I have a lot of questions, I just feel very lost. Thanks in advance.

14 hours ago

Don't bother with BQSR. I found that those steps don't really make a difference. In general, don't overdo your analysis and don't get bogged down in these early steps.

Get through all the steps quickly, then re-evaluate and rerun later on subsets to see if anything changes.

I would use bcftools. In my assessment it was faster, simpler and better than GATK.
