Hi all! I am following the GATK Best Practices tutorial to clean up a DNA-seq dataset from a non-model organism (whole genome of a single individual). Everything was going fine until I reached the base quality score recalibration (BQSR) step.
If a trustworthy SNP database isn't available yet (which is my case), this is what GATK recommends: you can bootstrap a database of known SNPs. Here's how it works:
1. First, do an initial round of SNP calling on your original, unrecalibrated data.
2. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
3. Finally, do a real round of SNP calling with the recalibrated data.
These steps can be repeated several times until convergence.
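For the second step, one way to picture "take the SNPs that you have the highest confidence in" is a hard filter on the first-round VCF. Below is a minimal sketch in plain Python, not an official GATK procedure. The function names and every threshold (`min_qual`, `min_qd`, `min_dp`, `max_dp`) are illustrative assumptions and would need tuning to your coverage and organism. It also assumes the caller wrote `DP` and `QD` into the INFO column.

```python
BASES = {"A", "C", "G", "T"}


def parse_info(info_field):
    """Turn an INFO string such as 'DP=35;QD=28.1;DB' into a dict."""
    info = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
        elif item and item != ".":
            info[item] = True  # flag without a value
    return info


def is_high_confidence_snp(line, min_qual=100.0, min_qd=10.0,
                           min_dp=10, max_dp=100):
    """Return True for a biallelic SNP record that passes all thresholds.

    All thresholds are hypothetical defaults, not GATK recommendations.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False
    ref, alt, qual, filt = fields[3], fields[4], fields[5], fields[6]
    # Keep single-base REF and single-base ALT only (drops indels, multi-allelic sites).
    if ref.upper() not in BASES or alt.upper() not in BASES:
        return False
    # Drop records that an earlier filtering step already flagged.
    if filt not in ("PASS", "."):
        return False
    info = parse_info(fields[7])
    try:
        return (float(qual) >= min_qual
                and float(info["QD"]) >= min_qd
                and min_dp <= int(info["DP"]) <= max_dp)
    except (KeyError, ValueError):
        return False  # missing or non-numeric annotation


def select_known_sites(vcf_lines, **thresholds):
    """Yield header lines unchanged plus the records that pass the filter."""
    for line in vcf_lines:
        if line.startswith("#") or is_high_confidence_snp(line, **thresholds):
            yield line


# Tiny made-up example: only the record at position 100 should survive.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t500.0\t.\tDP=40;QD=25.0\n",
    "chr1\t200\t.\tA\tG\t35.0\t.\tDP=40;QD=3.0\n",     # low QUAL and QD
    "chr1\t300\t.\tAT\tA\t900.0\t.\tDP=40;QD=30.0\n",  # indel, not a SNP
    "chr1\t400\t.\tC\tT\t800.0\t.\tDP=400;QD=28.0\n",  # suspiciously deep
]
kept = list(select_known_sites(example))
print("".join(kept), end="")
```

The filtered file would then be the VCF handed to the recalibrator as the known-sites set (in GATK4 that is the `--known-sites` argument of BaseRecalibrator). The upper depth cap is there because very deep sites are often collapsed repeats rather than real variants.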
Can anyone provide further details on these steps (the second step in particular)?
Kind regards. Luciano
Which platform was the sequencing done on? Several publications in recent years (and comments here on Biostars) state that BQSR has negligible or no beneficial effect, since today's sequencing platforms produce very trustworthy base quality scores.
That does not surprise me; I have also found recalibration has little effect on variant calling in most cases.
I'd disagree with this, though. Illumina, in particular, tends to have extremely inaccurate quality scores on some platforms.
OK, that is new to me. Are there any evaluations of base quality accuracy available for the different Illumina platforms?
Here's an example from an early run on our NextSeq, compared to data from one of our HiSeq 2500s:
http://seqanswers.com/forums/showpost.php?p=156399&postcount=18
I've generated similar data for MiSeq, newer NextSeq runs (which are better than older ones), and NovaSeq, but they are kind of scattered around and I don't remember where they all are.
This is a link to our first NovaSeq run results, which have absurdly bad quality accuracy. That run also had an illumination failure, which excuses its low quality, but NOT the quality accuracy. However, a subsequent run did not have an illumination failure, and its quality accuracy was extremely good (aside from the fact that it still only has 3 quality scores).
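To make "quality accuracy" concrete: a reported Phred score Q claims an error probability of 10^(-Q/10), and it is accurate if the mismatch rate actually observed among bases carrying that score is about the same. The sketch below is only an illustration of that comparison. The function names, the add-one smoothing, and the counts are all assumptions made up for the example, not measurements from any run mentioned in this thread.

```python
import math


def phred(error_rate):
    """Convert an error probability into a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(mismatches, observations):
    """Phred quality implied by observed mismatches at non-variant sites.

    The +1/+2 smoothing is an assumption of this sketch; it keeps the result
    finite when no mismatches are seen.
    """
    return phred((mismatches + 1) / (observations + 2))


# Hypothetical counts: bases reported as Q30, with mismatches against the
# reference counted at positions not believed to be real variants.
reported_q = 30
observed_q = empirical_quality(mismatches=10_000, observations=1_000_000)
print(f"reported Q{reported_q}, empirical Q{observed_q:.1f}")
```

With these made-up counts the empirical quality comes out far below the reported one, which is the kind of gap recalibration is meant to correct. It is also why the known-sites set matters: real variants missing from it would be counted as mismatches.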
Thanks for your reply, ATPoint. Sequencing was performed on a HiSeq 2500 with v4 sequencing chemistry. I was aware of the discussion about how much the base recalibration (BQSR) step really improves a dataset. However, I am still not sure how to generate a trustworthy VCF file that would help me distinguish real SNPs from sequencing errors. I would appreciate a citation for any paper discussing this issue. Thanks again!