I just learned about the Base Quality Score Recalibration (BQSR) step, which seems to be really important for variant calling and whose results seem to depend heavily on the size of the variant database used (e.g. see this conference paper).
I'm wondering how I could run BQSR on my data, given that I'm not working with humans or any other organism that has a public database of SNPs or variants. Should I just generate a VCF file with a program such as GATK HaplotypeCaller and use it as the database, or are there any other "best practices" for this?
For example, if different species were sequenced with the same technology, would it be safe to construct a database using data from all of them, assuming that the sequencing errors in these are purely technical?
Thank you!
To the best of my knowledge, there is not yet a paper demonstrating that this step is truly necessary and/or beneficial, beyond what the Broad Institute recommends. Also, as you correctly note, you need a well-curated reference set of SNPs to run it properly. If you don't have one, you might simply skip the step altogether. See e.g. here: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1279-z#Abs1. I suggest you browse the literature for benchmarks of this method and then decide whether it is worth the additional effort.