Hi, I work in sunflower (H. annuus) transcriptomics. I'm trying to genotype a big amount of RNAseqs samples from wild accessions (coming from distant places).
I have used the reference genome for alignment and resulted a not-so-terrible alignment rate (mean=92%,min=81%,max=95%). Now, for the BQSR step I have a reference SNP database (a big .vcf file) made from commercial cultivars (like the reference genome). Considering I expect important differences between the genomes and the reference, Is it OK to use the "reference" SNPs? Should I ignore it and use the VariantCalling->BQSR->VariantCalling approach? Or maybe a mixed approach where I feed the reference .vcf AND the new .vcf to the BQSR?
What I have googled about reference bias in variant calling, is that the bias starts in the alignment step (of course) and trikcles down from there; but not mention of this.
Now, I did a little test of both alternatives in a single sample: I called variants, hard filtered and bqsr ('self_snpdb'), parallel to just bqsr with the reference snpdb.
Firstly, only 37% of variant positions in the 'self_snpdb' where present in the 'reference_snpdb'; which make sense given the population divergence.
Then, I used GATK AnalyzeCovariates to compare the bqsr_tables. Resulting in these graphs: ('before' corresponds to the reference_db bqsr table; 'after' corresponds to the self_snpdb bqsr table)
There seems to be a clear difference, but I really don't know what to interpret from here.
Thanks for your time! Andrés