Population / Reference bias in GATK BQSR BaseRecalibrator?
0
0
Entering edit mode
5.0 years ago

Hi, I work in sunflower (H. annuus) transcriptomics. I'm trying to genotype a big amount of RNAseqs samples from wild accessions (coming from distant places).

I have used the reference genome for alignment and resulted a not-so-terrible alignment rate (mean=92%,min=81%,max=95%). Now, for the BQSR step I have a reference SNP database (a big .vcf file) made from commercial cultivars (like the reference genome). Considering I expect important differences between the genomes and the reference, Is it OK to use the "reference" SNPs? Should I ignore it and use the VariantCalling->BQSR->VariantCalling approach? Or maybe a mixed approach where I feed the reference .vcf AND the new .vcf to the BQSR?

What I have googled about reference bias in variant calling, is that the bias starts in the alignment step (of course) and trikcles down from there; but not mention of this.

Now, I did a little test of both alternatives in a single sample: I called variants, hard filtered and bqsr ('self_snpdb'), parallel to just bqsr with the reference snpdb.

Firstly, only 37% of variant positions in the 'self_snpdb' where present in the 'reference_snpdb'; which make sense given the population divergence.

Then, I used GATK AnalyzeCovariates to compare the bqsr_tables. Resulting in these graphs: ('before' corresponds to the reference_db bqsr table; 'after' corresponds to the self_snpdb bqsr table)

There seems to be a clear difference, but I really don't know what to interpret from here.

1° set of graphs

2° set of graphs

Thanks for your time! Andrés

GATK BQSR population bias non model organism • 1.1k views
ADD COMMENT

Login before adding your answer.

Traffic: 1872 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6