7.5 years ago
CY
Some tools (FreeBayes, LoFreq, VarScan...) claim to be good at detecting low-frequency variants. Can anyone share some insight into what makes them good at this? I'd really appreciate it!
Part of it is how well they model and compensate for platform-specific biases and problems, which cause artifacts that look like low-frequency variants. Another part is simply not hard-coding requirements and assumptions about ploidy and variant frequency distribution. I've been told that BBMap's callvariants.sh does a good job with low-frequency variants, though I didn't really design it for that purpose; I just designed it to be flexible with regard to allele frequency.
Thank you for the answer. I do a lot of variant calling and often struggle with the choice of caller. Would you mind sharing some more details?
What do you mean by "Another part is simply not hard-coding requirements and assumptions about ploidy and variant frequency distribution"?
Also, I have found that most callers simply assume a distribution for sequencing errors and then make a judgement based on the number of supporting reads and a predefined error rate. There must be some room for improvement in other respects, right?
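To make the approach described in this question concrete, here is a minimal sketch of that kind of test, assuming errors at a site are independent and follow a binomial distribution with one fixed error rate. The function name and the example numbers are made up for illustration; this is not taken from any particular caller.

```python
from math import comb

def error_only_pvalue(alt_reads: int, depth: int, error_rate: float) -> float:
    """Probability of seeing at least `alt_reads` supporting reads out of
    `depth` if every one of them were a sequencing error.

    Assumption: errors are independent and binomially distributed with a
    single predefined `error_rate`.
    """
    # Sum the probabilities of seeing fewer than alt_reads errors,
    # then take the complement to get the upper tail.
    below = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(alt_reads)
    )
    return max(0.0, 1.0 - below)

# Hypothetical site: 5 supporting reads at 100x depth, 1% assumed error rate.
print(error_only_pvalue(5, 100, 0.01))
```

A test like this only looks at counts, so it cannot tell a real 5% variant from an artifact that happens to show up in 5% of reads, which is where modelling platform-specific bias comes in.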
I have not delved into the code of any variant caller other than my own (this is the third one I've written). But the first one I wrote did have hard-coded assumptions that the goal was calling germline variants on the (diploid) human reference, except for chrM, and X/Y in males, which were haploid. That was simply because it was the kind of data I was working with at the time. It turned out to be not very flexible for low-frequency variant analysis, or for variant calling in other contexts. It was fundamentally incapable of handling a tetraploid, for example, since it was written to calculate the probabilities assuming a distribution of events from exactly 2 alleles (on autosomes).
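As a rough illustration of why a fixed ploidy matters, the sketch below computes simple binomial genotype likelihoods for an arbitrary ploidy. It is an assumption-laden toy (one fixed error rate, independent reads, a hypothetical function name), not the code of any of the callers discussed here.

```python
from math import comb

def genotype_likelihoods(alt_reads, depth, ploidy, error_rate=0.01):
    """Likelihood of the observed counts for each possible number of
    alternate allele copies (0..ploidy).

    Assumptions: reads are independent, and a single fixed error_rate
    flips a read between reference and alternate.
    """
    likelihoods = []
    for alt_copies in range(ploidy + 1):
        true_fraction = alt_copies / ploidy
        # Expected fraction of alt reads after allowing for errors.
        expected = true_fraction * (1 - error_rate) + (1 - true_fraction) * error_rate
        likelihoods.append(
            comb(depth, alt_reads)
            * expected**alt_reads
            * (1 - expected) ** (depth - alt_reads)
        )
    return likelihoods

# Hypothetical site: 25 alt reads out of 100.
diploid = genotype_likelihoods(25, 100, ploidy=2)
tetraploid = genotype_likelihoods(25, 100, ploidy=4)
print(max(diploid), max(tetraploid))
```

With exactly 2 alleles the only expected fractions are about 0%, 50% and 100%, so a 25% site fits every genotype poorly. With ploidy 4 the one-copy genotype expects roughly 25% and fits well.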
The current variant caller allows you to set whatever ploidy you want, and to adjust the probability by setting a baseline rarity for the allele frequency of interest. Thus if you tell it you're looking for 5% alleles, it does not penalize the quality score of variants simply for appearing at a 5% rate (but it does if they are below that). Instead they are only penalized for other factors, like mapq, base quality, distance from read tips, strand bias, etc.
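The baseline-rarity idea can be sketched roughly as follows. The linear scaling and both function names are assumptions chosen for illustration only; the actual scoring in callvariants.sh is not described in this thread and may differ.

```python
def rarity_factor(allele_fraction: float, rarity: float) -> float:
    """Multiplier in (0, 1] applied to a variant's score.

    Hypothetical rule: no penalty at or above the baseline rarity,
    and a linear penalty below it.
    """
    if rarity <= 0:
        raise ValueError("rarity must be positive")
    return min(1.0, allele_fraction / rarity)

def adjusted_score(base_score: float, allele_fraction: float, rarity: float) -> float:
    """Scale a score that already reflects other factors (mapq, base
    quality, strand bias, ...) by the rarity factor."""
    return base_score * rarity_factor(allele_fraction, rarity)

# Looking for 5% alleles: a 5% variant keeps its score, a 1% variant does not.
print(adjusted_score(30.0, 0.05, rarity=0.05))
print(adjusted_score(30.0, 0.01, rarity=0.05))
```

The point of the design is that frequency alone stops counting against a variant once it reaches the level you said you care about; everything else still has to hold up.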