10.4 years ago
dutro2
Does anyone know what the SGB field in SAMtools' mpileup output means? The header field ("Segregation based metric.") doesn't really shed much light on it either. Is there a paper or some other source that details what this value means and the theory behind it/how it is calculated?
Hello dutro2!
It appears that your post has been cross-posted to another site: the samtools mailing list. Just wait for one of the authors to reply (or look through the source code).
This is typically not recommended as it runs the risk of annoying people in both communities.
No worries, you've actually inspired me to look at the source code that generates it to try and figure out what it means (I have an idea, but I'm not sure yet). For reference, it was added by pd3 in April 2013 without a meaningfully related comment.
I'd be interested to know what your idea is. After some Googling, my guess was that it has something to do with Mendelian segregation. I come from a computing background, so if I'm completely off-base on this then please let me know :).
I also found this tool: https://github.com/nc6/segregationBias, but I haven't found anything in there resembling the code in calc_SegBias (plus the Haskell code is practically unreadable IMO).
That's probably not far off. It looks a lot like a log-likelihood calculation, though I can't seem to find what distribution or equation it's derived from. In short, it looks at how the variant reads are distributed over samples and gives higher (more positive) scores when they are all in one or a few samples. Since this is used in the context of multi-sample variant detection, the point is presumably to help find false-positive calls that are only present due to low-frequency sequencing errors spread over multiple samples. That's my guess at least, though it'd be good if Petr Danecek wrote back to your question on the samtools list, since he would be the person to actually know.
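To make that intuition concrete, here is a toy sketch of a score that behaves the way I described. To be clear, this is not the code in calc_SegBias and I'm not claiming it is the formula samtools uses. It assumes (my assumption, not anything from the source) that per-sample variant read counts are Poisson, and compares "reads spread evenly over all samples" against "reads concentrated in the samples that actually show the variant". The function name and the example counts are made up for illustration.

```python
import math

def segregation_score(variant_reads):
    """Toy log-likelihood ratio; NOT the samtools SGB formula.

    variant_reads: number of non-reference reads seen in each sample.
    Assumes Poisson-distributed counts (an illustrative assumption).
    """
    n = len(variant_reads)
    total = sum(variant_reads)
    if n == 0 or total == 0:
        return 0.0

    carriers = sum(1 for c in variant_reads if c > 0)
    p = total / n          # expected reads per sample if spread evenly
    q = total / carriers   # expected reads per carrier if concentrated

    score = 0.0
    for c in variant_reads:
        if c > 0:
            # log Poisson(c | q) - log Poisson(c | p); the c! terms cancel
            score += c * math.log(q / p) - q + p
        else:
            # non-carrier: probability 1 under "concentrated", exp(-p) under "even"
            score += p
    return score

spread = segregation_score([3, 3, 3, 3])         # reads in every sample
concentrated = segregation_score([12, 0, 0, 0])  # same total, one sample
print(spread, concentrated)
```

With these assumptions the sum collapses to total * log(n / carriers), so it is zero when every sample carries variant reads and grows as the same number of reads is packed into fewer samples. That matches the direction of the behaviour I saw in the code, but the real calculation may well differ in the details.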