Question

Meaning of SGB field in samtools mpileup output

10.4 years ago

dutro2 • 0

Does anyone know what the SGB field in SAMtools' mpileup output means? The header field ("Segregation based metric.") doesn't really shed much light on it either. Is there a paper or some other source that details what this value means and the theory behind it/how it is calculated?

SNP • 4.8k views

ADD COMMENT • link updated 10.4 years ago by pd3 ▴ 350 • written 10.4 years ago by dutro2 • 0

Hello dutro2!

It appears that your post has been cross-posted to another site: the samtools mailing list. Just wait for one of the authors to reply (or look through the source code).

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

Sorry about that. I wasn't sure how much overlap there was between the communities.

ADD REPLY • link 10.4 years ago by dutro2 • 0

No worries, you've actually inspired me to look at the source code that generates it to try to figure out what it means (I have an idea, but I'm not sure yet). For reference, it was added by pd3 in April 2013 without a comment that meaningfully explains it.

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

I'd be interested to know what your idea is. After some Googling, my guess was that it has something to do with Mendelian segregation. I come from a computing background, so if I'm completely off-base on this then please let me know :).

I also found this tool: https://github.com/nc6/segregationBias, but I haven't found anything in there resembling the code in calc_SegBias (plus the Haskell code is practically unreadable IMO).

ADD REPLY • link 10.4 years ago by dutro2 • 0

That's probably not far off. It looks a lot like a log-likelihood calculation, though I can't seem to find what distribution or equation it's derived from. In short, it looks at how the variant reads are distributed over samples and gives higher (more positive) scores in cases where they are all in one or a few samples. Since this is used in the context of multi-sample variant detection, its purpose is presumably to help flag false-positive calls that are present only because of low-frequency sequencing errors spread across multiple samples. That's my guess at least, though it'd be good if Petr Danecek wrote back to your question on the samtools list, since he would be the person to actually know.
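
To make the "concentrated versus spread out" idea concrete, here is a toy sketch in Python. It is not the samtools formula and not a port of calc_SegBias. It is just a generic Poisson log-likelihood ratio (half the G-statistic) that compares "every sample has its own variant-read rate" against "all samples share one rate". The function name and the example read counts are made up for illustration.

```python
import math

def concentration_llr(variant_reads):
    """Toy Poisson log-likelihood ratio: per-sample rates vs. one shared rate.

    NOT the SGB calculation; only an illustration of a score that grows
    when variant reads pile up in a few samples.
    """
    n = len(variant_reads)
    total = sum(variant_reads)
    if n == 0 or total == 0:
        return 0.0
    shared_rate = total / n  # expected variant reads per sample if spread evenly
    # Samples with zero reads contribute nothing (o * log(o) -> 0 as o -> 0).
    return sum(o * math.log(o / shared_rate) for o in variant_reads if o > 0)

# Hypothetical per-sample variant read counts, 12 reads in total each time.
even = concentration_llr([3, 3, 3, 3])           # spread over all samples
two_samples = concentration_llr([6, 6, 0, 0])    # confined to two samples
concentrated = concentration_llr([12, 0, 0, 0])  # confined to one sample
print(even, two_samples, concentrated)
```

With the same total number of variant reads, the score is zero when the reads are spread evenly and increases as they concentrate in fewer samples, which matches the qualitative behaviour described above. The real metric presumably also models genotypes and allele frequency, so its absolute values should not be expected to match this toy.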

ADD REPLY • link 10.4 years ago by Devon Ryan 104k

score 2 · Answer 1 · 2014-06-30

10.4 years ago

pd3 ▴ 350

Here are the math notes for the SGB calculation:

http://samtools.github.io/bcftools/
http://samtools.github.io/bcftools/rd-SegBias.pdf