What Are The Advantages Of Calling Variants On Individuals, On Trios, And On Cohorts?
12.3 years ago

What are the advantages/disadvantages of calling variants simultaneously on individuals, on trios (mom, dad, child), and on cohorts of individuals?

vcf variant-calling • 6.8k views

I once answered a question about the advantages of multi-sample calling, but cannot find it now.


I would love to hear your answer.


I'd love to know more about this as well.

Here is a link to the discussion on GATK's forum about this topic that I was following a while back.

https://getsatisfaction.com/gsa/topics/multi_sample_vs_single_sample_snp_calling_using_unifiedgenotyper


I heard that in cases of low coverage, the remaining individuals will somehow inform or impute the calls made by UnifiedGenotyper (or other callers). However, I don't understand how this can be done accurately unless the caller understands the pedigree. Why would I want my alternate allele informing my wife's genotype?


Imagine the genealogical tree at SNP position A, relating a lot of people in your population, including you and your wife. This is some tree that gradually comes together to a common ancestor some time in the past. If this SNP is polymorphic in this group, the mutation must have happened since this ancestor. Imagine the mutation happens at some point in the past, maybe three "levels" up from you, such that you inherit this mutation. You and everyone else who is a "child of that mutation" will share that SNP allele, and no one else will. Now imagine what happens as we move along the chromosome. If we were asexual, that tree would be the same. But recombination comes in and starts shuffling, leading to tree changes. Of course, other mutations have also happened at other points in the tree during this time, so when a mutation happens it is associated with a set of SNP alleles (basically the alleles inside the person who has the mutation), and then over the years recombination splits them up. But recombination rates are not enormous, so it doesn't immediately randomise everything.

So what does this lead to? Alleles occur in blocks, consisting of chunks of sequence within which recombination has not happened since those mutations appeared. So there is a correlation between people's genotypes, and although you don't really know how the tree varies, you can approximate/estimate it.
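The correlation between alleles at nearby sites is usually summarised as linkage disequilibrium, commonly measured by r^2. A minimal sketch of that statistic, assuming phased haplotypes coded as 0/1; the function name `ld_r2` and the toy haplotypes are made up for illustration:

```python
def ld_r2(haplotypes, i, j):
    """Squared allelic correlation (r^2) between sites i and j.

    haplotypes: list of equal-length sequences of 0/1 alleles (phased).
    """
    n = len(haplotypes)
    p_i = sum(h[i] for h in haplotypes) / n           # allele frequency at site i
    p_j = sum(h[j] for h in haplotypes) / n           # allele frequency at site j
    p_ij = sum(h[i] * h[j] for h in haplotypes) / n   # frequency of the 1-1 haplotype
    d = p_ij - p_i * p_j                              # disequilibrium coefficient D
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Toy haplotypes (hypothetical): sites 0 and 1 sit in one block, site 2 has been
# shuffled away from them by recombination.
haps = [
    (1, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
    (0, 0, 1),
]

print(ld_r2(haps, 0, 1))  # sites inherited together
print(ld_r2(haps, 0, 2))  # sites decoupled by recombination
```

Sites inside the same block give r^2 close to 1, which is what lets one person's genotype carry information about another's at a neighbouring site.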

OK, that's my explanation/understanding, sorry it wasn't very concise.


Put imputation aside. That is a very different thing. For low-coverage SNP calling, you cannot call a SNP if you see only two supporting reads, but with many samples it is possible that the same SNP occurs in two samples. You then get four allelic reads and are more likely to call it.


You are right. But as the number of samples N goes up, the number of sites scales like log(N), while the number of errors scales like N. That's why I thought the ability to use imputation was an advantage of calling on many samples, to reduce the FDR. I guess you are saying the population allows you to sensitively call most SNPs, because most are shared between people, and you think that's the main advantage. (pause) yes - you're right!
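The log(N) versus N scaling can be written out under the standard neutral model, which is an assumption here since the thread does not name a model. With 2N sampled chromosomes and population-scaled mutation rate θ for the region, Watterson's result gives the expected number of segregating sites S. The symbols d (depth per sample), ε (per-base error rate) and L (number of sites) are introduced only for illustration, and γ is the Euler-Mascheroni constant:

```latex
\begin{align}
E[S] &= \theta \sum_{i=1}^{2N-1} \frac{1}{i} \;\approx\; \theta \bigl( \ln(2N) + \gamma \bigr), \\
E[\text{erroneous base calls}] &= N \, d \, \varepsilon \, L .
\end{align}
```

So true sites grow logarithmically in N while raw errors grow linearly, which is the concern raised above.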


For SNP calling, if you have 10,000 samples, we may not be able to call SNPs with allele count 2, but overall we still have much higher power to access low-frequency SNPs. Power increases faster than sequencing errors accumulate.

Most current SNP callers do not use LD information at all and I actually do not think they will benefit from LD. Most rare SNPs, except MNPs, are not in LD with nearby SNPs. LD contributes little in this case.

LD is immensely useful for genotyping for sure, but I guess we are talking about SNP calling/site discovery right now.


This has been instructive! Thanks Heng.


Clarification on power vs. errors. Errors occur uniformly across samples. It is very unlikely to see two errors arising in one sample. On the other hand, once you have one additional sample possessing the SNP allele, you will see 2 supporting reads on average in that sample, very different from the behavior of errors. This is one of the key signals a multi-sample caller uses to distinguish errors from true SNPs.

This is also why multi-sample calling has much higher power than pooled calling, where you do not know which sample each read comes from. For pooled analysis, errors go up faster than power at very large sample sizes and ultimately overwhelm the power.
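A rough sketch of the pooled versus multi-sample contrast at a single non-variant site. All numbers (100 samples, depth 4, a 1% chance that a read shows the same wrong allele) and the "at least 2 supporting reads" criterion are illustrative assumptions, not how any particular caller decides:

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


# Illustrative assumptions only
N_SAMPLES = 100   # samples sequenced
DEPTH = 4         # reads per sample covering the site
ERR = 0.01        # chance that one read shows the same wrong allele
MIN_READS = 2     # supporting reads needed before the site looks variant

# Pooled: sample labels are lost, so any 2 error reads among all reads count.
pooled_false = prob_at_least(MIN_READS, N_SAMPLES * DEPTH, ERR)

# Multi-sample: both error reads must land in the same sample.
per_sample = prob_at_least(MIN_READS, DEPTH, ERR)
multi_false = 1.0 - (1.0 - per_sample) ** N_SAMPLES

print(f"pooled: {pooled_false:.3f}   multi-sample: {multi_false:.3f}")
```

With these inputs, errors alone satisfy the pooled criterion at roughly 90% of sites but the per-sample criterion at roughly 6%, and the gap widens as the number of samples grows.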


OK, your clarification has somewhat confused me. The stuff about pooled versus non-pooled is fine. And I quite agree you can exploit the different stats of errors/polymorphisms. However this sentence "once you have one additional sample possessing the SNP allele, you will see 2 supporting reads in average, very different from the behavior of errors" doesn't make much sense to me. Why is that very different from the behaviour of errors? The expected number of errors (irrespective of whether the error is identifiable) might be below 2, equal to 2, or above 2, depending on the total depth of coverage (across samples) at that site. I tend to think of this in terms of allele balance; I completely agree that SNPs and errors are distinguishable above some minimum allele frequency, but below that, I don't see how the statistics of SNPs and errors are distinguishable. What am I missing/misinterpreting?


I have modified the original reply a little bit. Given 4-fold coverage, an error rarely occurs in the same sample twice, but a true SNP often does. There is indeed a frequency threshold below which errors and SNPs are not distinguishable, but the threshold is lowered given an increased sample size.
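To put numbers on the 4-fold case, here is a small sketch comparing how many reads support the allele within one sample when only errors are present versus when the sample is a heterozygous carrier. The 1% error rate and the binomial read model are assumptions chosen for illustration:

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


DEPTH = 4    # 4-fold coverage, as in the reply above
ERR = 0.01   # assumed chance that a read shows the wrong allele (illustrative)
HET = 0.5    # a heterozygous carrier shows the SNP allele in about half its reads

print("alt reads  P(error only)  P(het carrier)")
for k in range(DEPTH + 1):
    print(f"{k:9d}  {binom_pmf(k, DEPTH, ERR):13.6f}  {binom_pmf(k, DEPTH, HET):14.4f}")

# Chance of at least two supporting reads within a single sample
p_err_two_plus = sum(binom_pmf(k, DEPTH, ERR) for k in range(2, DEPTH + 1))
p_het_two_plus = sum(binom_pmf(k, DEPTH, HET) for k in range(2, DEPTH + 1))
expected_het_reads = DEPTH * HET  # the "2 supporting reads on average"
```

Under these assumptions a carrier shows two or more supporting reads about 69% of the time, while errors alone do so in well under 0.1% of samples.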


OK! I get your point now, thanks again :-)

12.3 years ago
JC 13k

The simple answer: using pedigree information can help to reduce the errors produced by the sequencing.

With a trio, you can easily identify Mendelian inheritance errors (MIEs); with quartets (both parents and two offspring) you can phase the genome. For "rare" human diseases, pedigree analysis increases the power to detect the variants.
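As an illustration of the MIE check, here is a minimal sketch for one biallelic autosomal site. The function name `is_mendelian_consistent` and the tuple genotype encoding are hypothetical, and the check ignores de novo mutations, sex chromosomes and missing genotypes:

```python
from itertools import product


def is_mendelian_consistent(mother, father, child):
    """Return True if the child's unphased genotype can be formed by taking
    one allele from the mother and one from the father.

    Genotypes are 2-tuples of allele codes (0 = reference, 1 = alternate).
    """
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) in possible


# Both parents homozygous reference, child heterozygous: flagged as an MIE
print(is_mendelian_consistent((0, 0), (0, 0), (0, 1)))

# Mother heterozygous, father homozygous reference, child heterozygous: consistent
print(is_mendelian_consistent((0, 1), (0, 0), (0, 1)))
```

A flagged site is either a genotyping error in one of the three samples or, much more rarely, a real de novo event, so this kind of filter works on calls that have already been made.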

More information: http://www.ncbi.nlm.nih.gov/pubmed/20220176 http://www.ncbi.nlm.nih.gov/pubmed/21855840

*Edit: following lh3's comments, note that those references apply pedigree information after variant calling.


The GATK UnifiedGenotyper doesn't accept a pedigree file, so I don't think it can infer the relationship between the samples while doing the variant calling.


I'm not talking about GATK, and FYK, you can integrate trio information to it: http://www.broadinstitute.org/gsa/wiki/index.php/Pedigree_Analysis_Using_the_GATK


This is interesting, I did not know you can supply the pedigree file to the genotyper itself other than the phasing walkers.


From that page, GATK does not use the pedigree information. It just throws away offspring. That is very different from joint analysis. Also, Roach et al. is really the opposite example: Complete Genomics does not use pedigree information and therefore always severely underperforms in pedigree analysis. The 1000g analysis is on the right track.


GATK throws away offspring using the pedigree information; I'm not saying it computes a joint analysis. In the Roach et al. paper, the pedigree information is incorporated after the CG variant calling to remove errors. In a strict sense you are right: none of the methods use the pedigree information in the variant calling phase. Thanks for the clarification.


For multi-sample analysis, it is essential to jointly analyze the data right from the beginning; otherwise you will lose power or get spurious results. This is the weakest point of CG at the moment. I have heard of several groups having this exact problem with CG data.
