Calculating Allele Frequencies From Genotype Data
14.0 years ago
Andrea_Bio ★ 2.9k

Hello

I am looking at genotype data for pooled DNA from 10 individuals. The data is in diBayes format and, for each SNP, gives the coverage at that position, the number of reads carrying the reference allele, and the number of reads carrying any other allele.

As an example, one SNP had 7x coverage: the minor allele was seen in 1 of the 7 reads and the major allele in the other 6.

How do you work out the minor allele frequency for this population? There are 20 chromosomes present but we've seen at most 35% of them (7 out of 20), so we can't simply say the MAF is 14% (1/7).

Other SNPs will have higher or lower coverage, and I presume you need to account for that somehow too.

I would ultimately like to create a simple allele frequency spectrum. I've had a look for some information on this, but all the papers I have seen are way too complicated for what I need. Can anyone recommend a basic introduction to this analysis?

thanks a lot

allele frequency • 25k views

Did you ever solve this problem properly? I ran into the same question recently: how to convert the SNP read frequencies in pooled data into allele frequencies. Thank you!

14.0 years ago

if your intention is to do population statistics, you will have to work not at the read level (coverage) but at the sample level. the MAF is the frequency of whichever allele is seen less often across the sampled chromosomes, and that has nothing to do with coverage. in fact coverage only helps you with the SNP calling; once the SNPs are called, its job is done.

there aren't many meaningful statistics you can do with only 10 samples, but you can try the following measurements: allele frequency (this is self-explanatory), heterozygosity (for each SNP, heterozygotes / (heterozygotes + homozygotes)), or even local inbreeding (Fs). you won't be able to calculate other population statistics such as Fst or In, because these measure distances between populations, not within a population.
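
To make the sample-level measurements above concrete, here is a minimal Python sketch. It assumes you already have called diploid genotypes for each individual at one biallelic SNP; the function name `allele_stats` and the genotype list are made up for illustration and are not taken from any particular tool.

```python
from collections import Counter

def allele_stats(genotypes):
    """Return (allele frequencies, MAF, observed heterozygosity) for one
    biallelic SNP, given diploid genotypes written as two letters, e.g. "AC"."""
    alleles = Counter()
    heterozygotes = 0
    for genotype in genotypes:
        first, second = genotype[0], genotype[1]
        alleles[first] += 1
        alleles[second] += 1
        if first != second:
            heterozygotes += 1
    total = sum(alleles.values())  # two chromosomes per individual
    freqs = {allele: count / total for allele, count in alleles.items()}
    # If only one allele was observed, the site is monomorphic in this sample
    maf = min(freqs.values()) if len(freqs) > 1 else 0.0
    het = heterozygotes / len(genotypes)  # heterozygotes / all genotyped samples
    return freqs, maf, het

# Hypothetical genotypes for 10 individuals at one A/C SNP
genotypes = ["AA", "AA", "AC", "AA", "AC", "CC", "AA", "AA", "AC", "AA"]
freqs, maf, het = allele_stats(genotypes)
print(freqs, maf, het)
```

Note that nothing here depends on read depth: each individual contributes exactly two chromosomes, however many reads covered the site.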

I cannot think of better reading than the basic population genetics textbooks (such as "Principles of Population Genetics", Hartl 1997, Sinauer Associates, or "Population Genetics, a concise guide", Gillespie 1998, Johns Hopkins University Press), but for understanding F-statistics I've always recommended following this worked example by Dr. David McDonald.


Please note I only have one sample. The sample was pooled DNA from 10 individuals and should contain 20 chromosomes, but there is only one set of sequencing data. I was going to do some exploratory statistics to learn about the field and my data at the same time. I thought I would investigate the variance of the allele frequencies along the chromosomes to look for areas that deviate significantly from the norm, but I didn't know the correct way to calculate the allele frequency.


Remembering that it is pooled DNA and I only have one sample, can you still calculate the frequencies at the coverage level (e.g. 14% and 25%)?


sorry, I didn't realize you were talking about pooled DNA. I have opened a new answer to share both my lack of knowledge on dealing with NGS pooled DNA and my scepticism about the viability of using NGS read counting for such a task, really hoping that some other reader may shed more light on this issue.


Assuming this isn't pooled DNA, I'm still not sure I follow. If you had a minor allele that didn't appear very often and you had 10 different samples, then presumably the minor allele would always appear less often than the major allele and you would get a MAF of 0%. Would you have to have lots of samples to find a population where the minor allele appeared more often than the major allele?


first, bear in mind that if you work with SNPs you are expecting MAF values > 1%, so to detect such alleles you should ideally genotype hundreds of samples. and by definition, MAF is the frequency of the minor allele in a population, so if you work with a population where the allele that is minor elsewhere turns out to be the major one, then the minor allele is the other one. minor alleles are estimates that depend on the population, so unless you know your population very well you can't really tell which one is the minor allele before genotyping.


one of the main problems with population genetics and large databases is that they tend to summarize a lot of data, so when they report a MAF value people tend to think that the corresponding allele will be the minor one in their own population, but that isn't always the case: an A/C SNP may have the C allele at 30% in Europeans, making C the minor allele, but at 70% in Africans, making A the minor allele there. so even if the MAF value of 0.3 is the same for both, it refers to different alleles. in population genetics statistics you ALWAYS have to double check which population is being described.


Thanks for your comments. So in summary, I was right to say that you have to sample hundreds or thousands of samples to get the MAF at the sample level? Do you happen to understand how Larry Parnell got the figures of 25% and 14% in the example above? I can't see how, and it's bugging me a lot now.


also, is there anything useful you can do with figures at the read level?


if you have an A/C SNP at 70%/30% in your population, you would expect (by statistical probability) to see the C allele in about 3 of your 10 samples. as MAF is a population genetics index, a few hundred samples should be used in order to make certain assumptions. regarding the point Larry wanted to make, he tried to illustrate how 6 counts of the major allele versus 1 count of the minor may represent a MAF of 1/4 (if the 6 counts represent only 3 chromosomes) or 1/7 (if the 6 counts come from 6 different chromosomes). my point is that it's really hard to tell by counting small numbers.

13.9 years ago

I like Jorge's answer here (and his others on SNPs) very much. Think of it this way. If the major allele is found 6 times but from 3 chromosomes (some chromosomes were read by the sequencing machine more than once) and the minor allele is found once, then the MAF is 25% (1/4). And if the major allele is found 6 times from 6 different chromosomes and the minor allele is again found once, then the MAF is ~14% (1/7). This illustrates what Jorge wrote about the difference between sampling read data and sampling individual chromosomes/individuals.


I like the answers of both you and Jorge very much when it comes to SNPs :)


thanks you 2 for the credit. we all are humbly contributing as best as we can to this forum, and I really think that it does help people out there.


The examples above happen all the time, but the positive and negative effects cancel out given a large data set. Computing base frequencies is valid when there are no sequencing errors; the variance is just larger.


At one site, base counting may give an overestimated or underestimated frequency, but that does not matter. What matters is the average over many sites. If you have many sites at a true frequency f, you can get back f with base counting if there are no sequencing errors. Without sequencing errors, base counting is a valid unbiased estimator. Nothing is wrong with that except that the variance is larger.
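
The "unbiased on average" claim can be checked with a small simulation. This is only a sketch under simplifying assumptions that are mine rather than anything stated in the thread: every chromosome in the pool is equally likely to be read, reads are drawn independently, and there are no sequencing errors. The function name `read_count_estimate` is hypothetical.

```python
import random

def read_count_estimate(minor_chromosomes, n_chromosomes, depth, rng):
    """Draw `depth` reads from a pool. Each read comes from a chromosome chosen
    uniformly at random (with replacement); return the minor-allele read fraction."""
    minor_reads = sum(
        rng.randrange(n_chromosomes) < minor_chromosomes for _ in range(depth)
    )
    return minor_reads / depth

rng = random.Random(1)       # fixed seed so the run is repeatable
n_sites = 20000              # many independent sites at the same true frequency
true_f = 3 / 20              # 3 of the 20 pooled chromosomes carry the minor allele
estimates = [read_count_estimate(3, 20, 7, rng) for _ in range(n_sites)]
mean_estimate = sum(estimates) / n_sites
print(f"true f = {true_f:.3f}, mean estimate over sites = {mean_estimate:.3f}")
print(f"single-site estimates range from {min(estimates):.3f} to {max(estimates):.3f}")
```

At 7x depth any single site can only report multiples of 1/7, so individual estimates are noisy even though the average over sites sits close to the true value.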


Larry, are you saying that, for the first example, you have performed one round of sequencing of 3 versions of the same chromosome with a coverage of 7 and found the major allele 6 times and the minor allele once? I'm not quite sure where you get the 25% from, so I thought I'd better check I've understood your scenario.


My point is, as said by Jorge in different words, you may have a certain number of reads across a SNP, but you don't know actually how many chromosomes that represents. Thus, the percentage of reads with allele 1 and the percentage with allele 2 can only give an estimate of MAF. Put another way, if you have a pool of 10 individuals (=20 chromosomes), possible MAFs are 0%, 5%, 10%, 15% etc., but if you have 23 reads over that SNP, you know you have sequenced one or more chromosomes more than once.
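
A small sketch of the two facts in this comment: with 20 chromosomes in the pool the true frequency can only be a multiple of 1/20, and once the depth exceeds 20 some reads must be repeats of a chromosome already seen. The function names are hypothetical and the code assumes a diploid organism.

```python
def feasible_frequencies(n_individuals, ploidy=2):
    """All allele frequencies that are possible in a pool of this size."""
    n_chromosomes = n_individuals * ploidy
    return [k / n_chromosomes for k in range(n_chromosomes + 1)]

def min_repeated_reads(depth, n_individuals, ploidy=2):
    """Lower bound on the number of reads that must re-sample a chromosome
    that was already read (pigeonhole argument)."""
    return max(0, depth - n_individuals * ploidy)

pool_freqs = feasible_frequencies(10)
print([f"{f:.0%}" for f in pool_freqs[:5]])  # first few feasible values
print(min_repeated_reads(23, 10))            # 23 reads over a 20-chromosome pool
print(min_repeated_reads(7, 10))             # at 7x nothing is forced to repeat
```

The bound says nothing about which chromosomes were re-read, and at low depth repeats can still happen by chance, which is exactly why the read fraction is only an estimate.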


Hi, I think I understand the difference between read coverage and individual coverage, but I don't know how you get the figure of 25% for the example you gave. I asked some maths people and they came up with a different number, which is why I wanted to make sure I had understood how the data in your example was obtained.


I appreciate that the number of reads can only give an approximation of MAF, but I still don't know how you can get an approximate MAF of 25% from the data if you have performed one round of sequencing of 3 versions of the same chromosome with a coverage of 7 and found the major allele 6 times and the minor allele once. Is this figure of 25% supposed to be at the read level or the population level? I presume it can only be at the coverage level, as we only have one sample. I know it might only have been an example, but it's bugging me now (and the other person I asked) :-D


Try this: Subject 1 is homozygous for G and gives 3 reads across the SNP. Subject 2 is heterozygous with 4 reads across the polymorphism - 3 for G and 1 for A. (Subjects 3 through 10 in our pool were not even sampled, likely because the read depth is too low.) Two chromosomes from Subj 1 and one from Subj 2 give us the major G allele. One of the Subj 2 chromosomes gives us the A. Thus, 1/4 of the chromosomes actually sampled or sequenced show A and the MAF is 25%. If I sequence one chromosome 1000 times and find C at a SNP and another chromosome just 3 times and see T, MAF = 50%.
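
Here is the same scenario as a Python sketch. It pretends we know which chromosome each read came from, which real pooled data never tells you; the chromosome labels such as `subj1_a` are invented purely to make the bookkeeping visible.

```python
def minor_fraction(counts):
    """Fraction held by the rarest allele; 0.0 if only one allele was seen."""
    if len(counts) < 2:
        return 0.0
    return min(counts.values()) / sum(counts.values())

def maf_by_reads(reads):
    """Minor allele fraction when every read is counted once."""
    counts = {}
    for _chromosome, allele in reads:
        counts[allele] = counts.get(allele, 0) + 1
    return minor_fraction(counts)

def maf_by_chromosomes(reads):
    """Minor allele fraction when every distinct chromosome is counted once."""
    allele_of = dict(reads)  # collapses repeated reads of the same chromosome
    counts = {}
    for allele in allele_of.values():
        counts[allele] = counts.get(allele, 0) + 1
    return minor_fraction(counts)

# Subject 1 (G/G): 3 reads over its two chromosomes.
# Subject 2 (G/A): 3 reads of the G chromosome, 1 read of the A chromosome.
reads = [
    ("subj1_a", "G"), ("subj1_a", "G"), ("subj1_b", "G"),
    ("subj2_a", "G"), ("subj2_a", "G"), ("subj2_a", "G"),
    ("subj2_b", "A"),
]
print(maf_by_reads(reads))        # read level: 1 of 7 reads
print(maf_by_chromosomes(reads))  # chromosome level: 1 of 4 chromosomes

# One chromosome read 1000 times (C), another read 3 times (T)
skewed = [("chrom1", "C")] * 1000 + [("chrom2", "T")] * 3
print(maf_by_reads(skewed), maf_by_chromosomes(skewed))
```

The gap between the two numbers is the whole problem with pooled data: only the read-level figure is observable, and the chromosome-level figure is the one you actually want.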

13.9 years ago

I was editing my previous answer, and then I realized how long it became, so I decided to open a new answer since now it covers pooled DNA appropriately (although unfortunately doesn't completely solve your problem). I should have done it before, but I guess I didn't get the right point at first. here we go then...

MAF, in essence, measures how probable it would be to find a certain allele in a population. calculating it by directly sampling individuals is straightforward, but I guess that with pooled DNA some further statistics are needed. unfortunately I haven't done any work on that, and maybe Larry's idea is enough, but I guess some further reading may be appropriate. I just followed PMID:16643673 and discovered 2 papers (PMID:15677751 and PMID:11140947) that describe methods for calculating allele frequencies on pooled DNA. also, the materials and methods section of this paper seems to point out the appropriate statistics to use in order to obtain allele frequencies from pooled DNA.

having said all this, I must say that pooled DNA techniques have been studied for years in Sanger sequencing and genotyping, but not that much with NGS as far as I know. one major problem you may find if you rely on NGS read counting is that you will have to consider heterozygosity, and you should be aware that one allele of a heterozygote tends to be under-represented with NGS techniques. I really don't know if you can trust NGS data alone to calculate allele frequencies, at least through such a direct method as counting read differences, since there are several steps in the mapping and SNP calling process that may introduce certain biases. I would encourage other BioStar readers to share any publication that may have covered this issue in particular: estimating allele frequencies from pooled DNA with NGS.


Perhaps you don't know much about pooled DNA, but not all of my data will be pooled, and you are very helpful and very good at explaining things, so I'd like to revisit this question when I know a bit more. I've got some pointers to set me on the right track now thanks to this post.


I would like to add Nils Homer's paper that prompted the closure of dbGaP public access to GWAS data (PMID: 18769715 and PMID: 18617537), commented on here by the Genomics Law Report: http://bit.ly/97SS4J


Jorge, thanks for the links. I will read these papers and revisit this question when I'm more informed. Would you mind if I emailed you in a few weeks to ask you to look at this question again? I can see you have a website link on your profile.


again, you're welcome. the link on my profile will lead you to my personal homepage, where you may find several ways to contact me. considering that my experience with pooled DNA is null, feel free to contact me to discuss anything else related to this matter if you still think I'd be of any help.

13.9 years ago
jvijai ★ 1.2k

Here is a statistical analysis paper, with software for this that has a funny name: PoPoolation. http://www.ncbi.nlm.nih.gov/pubmed/21253599

And here is another paper that discusses optimal pooling strategies for NGS.
http://www.ncbi.nlm.nih.gov/pubmed/21254222?dopt=Abstract

N.B: I have not read either at this point but have an interest in the methodology and application. Hope this helps.

13.9 years ago
lh3 33k

If there were no sequencing errors, base counting would be an unbiased estimator of site allele frequency. When there are sequencing errors, I am not aware of any simple estimators that are good enough. The two papers pointed out by jvijai are good in theory, but I doubt their usefulness in practice. The first paper aims at variant discovery rather than at being a good estimator of frequency. The second paper seems to assume accurate base quality, which is rarely the case.

As Jorge has pointed out, for 10 samples the best way is to barcode them. In my opinion, the additional cost of barcoding is minor in comparison to what you gain. With barcoding, the estimates can be much better.

If you are aiming at something simple with your current data, probably I would discard bases with low base or mapping quality and do base counts. The spectrum at f=0 is rubbish, but the density conditioned on f>0 should be about right.
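
A rough sketch of that simple approach, under my own assumptions: each site's pileup is available as a list of (base, base quality, mapping quality) tuples, the quality cut-offs of 20 are arbitrary placeholders, and "frequency" means the non-reference read fraction. None of the names or thresholds come from diBayes or any other tool.

```python
from collections import Counter

MIN_BASE_QUALITY = 20     # assumed cut-off, tune for your data
MIN_MAPPING_QUALITY = 20  # assumed cut-off, tune for your data

def site_frequency(pileup, ref_base):
    """Non-reference read fraction at one site after dropping low-quality bases.
    Returns None when no base survives the filters."""
    kept = [
        base for base, base_q, map_q in pileup
        if base_q >= MIN_BASE_QUALITY and map_q >= MIN_MAPPING_QUALITY
    ]
    if not kept:
        return None
    return sum(base != ref_base for base in kept) / len(kept)

def frequency_spectrum(frequencies, n_bins=10):
    """Histogram of site frequencies, conditioned on f > 0."""
    bins = Counter()
    for f in frequencies:
        if f is None or f == 0:
            continue  # the f = 0 class is dominated by errors and filtering
        bins[min(int(f * n_bins), n_bins - 1)] += 1
    return [bins[i] for i in range(n_bins)]

# Three hypothetical sites: (reference base, pileup)
sites = [
    ("G", [("G", 30, 60)] * 6 + [("A", 30, 60)] + [("T", 5, 60)]),   # T filtered
    ("C", [("C", 35, 50)] * 8),                                      # invariant
    ("T", [("T", 30, 60)] * 5 + [("C", 30, 60)] * 5 + [("C", 30, 3)]),
]
frequencies = [site_frequency(pileup, ref) for ref, pileup in sites]
spectrum = frequency_spectrum(frequencies)
print(frequencies)
print(spectrum)
```

With real data you would probably also want a minimum depth per site, since a site covered by two or three reads contributes a very coarse frequency.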


Hi - I only have one sample. It's pooled DNA from 10 individuals. In the not too distant future I will have multiple samples, but at present it's a pooled sample.


amazing how much I learn from this site :)
