I need a way to assign a number to a gene representing how broken it is, depending on how many SNPs with minor alleles appear in that gene. This would be based on just a handful of SNPs though, not every SNP.
So let's say for example I know a few SNPs for COMT:
rs165599 AA
rs6269 AA
rs737865 AA
rs4633 TT
rs769224 GG
rs4680 AA
and some SNPs for STAT3:
rs744166 GG
rs8069645 AG
rs9891119 CC
rs6503691 CC
and I wanna find out which gene is more defective (as in which has the most diminished ability to encode the corresponding protein), how can I do that? These genes obviously have way more SNPs than just those ones, and in every gene I will have many minor alleles, so I'm guessing it's only certain SNPs that cause problems for the gene. So could I find a number that tells me how much a bad allele of a particular SNP breaks the gene, and then use that number to get an overall idea of how malfunctioning the gene is?
Thanks for the answers, I didn't know about VEP, SIFT and PolyPhen. I tend to use every one of those words you listed. I'm not so good with euphemisms but point received, have to be more considerate with my terminology :) So I see that VEP tells you what type of variant it is, so I'm guessing with nonsense variants you can assume the gene's function is gonna be greatly altered. I couldn't figure out if they assign a number as an estimate of how greatly the gene's function is altered by the allele. Do they actually have that?
Also, is there a database which has these kinds of values determined through scientific studies (as opposed to computational methods, which I'm assuming VEP uses)? And I wonder how this would work with multiple SNPs, would you add the values together? Like for example, let's say in my COMT gene I have a minor allele on rs165599 which is known to reduce the gene's ability to produce COMT by 10%. Then on rs6269 I have two minor alleles which reduce the gene's transcription abilities by 20% for each allele. Would this add up to 10% + 20% + 20% = 50%, meaning that the gene now only produces half as much COMT as a gene without these variants? I know things are gonna be way more complex than this, but could you make a rough estimate using this logic?
Unfortunately it doesn't work like this for two main reasons:
The first is that going from genomic code to phenotype is a huge, huge problem, which humanity is nowhere near solving (if it ever will). The programs mentioned, although very good at what they do all things considered, cannot be used as "evidence" for anything because the scores they give are almost meaningless. When you put sequencing data through them where the causative variant is known, SIFT/PolyPhen either call the variant right at the top (if it's a premature stop), or right at the bottom (if it's basically anything else). This is because there are lots of things that can go wrong in mapping that result in things that look like frameshifts, and there are plenty of non-synonymous variants which do nothing but look scary - and the true variant is a change in some promoter/enhancer site, not even registered by these programs as a problem (possibly because the site isn't even known). Because there is so much noise in this data, adding up scores for genes really just adds up noise - diluting out the real instances of true signal.
The second problem - which is really the first problem from a different angle - is that genetics doesn't work like this.
As far as evolution is concerned, two wrongs can make a right - for example, deleting and then later adding a base could be counted as two very serious frameshift mutations, but together they might have absolutely no effect on the amino-acid sequence at all.
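Here is a toy sketch of that deletion-plus-insertion case. The 18-base sequence and the two edit positions are made up purely for illustration (they are not from any real gene), and the translation uses the standard genetic code. Scored one at a time, each event looks like a frameshift; taken together, the protein comes out unchanged.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Hypothetical toy coding sequence: ATG GCT CTG AAA GGT TAA
reference = "ATGGCTCTGAAAGGTTAA"

# Event 1: delete the T at index 5 (a frameshift on its own)
deleted = reference[:5] + reference[6:]

# Event 2: later, insert a T at index 6 (restores the reading frame)
rescued = deleted[:6] + "T" + deleted[6:]

print(translate(reference))  # MALKG
print(translate(deleted))    # MA (premature stop)
print(translate(rescued))    # MALKG
```

A per-variant score would flag both events as severe, and summing those scores would make this toy gene look twice as broken even though the protein is identical.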
Alternatively, a premature stop mutation in one gene might be irrelevant since there are 5 other backup copies. Things like copy-number variations may not be visible via sequencing at all, but have an obviously huge effect on the individual's phenotype. So basically, as I think I said in the other thread h.mon linked, these tools do a good job at trying to reduce some of this complexity - but these days you cannot publish a paper because SIFT gave it a really high score. You need direct evidence, because when you go fishing with high quality bait you'll always catch some kind of fish.
Well, there's OMIM - Online Mendelian Inheritance in Man - but it's kind of old. The guy who started it, Dr McKusick, was the PI of my old PI so I'm a little biased. There are probably other databases out there now that work with probability estimates and small effects being registered too, etc. I'd start by looking at all dbSNP has to offer by way of metadata.
No way - because of point 2 again: a second change might rescue the phenotype, or might be unable to make it worse. Also, you have 2 copies of every gene, so it would be important to know which copy each variant is on, because a 20% reduction in one copy might mean a 120% boost in the other copy to compensate (often the case).
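To put numbers on why the simple sum is shaky, here is a small sketch using the hypothetical 10% / 20% / 20% figures from the question. Both combination rules below are my own illustrative assumptions, not validated models of gene expression, and neither accounts for rescue or compensation. The point is only that the "answer" depends on which unverified rule you pick.

```python
# Hypothetical per-allele fractional losses from the question
losses = [0.10, 0.20, 0.20]

def remaining_additive(losses):
    """Assume losses simply add up (floored at zero output)."""
    return max(0.0, 1.0 - sum(losses))

def remaining_multiplicative(losses):
    """Assume each loss scales whatever output is left."""
    remaining = 1.0
    for loss in losses:
        remaining *= 1.0 - loss
    return remaining

print(round(remaining_additive(losses), 3))        # 0.5
print(round(remaining_multiplicative(losses), 3))  # 0.576
```

Same three inputs, two different estimates, and that is before asking whether the per-SNP percentages themselves are known at all.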
In addition to what has been said, the number of SNPs with minor alleles in a gene may also depend on how different the ancestry of your sample is from that of the reference sequence.