SNP distribution/genome locations
9.5 years ago
H. ▴ 20

Hi everyone,

I have a few questions about the distribution of disease SNPs on the genome. I apologize in advance if some of them seem trivial to you. I'm a statistician, not a bioinformatician. I've tried to find answers on my own by searching the web and this forum, but I'm still unsure whether I understood them correctly, and I didn't find everything I was looking for.

Therefore your help will be very much appreciated...!

So, if my sources are correct, each genome has approx. 3.3 million SNPs, of which about 1.3 million are in intragenic regions. Of the latter, 25,000 to 40,000 SNPs are in protein-coding regions.

Q1: If I understood correctly, a gene is not entirely a protein-coding region. What exactly is the difference between a gene and its protein-coding part? And what about exons? Are they a synonym for protein-coding regions, or for genes, or...?

Q2: Are the boundaries of the protein-coding region known, or do they have to be inferred (and if so, how)?

Q3: Is it reasonable to assume that the proportion of SNPs associated with a given phenotype is much higher in protein-coding regions than in the rest of the genome? Can we say the same for SNPs in intragenic regions compared with those in intergenic regions?

Q4: Given that the number of SNPs in non-coding regions is much higher than in coding regions, I guess that most GWAS will flag many more SNPs as "significantly associated with the phenotype" in the non-coding regions. Is this guess correct? But still, in terms of proportions within each of these two regions, a SNP in a coding region is more likely to be associated with the phenotype, is that right?

Q5: In common Illumina chips (for instance, those with 500k SNPs), are SNPs in protein-coding regions over-represented relative to SNPs in intergenic regions? If so, what is the typical proportion of such SNPs on these chips? The proportion of SNPs found significant in GWAS would then also depend on the choices made by the chip designer.

I thank you very much in advance....

coding-regions SNP phenotype • 3.4k views
9.5 years ago
Floris Brenk ★ 1.0k

Q1: Genes have exons and introns. Exons usually code for protein (although they also contain the untranslated regions, UTRs); introns do not. However, in some transcripts (and hence in the resulting proteins) an exon can be skipped.

Q2: The boundaries of exons are known, typically from gene-prediction software or from RNA sequencing. Open reading frames and splice-site sequences are also known, so the proteins can be predicted from them.

Q3: I don't completely understand the question. Variants are present in both the coding and the non-coding parts of the genome. However, more variation is tolerated in the non-coding parts, since in general they are less crucial. Keep in mind that there are several types of coding variants: loss-of-function variants (splice sites, frameshifts), missense variants and synonymous variants. Loss-of-function variants are likely to damage the protein, but are not necessarily disease causing. Missense mutations can be disease causing, but typically only in vital parts of vital proteins, for example in APP or PSEN1 in Alzheimer's disease. Even some synonymous mutations can be disease causing...
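To make the variant types above concrete, here is a minimal Python sketch that classifies a single-codon substitution using the standard genetic code. The function name `classify_codon_change` and the example codons are illustrative choices, not part of any tool mentioned in this thread. It only handles base substitutions inside one codon; frameshifts and splice-site variants would need transcript-level information.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
# "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a substitution within one codon (hypothetical helper)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # same amino acid, protein sequence unchanged
    if alt_aa == "*":
        return "nonsense"     # premature stop codon, usually loss of function
    if ref_aa == "*":
        return "stop_lost"    # stop codon replaced by an amino acid
    return "missense"         # one amino acid replaced by another


if __name__ == "__main__":
    for ref, alt in [("GAA", "GAG"), ("GAG", "GTG"), ("TGG", "TGA")]:
        print(ref, "->", alt, ":", classify_codon_change(ref, alt))
```

Note that the label says nothing about whether a variant causes disease; as pointed out above, that depends on which protein and which part of it is affected.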

Q4: Yes, that is correct, but not necessarily for the reason in your hypothesis. GWAS hits are usually risk factors, not disease-causing mutations. Non-coding GWAS variants are, for example, often eQTLs or lie near a regulatory element. So in general: no, SNPs in coding regions are not more likely to be associated with disease; in my view GWAS hits are usually risk factors and more often regulatory variants.

Q5: Illumina chips are usually designed around tagging SNPs, and genotyping is followed by an imputation step to end up with about 5 million variants. In general, "common" coding variants don't cause disease; it is the rare coding variants that are more prone to be disease causing, especially in Mendelian diseases.
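The counting argument behind Q4 can be sketched with a few lines of Python. The SNP counts are the approximate figures quoted in the question; the two per-SNP association rates are purely hypothetical numbers chosen for illustration, not estimates from any study. The point is only that, even if one assumed a 10-fold higher rate for coding SNPs, the sheer number of non-coding SNPs would still produce more hits in absolute terms.

```python
# Approximate per-genome counts quoted in the question.
TOTAL_SNPS = 3_300_000
CODING_SNPS = 40_000  # upper end of the quoted 25,000-40,000 range
NONCODING_SNPS = TOTAL_SNPS - CODING_SNPS

# HYPOTHETICAL per-SNP association rates, for illustration only.
CODING_RATE = 1e-3
NONCODING_RATE = 1e-4


def expected_hits(n_snps: int, rate: float) -> float:
    """Expected number of associated SNPs under a simple per-SNP rate."""
    return n_snps * rate


coding_share = CODING_SNPS / TOTAL_SNPS
coding_hits = expected_hits(CODING_SNPS, CODING_RATE)
noncoding_hits = expected_hits(NONCODING_SNPS, NONCODING_RATE)

if __name__ == "__main__":
    print(f"coding SNPs as a share of all SNPs: {coding_share:.2%}")
    print(f"expected coding hits:     {coding_hits:.0f}")
    print(f"expected non-coding hits: {noncoding_hits:.0f}")
```

Under these assumed rates the expected counts are about 40 coding hits versus about 326 non-coding hits. This simple model ignores LD, chip design and imputation, all of which affect which SNPs are actually tested.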

Thank you very much for this long reply, it helps a lot!

Just one more question, to be totally clear: can I reformulate your sentence "However more variation is allowed in non coding parts of the genome since in general they are less crucial." as: in general, across the genome, the probability that a SNP is associated with a trait is higher if it is inside an exon (or protein-coding part) than if it is outside? Or do I have to distinguish between "causal" and "associated"? Is it often the case (and how often?) that a SNP that does not regulate the expression of any protein is nevertheless causal for a trait? Or can it only be "associated with the disease" in the sense that it tags a causal SNP that does regulate protein expression? In general, does a SNP have to affect a protein (its expression, structure, or...) to be causal for a trait?

Thanks a lot again....

It is indeed crucial to distinguish between "causal" and "associated". GWAS variants can be causal, but most likely the majority are not. A GWAS variant can have many different effects: eQTL, methylation effect, altered RNA degradation, splicing effect, etc. Regulating expression (eQTL) is just one easy explanation for a GWAS risk variant, but it doesn't explain them all. And GWAS variants are typically in high LD with other variants...
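For a statistician, "high LD" can be read as a high squared correlation between the alleles at two sites. Below is a minimal Python sketch that computes the usual r² measure from phased haplotypes; the function name `ld_r_squared` and the toy haplotypes are made up for illustration, and real analyses would rely on dedicated software and much larger samples.

```python
from typing import List, Tuple


def ld_r_squared(haplotypes: List[Tuple[int, int]]) -> float:
    """r^2 between two biallelic SNPs.

    Each haplotype is a pair (a, b) with the allele (0 or 1)
    carried at SNP A and at SNP B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n    # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n    # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                       # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


if __name__ == "__main__":
    # Toy data (hypothetical): alleles always travel together.
    perfect = [(1, 1), (1, 1), (0, 0), (0, 0)]
    # Toy data (hypothetical): alleles are independent.
    independent = [(1, 1), (1, 0), (0, 1), (0, 0)]
    print(ld_r_squared(perfect), ld_r_squared(independent))
```

When r² between a tested SNP and a nearby variant is close to 1, the association signal cannot tell the two apart, which is why an associated SNP may merely tag the causal one.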

Your last question is complicated to answer: if a SNP doesn't affect a protein (its expression, structure, or...), why would it be causal? There has to be an effect for it to be causal...

Side note: keep an open mind about non-coding genes. Protein-coding genes are important, but non-coding genes are also crucial for the biology.
