Significance of motifs around mutations
5.9 years ago
Gene_MMP8 ▴ 240

I have a set of disease-causing mutations and their flanking neighborhoods (4 bp on each side). I want to determine whether these flanking neighborhoods differ from those of non-disease mutations.
E.g.
ATTG M TTGA (M=Mutation, disease-causing)
TTAG M GAGG (M=Mutation, non-disease causing)
How do I estimate the background distribution of all such neighborhoods (4 bp on each side) across the entire genome? Can you help me formulate a statistical test to differentiate between the two?

Assembly R • 1.4k views
5.8 years ago

Do you have a specific motif around the disease-causing mutations? If you do, you can check how often you observe the same motif around non-disease-causing SNPs and perform a simple Fisher's exact test on the resulting 2x2 table (motif present/absent vs. disease/non-disease).
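A minimal sketch of that test in plain Python, assuming each mutation is stored as a (left flank, right flank) pair. The function names and the toy flanks are made up for illustration; with real data you would probably use a library routine such as scipy.stats.fisher_exact instead of the hand-rolled version here.

```python
from math import comb

def count_motif(flanks, left, right):
    """Return (n_with_motif, n_without_motif) for a list of (left, right) flank pairs."""
    with_motif = sum(1 for l, r in flanks if l == left and r == right)
    return with_motif, len(flanks) - with_motif

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a: disease with motif      b: disease without motif
    c: non-disease with motif  d: non-disease without motif
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x motif hits in the disease row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Toy flanks, invented for illustration only (not real data).
disease = [("ATTG", "TTGA"), ("ATTG", "TTGA"), ("ATTG", "TTGA"), ("CCGA", "TTAC")]
non_disease = [("TTAG", "GAGG"), ("ATTG", "TTGA"), ("GGCA", "ACCT"), ("TTAG", "GAGG")]

a, b = count_motif(disease, "ATTG", "TTGA")
c, d = count_motif(non_disease, "ATTG", "TTGA")
print(a, b, c, d, fisher_exact_two_sided(a, b, c, d))
```

If you test many candidate motifs this way, the p-values need a multiple-testing correction before you read anything into them.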

If you don't have a prior motif/k-mer, you can take all your disease-causing mutations and perform a k-mer enrichment analysis, or a typical motif analysis, to find enriched patterns.

Do the same for the non-disease-causing SNPs; you may not find similar k-mers or motifs around them.
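Roughly, the k-mer comparison could look like the sketch below. It assumes each flank is a plain string, counts how many flanks in each group contain a given k-mer, and ranks k-mers by a pseudocount-smoothed log2 ratio. The function names, the pseudocount of 0.5 and the toy sequences are my own choices for illustration; a dedicated motif tool would do this more rigorously.

```python
from collections import Counter
from math import log2

def kmer_presence(flanks, k):
    """For each k-mer, count how many flank sequences contain it at least once."""
    counts = Counter()
    for seq in flanks:
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return counts

def kmer_enrichment(disease, control, k=3, pseudo=0.5):
    """Rank k-mers by log2(fraction in disease flanks / fraction in control flanks)."""
    d = kmer_presence(disease, k)
    c = kmer_presence(control, k)
    rows = []
    for kmer in set(d) | set(c):
        # The pseudocount keeps the ratio finite when a k-mer is absent from one group.
        frac_d = (d[kmer] + pseudo) / (len(disease) + 2 * pseudo)
        frac_c = (c[kmer] + pseudo) / (len(control) + 2 * pseudo)
        rows.append((kmer, d[kmer], c[kmer], log2(frac_d / frac_c)))
    return sorted(rows, key=lambda r: r[3], reverse=True)

# Toy flanks, invented for illustration only (not real data).
disease_flanks = ["ATTG", "ATTC", "TTGA"]
control_flanks = ["TTAG", "GAGG", "GGCA"]

for kmer, n_dis, n_ctl, score in kmer_enrichment(disease_flanks, control_flanks):
    print(kmer, n_dis, n_ctl, round(score, 2))
```

The ranking alone is not a significance test; each top k-mer can then go into a 2x2 table for Fisher's exact test, with a multiple-testing correction across all k-mers tested.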

PS: I'm not a geneticist.


Thanks for your reply. I have tried something very similar to your first suggestion. I tried finding motifs that are over-represented around disease-causing mutations as compared to non-disease. I also have a list of such motifs with their respective significance values. But the problem is I am unable to attach any biological significance to that. Say if I have AATG over-represented around a particular mutation type (C>G), how do I go back and check whether there's any biological basis for such behaviour. Any help is greatly appreciated!


That needs more information and biological context, such as whether they are coding, non-coding, or splice-site variants, etc.

5.8 years ago

If you have your SNP sequences in aligned FASTA format (or you can create aligned SNPs using snp-sites, which is Linux-based: https://github.com/sanger-pathogens/snp-sites), you can import them into Tassel or other online software to create a HapMap file. Separately, submit your reference sequence to the ENSEMBL blastn server to map the positions of your original reference sequence (not the SNP file). Then compare the SNP positions from your SNP data with the ENSEMBL result to place them accurately, and create a VCF file from the HapMap file, again in Tassel. Once you have the VCF file, use VEP (https://www.ensembl.org/Tools/VEP) at ENSEMBL to detect variant forms of the sequences at the nucleotide and protein level around your region of interest.


This has nothing to do with what OP wants.
