Question

Significance of motifs around mutations

0

5.9 years ago

Gene_MMP8 ▴ 240

I have a set of disease-causing mutations and their flanking neighborhoods (4 bp on each side). I want to determine whether these flanking neighborhoods differ from those of non-disease mutations.
E.g.
ATTG M TTGA (M = mutation, disease-causing)
TTAG M GAGG (M = mutation, non-disease-causing)
How do I estimate the background distribution of all such neighborhoods (4 bp on each side) across the entire genome? Can you help me formulate a statistical test to differentiate between the two?

Assembly R • 1.4k views

ADD COMMENT • link updated 5.8 years ago by GouthamAtla 12k • written 5.9 years ago by Gene_MMP8 ▴ 240

score 1 · Answer 1 · 2019-01-31

1

5.8 years ago

GouthamAtla 12k

Do you have a specific motif around the disease-causing mutations? If you have one specific motif, you can check how often you observe the same motif around non-disease-causing SNPs and perform a simple Fisher's exact test.
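To make the Fisher's test idea concrete, here is a minimal sketch using only the Python standard library. The function names (`motif_table`, `fisher_exact_two_sided`) and the rule that a motif counts as present when it matches either 4 bp flank are assumptions made for illustration, not something specified in the thread.

```python
from math import comb

def motif_table(disease_flanks, control_flanks, motif):
    """Build the 2x2 counts (a, b, c, d): motif present/absent per group.

    Each flank is a (left, right) tuple of 4 bp strings. Counting a hit when
    the motif equals either side is an assumption of this sketch.
    """
    def hits(flanks):
        return sum(1 for left, right in flanks if motif in (left, right))
    a, c = hits(disease_flanks), hits(control_flanks)
    return a, len(disease_flanks) - a, c, len(control_flanks) - c

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (odds_ratio, p_value). The p-value sums the hypergeometric
    probabilities of all tables that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x motif hits in the disease row, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-7))
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, min(1.0, p_value)

# Toy data (made up): 10 disease flanks, 6 non-disease flanks
disease = [("ATTG", "TTGA")] * 8 + [("TTAG", "GAGG")] * 2
control = [("ATTG", "CCCC")] * 1 + [("TTAG", "GAGG")] * 5
odds, p = fisher_exact_two_sided(*motif_table(disease, control, "ATTG"))
print(f"odds ratio = {odds:.1f}, p = {p:.4f}")
```

If SciPy is available, `scipy.stats.fisher_exact` performs the same test and is the more practical choice for real data.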

If you don't have a prior motif or k-mer, you can take all your disease-causing mutations and perform a k-mer enrichment analysis or a typical motif analysis to find enriched patterns.

Do the same for the non-disease-causing SNPs; you may not find similar k-mers or motifs around them.
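A rough sketch of what such a k-mer enrichment comparison could look like, again in plain Python. It counts each full flanking context (left and right 4 bp joined) in both sets and ranks contexts by a smoothed log2 frequency ratio. The function names, the `LEFT_RIGHT` key format, and the pseudocount of 0.5 are assumptions of this example; a real analysis would also need a significance test per context and a multiple-testing correction.

```python
from collections import Counter
from math import log2

def flank_counts(flanks):
    """Count each (left, right) flank pair, keyed as 'LEFT_RIGHT'."""
    return Counter(f"{left}_{right}" for left, right in flanks)

def enrichment(disease_flanks, control_flanks, pseudocount=0.5):
    """Rank flank contexts by log2(disease frequency / control frequency).

    The pseudocount keeps contexts seen in only one set from producing
    a division by zero; its value here is an arbitrary choice.
    """
    d = flank_counts(disease_flanks)
    c = flank_counts(control_flanks)
    contexts = set(d) | set(c)
    k = len(contexts)
    nd, nc = len(disease_flanks), len(control_flanks)
    scores = {}
    for ctx in contexts:
        fd = (d[ctx] + pseudocount) / (nd + pseudocount * k)
        fc = (c[ctx] + pseudocount) / (nc + pseudocount * k)
        scores[ctx] = log2(fd / fc)
    # Most disease-enriched contexts first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (made up) reusing the two example contexts from the question
disease = [("ATTG", "TTGA")] * 3 + [("TTAG", "GAGG")]
control = [("TTAG", "GAGG")] * 3 + [("ATTG", "TTGA")]
for ctx, score in enrichment(disease, control):
    print(ctx, round(score, 3))
```

The control set here plays the role of the background; swapping in contexts sampled from the whole genome would give a genome-wide background instead.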

PS: I'm not a geneticist.

ADD COMMENT • link 5.8 years ago by GouthamAtla 12k

0

Thanks for your reply. I have tried something very similar to your first suggestion: I looked for motifs that are over-represented around disease-causing mutations compared to non-disease ones, and I have a list of such motifs with their respective significance values. The problem is that I am unable to attach any biological significance to them. Say I have AATG over-represented around a particular mutation type (C>G): how do I go back and check whether there is any biological basis for such behaviour? Any help is greatly appreciated!

ADD REPLY • link 5.8 years ago by Gene_MMP8 ▴ 240

0

That needs more information and biological context, such as whether the variants are coding, non-coding, splice-site variants, etc.

ADD REPLY • link 5.8 years ago by GouthamAtla 12k

score 0 · Answer 2 · 2019-01-30

If you have your SNP sequences in aligned FASTA format (or you can create aligned SNPs using snp-sites, which is Linux-based: https://github.com/sanger-pathogens/snp-sites), then you can import them into Tassel or other online software to create a HapMap file. Separately, run your reference sequence through the ENSEMBL blastn server to map the positions of your original reference sequence (not the SNP file). Then compare the SNP positions from your SNP data with the ENSEMBL result to assign positions accurately, and create a VCF file from the HapMap file, again in Tassel. Once you have the VCF file, use VEP (https://www.ensembl.org/Tools/VEP) at ENSEMBL to detect variant forms of the sequences at the nucleotide and protein level around your region of interest.