Question

How to find protein motif in DNA sequence

6.9 years ago

Benn 8.4k

I have a protein motif (site) that I would like to identify in DNA sequences (a multi-FASTA file). The motif is N-X-S/T (X != P): Asn, followed by any amino acid except Pro, followed by Ser or Thr. X should also not be a STOP codon. So I would like to find all the 3-codon combinations for this site in DNA (9 nucleotides).

I first thought of writing the motif in DNA using IUPAC codes, but that did not seem possible. Writing out all possibilities seems too hard a task, so I thought there might be a tool that can do this. Any suggestions?
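For what it's worth, writing out all possibilities is tractable by script. Below is a minimal Python sketch, assuming the standard genetic code (NCBI translation table 1); the names `CODON_TABLE` and `codons_for` are made up for this illustration. It builds every 9-nucleotide string that encodes N-X-S/T with X being neither Pro nor STOP.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(allowed):
    """Return all codons whose amino acid satisfies the predicate `allowed`."""
    return sorted(c for c, aa in CODON_TABLE.items() if allowed(aa))


first = codons_for(lambda aa: aa == "N")         # Asn
middle = codons_for(lambda aa: aa not in "P*")   # anything except Pro and STOP
last = codons_for(lambda aa: aa in "ST")         # Ser or Thr

# Every 9-mer that encodes the site
nonamers = {a + b + c for a, b, c in product(first, middle, last)}

print(len(first), len(middle), len(last), len(nonamers))  # 2 57 10 1140
```

That gives 2 x 57 x 10 = 1140 distinct 9-mers, which is small enough to keep in a set and test each window of a sequence against.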

sequence motif • 2.9k views

Doesn't BLAST(P) already support certain redundant characters?

I'm not sure you'll be able to define all of those exactly, since typically X means any amino acid (I think), without any restriction. You may not be able to find an alphabet that supports all of what you need.

You could maybe blast: NXS and NXT, and then filter the results with a regex to make sure that the next codon is != *

ADD REPLY • link 6.9 years ago by Joe 22k

score 4 · Accepted Answer · 2018-08-07

6.9 years ago

cschu181 ★ 2.8k

Haven't tried it, but you could do a 2-tiered grep-approach. Make sure your fasta is not line-wrapped.

grep -o "AA[CT][ACGT]\{3\}\([AT]C[ACGT]\|AG[CT]\)" fasta_file | grep -v "[ACGT]\{3\}CC[ACGT][ACGT]\{3\}"

Assuming Asn = AAY = AA[CT], Ser = TCN or AGY = TC[ACGT] or AG[CT], Thr = ACN = AC[ACGT], and Pro = CCN = CC[ACGT] (written with T rather than U, since the input is DNA), the first part should match all peptides N-X-S/T, with X in the traditional sense of any amino acid, and the second should get rid of the ones that contain proline in the central position. I am not sure whether you have to use an additional set of \(\) in the first expression.
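The same idea can be sketched in Python for anyone who would rather not depend on an unwrapped FASTA or on grep's escaping rules. This is only an illustration under a few assumptions: the names `SITE`, `read_fasta`, `find_sites` and the `frame` parameter are made up here, only the forward strand is scanned, and the sequences are plain A/C/G/T. A negative lookahead rejects Pro (CCN) and the three stop codons in the middle position, and the outer lookahead lets overlapping sites be reported.

```python
import re

# Asn, then one codon that is neither Pro (CCN) nor a stop (TAA, TAG, TGA),
# then Ser (TCN, AGY) or Thr (ACN). Wrapped in a lookahead so that
# overlapping sites are all found.
SITE = re.compile(
    r"(?=(AA[CT]"
    r"(?!CC[ACGT]|TA[AG]|TGA)[ACGT]{3}"
    r"(?:[AT]C[ACGT]|AG[CT])))",
    re.IGNORECASE,
)


def read_fasta(lines):
    """Yield (name, sequence) pairs, joining line-wrapped sequences."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_sites(seq, frame=None):
    """Return (0-based start, 9-mer) hits; if frame is 0, 1 or 2,
    keep only hits that start in that reading frame."""
    hits = []
    for m in SITE.finditer(seq):
        if frame is None or m.start() % 3 == frame:
            hits.append((m.start(), m.group(1)))
    return hits
```

With a file, something like `for name, seq in read_fasta(open("fasta.fa")): print(name, find_sites(seq))` would list the hits per record. Passing `frame=0` restricts the hits to codons in the first reading frame, which the grep version cannot do.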

Sounds like a good solution. I have tried it, and indeed the extra "\" is necessary. Can you explain what the "\" does here with grep? (I am learning, thanks!)

grep -o "aa[ct][acgt]\{3\}\([at]c[acgt]\|ag[ct]\)" fasta.fa | grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}"

ADD REPLY • link 6.9 years ago by Benn 8.4k

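The backslashes here are meant for grep rather than the shell: inside double quotes the shell passes them through unchanged, and in grep's default basic regular expressions, \{ \}, \( \) and \| switch on the special meaning (repetition, grouping, alternation), whereas the bare characters are matched literally. I never really understand which characters need to be escaped and which don't... This post gives an overview but some stuff ("may need to be quoted under certain circumstances.") just feels as if one has freshly escaped from an asylum...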

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k

Which ones do and don’t need to be escaped depends on your shell, on whether you’re using extended regular expressions (grep -E vs. plain grep), and on some other factors such as whether you’re using quotes or not.

ADD REPLY • link 6.9 years ago by Joe 22k

Yes, but this is all a big, big mess that way...

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k

No worries about the backslashes @cschu181, every user (OP) is responsible for double-checking whether the code given here as an answer really does the trick (or needs a little tweaking). In this case I could use your approach (and I liked it, especially the grep -v part). I actually modified it a bit, but the idea was certainly yours. I ended up using fuzznuc (from EMBOSS) with your pattern suggestion, and then grep -v to get rid of the proline patterns.

fuzznuc -pattern AA[CT]NNN[AT]CN -sequence fasta.fa -outfile prog_pattern_1.txt

fuzznuc -pattern AA[CT]NNNAG[TC] -sequence fasta.fa -outfile prog_pattern_2.txt

cat prog_pattern_1.txt prog_pattern_2.txt | grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" > prog_pattern_no_Pro.txt

So thanks for the help!

ADD REPLY • link 6.9 years ago by Benn 8.4k

Glad to help. Nice modification with fuzznuc (much more concise than regexing the whole thing.)

ADD REPLY • link 6.9 years ago by cschu181 ★ 2.8k