Question

Is there code to find a consensus motif?

7.0 years ago

vinayjrao ▴ 260

Hello,

I've been trying to write code to find a consensus motif in a given sequence, but so far I have only managed to find a fixed substring in a string. I want to be able to allow multiple nucleotides/amino acids at each position, and also to enter N/X to represent any nucleotide/amino acid. I would very much appreciate any help.

Thanks.

P.S. The post tags represent the languages I'm comfortable understanding.

Edit: Example of the consensus motif - A/T A A G C A A/T/G N N A

Sequence - CGATCGTG TAAGCAGCTA GTCATG

The middle block, TAAGCAGCTA (bolded in the original post), is the part that matches the consensus.

C awk shell python • 3.1k views

updated 7.0 years ago by Carlo Yague 9.0k • written 7.0 years ago by vinayjrao ▴ 260

score 1 · Accepted Answer · 2018-08-07

7.0 years ago

Carlo Yague 9.0k

In shell using grep and regular expressions:

echo 'CGATCGTG TAAGCAGCTA GTCATG' | grep -o "[AT]AAGCA[ATG]..A"
TAAGCAGCTA

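'N' is written as '.', which matches any single character (including a space or a non-nucleotide letter, so use [ACGT] instead if the match must be restricted to nucleotides). Alternative nucleotides at one position go inside square brackets, e.g. [AT].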
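For longer motifs, the translation from the slash notation used in the question to a regular expression can be automated. What follows is only a sketch: the function name motif_to_regex is made up for illustration, and it assumes an uppercase DNA alphabet (ACGT), positions separated by spaces, and alternatives separated by '/'. It maps N to [ACGT] rather than '.', so spaces in the input are not matched.

```shell
# Hypothetical helper: convert a motif such as "A/T A A G C A A/T/G N N A"
# into a regular expression for grep -E.
# Assumptions: uppercase DNA (ACGT), space-separated positions,
# '/'-separated alternatives, N meaning "any nucleotide".
motif_to_regex() {
    printf '%s\n' "$1" | sed -E \
        -e 's/N/[ACGT]/g' \
        -e 's#([ACGT](/[ACGT])+)#[\1]#g' \
        -e 's#[ /]##g'
}

regex=$(motif_to_regex 'A/T A A G C A A/T/G N N A')
echo "$regex"    # [AT]AAGCA[ATG][ACGT][ACGT]A
echo 'CGATCGTG TAAGCAGCTA GTCATG' | grep -Eo "$regex"    # TAAGCAGCTA
```

The three sed steps expand N, wrap each group of slash-separated alternatives in brackets, and finally drop the spaces and slashes. For protein motifs the alphabet and the wildcard letter (X) would need to be changed accordingly.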


Thanks a lot. It's perfect.

7.0 years ago by vinayjrao ▴ 260

Along the same lines as Carlo Yague's answer (pattern quoted so that the shell passes it to grep unchanged):

echo 'CGATCGTG TAAGCAGCTA GTCATG' | grep -Po '([AT])A{2}GCA[\1G].{2}A'
TAAGCAGCTA

7.0 years ago by cpad0112 21k

Thanks. This works too. I could use .{2} when I have longer runs of any nucleotide/amino acid. However, I would like to know why it is [\1G] and not [ATG].

7.0 years ago by vinayjrao ▴ 260

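The idea was to make the first [AT] a capture group so that it can be referred to later by its number (\1 here). That does not actually work in this case, though: a backreference matches the exact text the group captured (T in this example) rather than re-applying [AT], and inside square brackets \1 is not treated as a backreference at all. The command above finds the motif only because the seventh position of the example happens to be G. [ATG] is the correct way to write that position.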
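A quick way to check how a numbered group behaves is to try it on two short strings. The strings below are invented for illustration, and the patterns use basic regular expressions, where \( \) groups and \1 are standard in grep.

```shell
# Backreference: \1 must repeat the exact letter captured by \([AT]\).
echo 'TAAGCAT' | grep -o '\([AT]\)AAGCA\1'                      # TAAGCAT (T ... T)
echo 'TAAGCAA' | grep -o '\([AT]\)AAGCA\1' || echo 'no match'   # no match (T ... A)

# Bracket expression: each position is checked independently.
echo 'TAAGCAA' | grep -o '[AT]AAGCA[ATG]'                       # TAAGCAA
```

For a consensus motif, where every position is independent of the others, the bracket form is the one that fits.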