Program For Motif Search
14.6 years ago
Dhivya ▴ 30

Hi,
I want to find a particular motif (GTGGTGGGCC) in the Arabidopsis thaliana whole genome. Is there a way to write a program for this in Perl or Python?

programming • 7.4k views

The reply is "of course" :)

14.6 years ago
Neilfws 49k

Since you ask "Is there a way to write a program..." - indeed there is: (1) download files with chromosome sequence, (2) learn about regular expressions in the language of your choice, (3) away you go!

Either Bioperl or Biopython will have methods to do this, but here's a quick Perl guide which assumes that $chrom is a string with the chromosome sequence:

my $motif = "GTGGTGGGCC";
# Report every match with 1-based start and end coordinates.
while ($chrom =~ /$motif/g) {
    print "Found a match from ".($-[0] + 1)." to ".($+[0])."\n";
}

This uses the special Perl arrays @- and @+, which hold the start and end offsets of the last successful match: $-[0] is the zero-based offset of the first matched character and $+[0] is the offset just past the last one. You add one to $-[0] because offsets start from zero, whereas sequence numbering starts from one; $+[0] needs no adjustment. You will also want to alter the print line to give delimited output: chromosome, motif, start, end, strand would be appropriate.

And then you'll want to consider the (-) strand. If you are not using Bioperl, the reverse complement is easy to create:

my $chromrev = reverse($chrom);        # reverse the string (scalar context)
$chromrev =~ tr/ACGTacgt/TGCAtgca/;    # then complement each base

Finally, you'll need to figure out a coordinate system for the (-) strand, remembering the convention that start < end, regardless of strand. I'll leave that as an "exercise for the reader" ;-)
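
One possible take on that exercise, sketched in Python rather than Perl. It assumes the chromosome is already in memory as a plain string and reports 1-based, inclusive coordinates on the forward strand with start < end for both strands. The function names are made up for illustration and do not come from Bioperl or Biopython.

```python
import re

# Translation table for complementing DNA, upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif_both_strands(chrom, motif):
    """Return (start, end, strand) tuples in 1-based forward-strand coordinates.

    Like the Perl /g loop, re.finditer reports non-overlapping matches only.
    """
    n = len(chrom)
    pattern = re.compile(re.escape(motif))
    hits = []

    # (+) strand: 0-based half-open span [s, e) becomes 1-based [s + 1, e].
    for m in pattern.finditer(chrom):
        hits.append((m.start() + 1, m.end(), "+"))

    # (-) strand: search the reverse complement, then map the span back.
    # Index i on the reverse complement corresponds to forward index n - 1 - i,
    # so [s, e) maps to 1-based forward coordinates [n - e + 1, n - s].
    for m in pattern.finditer(reverse_complement(chrom)):
        hits.append((n - m.end() + 1, n - m.start(), "-"))

    return hits


if __name__ == "__main__":
    chrom = "AAGTGGTGGGCCTTGGCCCACCACAA"
    for start, end, strand in find_motif_both_strands(chrom, "GTGGTGGGCC"):
        print(start, end, strand, sep="\t")
```

An equivalent approach is to search the forward strand for the reverse complement of the motif, which avoids building a second copy of the chromosome.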

Nice answer. This tool will always be better than any custom solution or self-made script, and it is already deployed.

And it provides results for download as text, unlike many similar web tools.

If anyone is looking to do this outside Arabidopsis, on any sequence, you can also use the stand-alone PatMatch described in the original article. I have set up a repository from which you can launch, via Binder, an active Jupyter notebook session where it works: https://github.com/fomightez/patmatch-binder .

14.6 years ago

If you are going to write a script to do it (though I recommend following Pierre's advice), a good Python module is TAMO.

It can scan for motifs represented by score matrices, in which you can define multiple bases for each position. For example, you can say that at a given position you expect an A 20% of the time and a G 80% of the time. It can also print sequence logos, and much more.
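
To make the score-matrix idea concrete, here is a minimal sketch in plain Python. It does not use TAMO's API. The function names, the uniform 0.25 background, the pseudocount and the threshold are all assumptions chosen for illustration. Each position is a dictionary of base probabilities, converted to log-odds scores, and every window of the sequence is scored against the matrix.

```python
import math


def log_odds_matrix(probabilities, background=0.25, pseudocount=0.01):
    """Convert per-position base probabilities to log2-odds scores.

    A small pseudocount (an assumption of this sketch) avoids log(0)
    for bases that the motif never shows at a position.
    """
    matrix = []
    for column in probabilities:
        scores = {}
        for base in "ACGT":
            p = (column.get(base, 0.0) + pseudocount) / (1.0 + 4 * pseudocount)
            scores[base] = math.log2(p / background)
        matrix.append(scores)
    return matrix


def scan(seq, matrix, threshold):
    """Return (start, end, score) for windows scoring at least threshold.

    Coordinates are 1-based and inclusive; windows containing anything
    other than A, C, G or T are skipped.
    """
    width = len(matrix)
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in "ACGT" for base in window):
            continue
        score = sum(column[base] for column, base in zip(matrix, window))
        if score >= threshold:
            hits.append((i + 1, i + width, score))
    return hits


if __name__ == "__main__":
    # Position 1: A 20% / G 80%; positions 2 and 3: always T, then C.
    motif = [{"A": 0.2, "G": 0.8}, {"T": 1.0}, {"C": 1.0}]
    for start, end, score in scan("CCGTCAATCGG", log_odds_matrix(motif), 0.0):
        print(start, end, round(score, 2), sep="\t")
```

With this toy matrix, GTC scores higher than ATC because G is the more likely base at the first position, while windows with a mismatch at a fixed position fall below the threshold.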

+1 for pointing to a Python library.

14.6 years ago
brentp 24k

If you're looking for an exact match of that sequence, just using Python strings will be quite fast.
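
A minimal sketch of the plain-string approach, assuming the chromosome has already been read into a single string with no line breaks. The function name find_all is made up for illustration. It repeatedly calls str.find and, unlike a non-overlapping regular-expression scan, also reports overlapping occurrences.

```python
def find_all(seq, motif):
    """Return the 1-based start positions of every occurrence of motif in seq."""
    if not motif:
        raise ValueError("motif must not be empty")
    seq = seq.upper()
    motif = motif.upper()
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i + 1)          # convert 0-based index to 1-based
        i = seq.find(motif, i + 1)       # step by one to catch overlaps
    return positions


if __name__ == "__main__":
    chrom = "AAGTGGTGGGCCAAGTGGTGGGCC"
    print(find_all(chrom, "GTGGTGGGCC"))
```

The end coordinate of each hit is simply start + len(motif) - 1.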

For actual motifs, there's also motility, which is pretty fast and lets you specify IUPAC motifs or position weight matrices.
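
For readers unfamiliar with IUPAC motifs, the idea can be sketched with the standard re module. This is not motility's API. The helper names are hypothetical, and the sketch simply expands each IUPAC ambiguity code into a regular-expression character class.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))


def find_iupac(seq, motif):
    """Return (start, end, matched_text) with 1-based inclusive coordinates.

    re.finditer reports non-overlapping matches only.
    """
    pattern = iupac_to_regex(motif)
    return [(m.start() + 1, m.end(), m.group())
            for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # R matches A or G, so both GTGGTGGGCC and GTGATGGGCC are hits.
    for hit in find_iupac("ttGTGGTGGGCCaaGTGATGGGCC", "GTGRTGGGCC"):
        print(*hit, sep="\t")
```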

Cool! Maybe it is better than TAMO; I don't know whether TAMO has been updated since I last used it (2-3 years ago).

14.6 years ago
Stew ★ 1.4k

You could also look at the RSAT tools, in particular the "genome-scale dna-pattern" tools in the "Pattern Matching" section; they have Arabidopsis there too.

Or you could just grep it
