Hi,
I want to find a particular motif (GTGGTGGGCC) in the whole Arabidopsis thaliana genome. Is there any way to write a program for this in Perl/Python?
Since you ask "Is there a way to write a program..." - indeed there is: (1) download files with chromosome sequence, (2) learn about regular expressions in the language of your choice, (3) away you go!
Either Bioperl or Biopython will have methods to do this, but here's a quick Perl guide which assumes that $chrom is a string with the chromosome sequence:
my $motif = "GTGGTGGGCC";
# with /g in scalar context, each iteration resumes after the previous match
while ($chrom =~ /$motif/g) {
    print "Found a match from ".($-[0]+1)." to ".($+[0])."\n";
}
This uses the special Perl variables @- and @+, arrays holding the offsets of the start and end of the last successful match, respectively. You add one to $-[0] since offsets start from zero, whereas sequence numbering starts from one; $+[0] needs no adjustment because it is the offset just past the last matched base. Also, you'd want to alter the print line to give you delimited output: chromosome, motif, start, end, strand would be appropriate.
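For readers who prefer Python, the same loop with delimited output might look like the sketch below. The function name `find_motif`, the label `Chr1` and the toy sequence are illustrative assumptions, not part of the original answer. Like Perl's /g, `re.finditer` reports non-overlapping matches.

```python
import re

def find_motif(chrom_name, chrom_seq, motif, strand="+"):
    """Yield (chromosome, motif, start, end, strand) with 1-based, inclusive coordinates."""
    for m in re.finditer(re.escape(motif), chrom_seq):
        # m.start() is 0-based, so add 1; m.end() is exclusive, so it already
        # equals the 1-based position of the last matched base.
        yield (chrom_name, motif, m.start() + 1, m.end(), strand)

if __name__ == "__main__":
    toy = "AAGTGGTGGGCCTTGTGGTGGGCC"  # made-up sequence, not real Arabidopsis data
    for hit in find_motif("Chr1", toy, "GTGGTGGGCC"):
        # one tab-delimited line per match
        print("\t".join(str(field) for field in hit))
```

If the genome file is soft-masked (lower-case repeats), upper-casing the sequence first or passing `re.IGNORECASE` would be one way to avoid missing matches.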
And then you'll want to consider the (-) strand. If you are not using Bioperl, the reverse complement is easy to create:
my $chromrev = reverse($chrom);
$chromrev =~ tr/ACGTacgt/TGCAtgca/;
Finally, you'll need to figure out a coordinate system for the (-) strand, remembering the convention that start < end, regardless of strand. I'll leave that as an "exercise for the reader" ;-)
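One possible solution to that exercise, sketched in Python: search the reverse complement, then map each hit back to forward-strand, 1-based coordinates so that start < end (the ordering required by formats such as GFF). The helper names `reverse_complement` and `minus_strand_hits` are made up for this illustration, and only A/C/G/T are complemented, as in the Perl `tr` above.

```python
import re

# Same mapping as the Perl tr/ACGTacgt/TGCAtgca/ above; other symbols pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def minus_strand_hits(chrom_seq, motif):
    """Return (start, end) pairs for (-) strand matches in forward-strand,
    1-based, inclusive coordinates with start < end."""
    length = len(chrom_seq)
    rev = reverse_complement(chrom_seq)
    hits = []
    for m in re.finditer(re.escape(motif), rev):
        # 0-based offset i on the reverse complement corresponds to
        # 1-based forward position (length - i).
        start = length - m.end() + 1
        end = length - m.start()
        hits.append((start, end))
    return sorted(hits)

if __name__ == "__main__":
    # Toy forward sequence carrying the reverse complement of GTGGTGGGCC
    toy = "AAGGCCCACCACTT"
    for start, end in minus_strand_hits(toy, "GTGGTGGGCC"):
        print("Chr1", "GTGGTGGGCC", start, end, "-", sep="\t")
```

An alternative with the same result is to search the forward strand for the reverse complement of the motif, which avoids copying the whole chromosome.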
If anyone is looking to do this outside Arabidopsis, on any sequence, you can also use the stand-alone PatMatch described in the original article. I have set up a repo from which you can launch, via Binder, a live Jupyter notebook session where it works: https://github.com/fomightez/patmatch-binder .
If you are going to write a script to do it (I recommend following Pierre's advice), a good Python module is TAMO.
It can scan for motifs represented by score matrices, where you can define multiple bases at each position. For example, you can say that you expect an A 20% of the time and a G 80% of the time. Moreover, it can print sequence logos and much more.
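To make the idea of a score matrix concrete, here is a minimal plain-Python sketch. It does not use TAMO and is not TAMO's API; the function names, the uniform 0.25 background, the pseudocount and the threshold are all assumptions chosen for illustration. Each position holds base probabilities (for example 20% A and 80% G), which are converted to log-odds scores and summed over a sliding window.

```python
import math

def log_odds_matrix(prob_rows, background=0.25, pseudo=0.01):
    """Turn per-position base probabilities into log2-odds scores.
    A small pseudocount (an assumption here) avoids log(0) for unseen bases."""
    matrix = []
    for row in prob_rows:
        scores = {}
        for base in "ACGT":
            p = (row.get(base, 0.0) + pseudo) / (1.0 + 4 * pseudo)
            scores[base] = math.log2(p / background)
        matrix.append(scores)
    return matrix

def scan(seq, matrix, threshold):
    """Return (start, end, score) for every window scoring at least threshold;
    coordinates are 1-based and inclusive."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width].upper()
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i + 1, i + width, round(score, 2)))
    return hits

if __name__ == "__main__":
    # Toy 3-column motif: G, then A (20%) or G (80%), then C
    rows = [{"G": 1.0}, {"A": 0.2, "G": 0.8}, {"C": 1.0}]
    for hit in scan("TTGGCTTGACTTGTC", log_odds_matrix(rows), threshold=3.0):
        print(hit)
```

With these toy numbers the window GGC scores higher than GAC, and GTC falls below the threshold, which is the behaviour a weighted position is meant to capture.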
If you're looking for an exact match of that sequence, just using Python strings will be quite fast.
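A rough sketch of the plain-string approach follows, assuming the chromosomes come as FASTA text. The helper names `read_fasta` and `find_all` are invented for this example. `str.find` is called repeatedly, advancing by one base so that overlapping occurrences would also be reported.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into {record_name: sequence}.
    Sequences are upper-cased so soft-masked regions still match."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}

def find_all(seq, motif):
    """Return the 1-based start of every occurrence, overlapping ones included."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i + 1)
        i = seq.find(motif, i + 1)
    return positions

if __name__ == "__main__":
    # Toy FASTA lines standing in for a downloaded chromosome file
    toy = [">Chr1 toy record", "aagtggtg", "ggccttgt", "ggtgggcc"]
    for name, seq in read_fasta(toy).items():
        for start in find_all(seq, "GTGGTGGGCC"):
            print(name, start, start + len("GTGGTGGGCC") - 1, sep="\t")
```

For a real file, an open file handle can be passed to `read_fasta` in place of the list, since it only needs an iterable of lines.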
For actual motifs, there's also motility, which is pretty fast and lets you specify IUPAC motifs or position weight matrices.
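As an illustration of what an IUPAC motif means in practice, the sketch below translates IUPAC ambiguity codes into a regular expression using only the standard library. It is not motility's API; `iupac_to_regex`, `find_iupac` and the example motif GTGRTGGGCC are made up for this example.

```python
import re

# Standard IUPAC nucleotide ambiguity codes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Compile an IUPAC motif into a regular expression over A/C/G/T."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

def find_iupac(seq, motif):
    """Return (start, end, matched_text) with 1-based, inclusive coordinates."""
    pattern = iupac_to_regex(motif)
    return [(m.start() + 1, m.end(), m.group()) for m in pattern.finditer(seq)]

if __name__ == "__main__":
    # R matches A or G, so both toy sites below are reported
    toy = "AAGTGGTGGGCCTTGTGATGGGCC"
    for hit in find_iupac(toy, "GTGRTGGGCC"):
        print(hit)
```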
The reply is "of course" :)