7N Motif Search Over The Genome
12.6 years ago
PoGibas 5.1k

I have very short motifs (microRNA target sequences).

I want to test for enrichment of those motifs in my DNA sequence and in many randomly simulated genome sequences of the same length. (I will convert RNA to DNA before the search.)

Which approach or tool should I use for such a short motif search?

I have heard about Vmatch, but is there free software that can do this?

Really looking forward to your answers and suggestions...

PS: Any simple Perl script (search for motif x within y.fa) would also work fine.
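For illustration, the enrichment comparison described in this question could be sketched roughly as follows in Python. This is only a minimal sketch under stated assumptions: the "random" sequences are made by shuffling the bases of the input (which keeps length and base composition), only the forward strand is counted, and all function names (count_motif, shuffle_seq, empirical_enrichment) are hypothetical. A different null model, for example one preserving dinucleotide frequencies, may be more appropriate for real data.

```python
import random
import re

def count_motif(seq, motif):
    """Count occurrences of motif in seq, including overlapping ones."""
    # A zero-width lookahead lets matches overlap (AAA is found twice in AAAA).
    return len(re.findall("(?=" + re.escape(motif) + ")", seq))

def shuffle_seq(seq, rng):
    """Return a random permutation of seq (same length, same base composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def empirical_enrichment(seq, motif, n_sim=1000, seed=0):
    """Compare the observed motif count with counts in n_sim shuffled sequences."""
    rng = random.Random(seed)
    observed = count_motif(seq, motif)
    simulated = [count_motif(shuffle_seq(seq, rng), motif) for _ in range(n_sim)]
    # Add-one correction so the estimated p-value is never exactly zero.
    n_at_least = sum(1 for c in simulated if c >= observed)
    p_value = (n_at_least + 1) / (n_sim + 1)
    mean_simulated = sum(simulated) / n_sim
    return observed, mean_simulated, p_value
```

Calling empirical_enrichment(my_seq, "TATA") returns the observed count, the mean count over the shuffled sequences, and an empirical one-sided p-value.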

motif • 4.7k views
12.6 years ago
Farhat ★ 2.9k

You can use the following script for that. If you save the script as patt_search.pl, the usage is perl patt_search.pl fasta_file.fa AATTATA TATA ... You can give any number of motif sequences, and it recognizes IUPAC DNA ambiguity codes. The output format is a bit unusual because I used it as input to another program, but it looks like this:

{"chrX:6362554-6365728",{{"TAATTA"}, {260, 2466, 2875}}, {{"CCCCCCCC"}, {1412}}},
{"chrX:6379561-6405165",{{"TAATTA"}, {275, 776, 1048, 1226, 1722, 2753, 3585, 3644, 4951, 5084, 11164, 12712, 16259, 17695, 18211, 18574, 18745, 19204, 19838, 19859, 21405, 23529, 23740, 24372}}, {{"CCCCCCCC"}, {4536, 5673, 9148, 12449, 14132, 16375, 20132, 20140, 21463, 21471, 21975}}},

Each record contains the FASTA header, followed by each motif searched for and all the positions at which that motif was found within the sequence. The program can be downloaded from https://github.com/Farhat/patt_search

ETA: It can now handle more complicated DNA patterns such as TTA{3,7}T, along with their corresponding reverse complements.

ADD COMMENT
0
Entering edit mode

Thanks! You saved me two days at least! :) Does it do rev/comp too?

ADD REPLY
0
Entering edit mode

Yes, it will search for reverse complements too. You can also use IUPAC ambiguity codes and N to match any base.

ADD REPLY
0
Entering edit mode

Does this code also handle patterns like ACA{0,7}TG, so that ACAAAAAAATG, ACAATG and ACAAAATG in the input would all be detected? And does N for {A or T or G or C} also work?

As an extension, I would like to ask whether it is possible to read a multi-FASTA file and keep the header of each record. It would be a great help, I could then get this done!!

PS: I am not a Perl person yet ;) I would love to use the code just as it is and format the output to my needs (basically a BED file), if it works!!
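The reformatting to BED mentioned here could look roughly like the Python sketch below. It rests on several assumptions that are not confirmed anywhere in this thread: that reported hit positions are 1-based, that headers of the form chrX:6362554-6365728 give 1-based inclusive region coordinates, and that every hit has the same length as the motif (which does not hold for patterns with {m,n} repeats). The function names are hypothetical.

```python
def parse_region_header(header):
    """Split a header such as 'chrX:6362554-6365728' into (chrom, start, end)."""
    chrom, span = header.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)

def hits_to_bed(chrom, positions, motif_len, name, offset=0):
    """Convert 1-based hit starts to BED lines (0-based start, exclusive end).

    offset is the 0-based genomic coordinate of the first base of the
    searched sequence; leave it at 0 for whole-chromosome records.
    """
    lines = []
    for pos in positions:
        start = offset + pos - 1
        end = start + motif_len
        lines.append(f"{chrom}\t{start}\t{end}\t{name}")
    return lines
```

For a region record, offset would be the region start minus one, for example hits_to_bed(chrom, [260], 6, "TAATTA", offset=region_start - 1) after chrom, region_start, _ = parse_region_header("chrX:6362554-6365728").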

ADD REPLY
1
Entering edit mode

No, it will not work for general regular expressions. The expansion for N isn't supported but it is a minor change. I'll edit the program to include that.

ADD REPLY
0
Entering edit mode

Thank you very much!!

Just for the record, DNA pattern matching with some advanced options is available as part of the RSAT tools. However, one cannot integrate it into an analysis pipeline. I would like that... :)

ADD REPLY
0
Entering edit mode

I was actually hoping I could extend this script a bit to find character repetitions like I mentioned above, i.e., ACA\{0,7\}TG to find ACAAAAAAATG, ACAATG and so on....

I added $patt =~ s/\d+/$&/g; to replace_ambiguous subroutine before the return statement.

I changed the reverse complement step a bit, to $revcomp =~ tr/ACGTacgt[]{}N/TGCAtgca][}{./; to accommodate the braces { }.

The problem is the pattern I end up searching for on the reverse strand.

Eg., Input in argument : CR\{7,10\}N\{5,8\}ATGC

Generated Forward Strand Look Up: C[AG]{7,10}[ACGT]{5,8}ATGC

Generated Reverse strand: GCAT{8,5}[ACGT]{01,7}[CT]G

The reverse complement string is the problem.... With my limited knowledge, I don't think there is an easy way to do it... Maybe you can help me achieve this???

ADD REPLY
1
Entering edit mode

This is indeed a bit more complicated, but it can be solved with regular expressions. You can download the modified program at https://github.com/Farhat/patt_search You will have to enclose your patterns in quotes when using it on the command line to prevent the shell from parsing the braces.
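To show why reversing the string character by character fails and what a fix can look like, here is a minimal Python sketch. It is not the code in the modified patt_search.pl, only one possible approach: split the expanded pattern into units (a base or a bracketed class, plus its optional {m,n} quantifier), complement each unit, and reverse the order of the units so that every quantifier stays attached to its own element. The name revcomp_pattern is hypothetical, and only the syntax shown in this thread is supported.

```python
import re

COMP = str.maketrans("ACGT", "TGCA")
# One unit = a single base or a bracketed class, plus an optional {m}, {m,} or {m,n}.
UNIT = re.compile(r"(\[[ACGT]+\]|[ACGT])(\{\d+(?:,\d*)?\})?")

def revcomp_pattern(pattern):
    """Reverse-complement an expanded pattern such as C[AG]{7,10}ATGC."""
    units = []
    pos = 0
    while pos < len(pattern):
        m = UNIT.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}: {pattern[pos:]!r}")
        element, quantifier = m.group(1), m.group(2) or ""
        if element.startswith("["):
            # Complement the class members; sorting gives a stable spelling.
            element = "[" + "".join(sorted(element[1:-1].translate(COMP))) + "]"
        else:
            element = element.translate(COMP)
        units.append(element + quantifier)
        pos = m.end()
    # Reverse whole units, not characters, so {7,10} never becomes {01,7}.
    return "".join(reversed(units))

print(revcomp_pattern("C[AG]{7,10}[ACGT]{5,8}ATGC"))
# prints GCAT[ACGT]{5,8}[CT]{7,10}G
```

Compared with the reverse-strand string generated above, GCAT{8,5}[ACGT]{01,7}[CT]G, the quantifiers here keep their digit order and follow the element they belong to.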

ADD REPLY
