You can use the following script for that. If you save the script as patt_search.pl, the usage is
perl patt_search.pl fasta_file.fa AATTATA TATA ...
You can give any number of motif sequences, and it will recognize IUPAC DNA ambiguity codes. The output is a bit unusual because I used it as input to another program, but it looks like this:
{"chrX:6362554-6365728",{{"TAATTA"}, {260, 2466, 2875}}, {{"CCCCCCCC"}, {1412}}},
{"chrX:6379561-6405165",{{"TAATTA"}, {275, 776, 1048, 1226, 1722, 2753, 3585, 3644, 4951, 5084, 11164, 12712, 16259, 17695, 18211, 18574, 18745, 19204, 19838, 19859, 21405, 23529, 23740, 24372}}, {{"CCCCCCCC"}, {4536, 5673, 9148, 12449, 14132, 16375, 20132, 20140, 21463, 21471, 21975}}},
Each record contains the FASTA header, followed by each motif searched for and all the locations where that motif was found within the sequence.
The program can be downloaded from https://github.com/Farhat/patt_search
ETA: Now it can handle more complicated DNA character strings like TTA{3,7}T and their corresponding reverse complements.
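For readers who want to see the idea behind the IUPAC handling and the position lists, here is a minimal Python sketch. It is not the code of patt_search.pl (which is Perl); the function names `motif_to_regex` and `find_positions` are made up for illustration. Reporting 1-based start positions and counting overlapping matches are both assumptions of this sketch, not confirmed behaviour of the script.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Expand IUPAC letters; digits, braces and commas pass through unchanged."""
    return "".join(IUPAC.get(ch, ch) for ch in motif.upper())


def find_positions(sequence, motif):
    """Return start positions of all matches of the motif in the sequence.

    Assumptions of this sketch: positions are 1-based, and overlapping
    matches are reported (the lookahead makes each match zero-width).
    """
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]
```

For example, `find_positions("GGTAATTAGGTAATTA", "TAATTA")` gives `[3, 11]`. Dropping the `(?=...)` lookahead would report non-overlapping matches only.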
Thanks! You saved me two days at least! :) Does it do rev/comp too?
Yes, it will search for reverse complements too. You can also use IUPAC ambiguity codes and N to match any base.
Does this code also find patterns like ACA{0,7}TG, so that ACAAAAAAATG, ACAATG, and ACAAAATG in the input stream would be detected? And does N for {A or T or G or C} also work?
As an extension, I would like to ask if it is possible to read a multi-FASTA file with the given header. It would be of great help, I can get that done!!
PS: I am not a Perl person yet ;) I would love to use the code just as it is and format the output to my needs (basically a BED file), if it works!!
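In case it helps with the BED conversion, here is a rough Python sketch of reading a multi-FASTA input and writing one BED6 line per hit. It is independent of patt_search.pl and does not parse its output; `read_fasta` and `hits_to_bed` are hypothetical helper names. It assumes the first word of each FASTA header can serve as the BED chrom field, searches the forward strand only, and takes a ready-made regular expression rather than IUPAC codes.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of multi-FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Assumption: the first word of the header is used as the name.
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hits_to_bed(lines, regex, name):
    """Return BED6 lines for forward-strand, non-overlapping matches.

    BED coordinates are 0-based with an exclusive end, which is exactly
    what match.start() and match.end() provide.
    """
    pattern = re.compile(regex)
    bed = []
    for header, seq in read_fasta(lines):
        for m in pattern.finditer(seq.upper()):
            bed.append(f"{header}\t{m.start()}\t{m.end()}\t{name}\t0\t+")
    return bed
```

With a real file this could be called as `hits_to_bed(open("fasta_file.fa"), "ACA{0,7}TG", "ACA{0,7}TG")`. Note that the coordinates are relative to each FASTA record, so a header such as chrX:6362554-6365728 would still need its offset added to get genomic positions.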
No, it will not work for general regular expressions. The expansion for N isn't supported but it is a minor change. I'll edit the program to include that.
Thank you very much!!
Just for the record, DNA pattern matching with some advanced options is available here as part of the RSAT tool. However, one cannot integrate it into an analysis pipeline. I would like that... :)
I was actually hoping I could extend this script a bit to find character repetitions like I mentioned above, i.e., ACA\{0,7\}TG to find ACAAAAAAATG, ACAATG, and so on. I added
$patt =~ s/\d+/$&/g;
to the replace_ambiguous subroutine before the return statement, and changed the reverse complement a bit to
$revcomp =~ tr/ACGTacgt[]{}N/TGCAtgca][}{./;
to accommodate the braces { }. The problem is what I end up searching for in the FASTA file on the reverse strand.
E.g., input argument:
CR\{7,10\}N\{5,8\}ATGC
Generated Forward Strand Look Up:
C[AG]{7,10}[ACGT]{5,8}ATGC
Generated Reverse strand:
GCAT{8,5}[ACGT]{01,7}[CT]G
The reverse complement string is the problem. From my limited knowledge, I don't think there is an easy way to do it. Maybe you can help me achieve this?
This is indeed a bit more complicated, but it can be solved with regular expressions. You can download the modified program from GitHub: https://github.com/Farhat/patt_search
You will have to enclose your patterns in quotes when using it on the command line to prevent the shell from parsing the braces.
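To illustrate why a plain character-by-character reversal breaks (it also reverses the digits and detaches each {m,n} from its base), here is one possible approach as a Python sketch. It is not necessarily how the modified Perl program does it, and `revcomp_pattern` is a made-up name. The idea is to split the already expanded pattern into units, where a unit is a single base or a bracketed class plus its optional quantifier, then complement each unit and reverse the order of the units. The sketch assumes only A, C, G, T, bracketed classes of those letters, and {m}, {m,} or {m,n} quantifiers appear in the pattern.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")

# One unit: a bracketed class or a single base, optionally followed by
# a quantifier of the form {m}, {m,} or {m,n}.
UNIT = re.compile(r"(\[[ACGT]+\]|[ACGT])(\{\d+(?:,\d*)?\})?")


def revcomp_pattern(pattern):
    """Reverse-complement an expanded DNA regex, keeping quantifiers attached."""
    units, pos = [], 0
    while pos < len(pattern):
        m = UNIT.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}: {pattern[pos:]!r}")
        atom, quantifier = m.group(1), m.group(2) or ""
        # Complement the bases; brackets are untouched, the quantifier stays as is.
        units.append(atom.translate(COMPLEMENT) + quantifier)
        pos = m.end()
    # Reverse the order of whole units, not of individual characters.
    return "".join(reversed(units))
```

For the example above, `revcomp_pattern("C[AG]{7,10}[ACGT]{5,8}ATGC")` returns `GCAT[TGCA]{5,8}[TC]{7,10}G`, which keeps {7,10} and {5,8} intact and attached to the right classes.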