I want to find the coordinates of all occurrences of the sequence recognized by a restriction enzyme. I know that using EMBOSS I may do this, but this task seems perfectly fitted for short-read sequence alignment software. However, I didn't find any reference for this.
I used bwa for the task and quickly obtained some results. However, to be on the safe side I will like to ask is someone has done something similar or has some advice, perhaps I am stretching the use of bwa.
I tried the following:
echo -e ">DpnII\bGATC" > DnpII.fa
bwa aln -N -n 0 -o 0 -e 0 -l 4 -k 0 dm3 DpnII.fa > DpnII.sai
bwa samse -n 100000000 dm3 DpnII.sai DpnII.fa > DpnII.sam
The results seem to match the right sites, also the number of sites (489570 for Drosophila) that I obtain are close to what I expect.
Thanks.
Why don't you simply use an existing solution like remap? The prime advantage being: remap knows all the common restriction enzyme sequences already. And it is well tested. The Gbrowse plugin on the other hand is very comfortable to use when looking at a certain region of the genome.
At the time I needed something that I could add to our set of tools to process and analyze Hi-C data. Having Emboss or GBrowse2 as dependencies just to find restriction sites was not worth (BWA was already a dependency). The solution using bio-python is so simple that I didn't mind re-implementing it.
Did you check the result then, because I think there is an error in the code with the reverse complement?
I hope there is no bug. I tested the results and so far the only problem I found was with upper/lower case DNA strings that I initially overlooked. What would be your concern with the reverse complement?
don't you need to reverse the complement also? I think you should use the reverse_complement function over complement.
That line returns the reverse complement of the motif (referred as
pattern
)Then, the motif and reverse complement of the motif are searched on the DNA sequence on the fasta file. Finally, the resulting bed file is sorted.
The motif can be a regexp (e.g. AC.GT) and the reverse complement function of the motif only changes the instances of the letters.
I don't think so, why else is there a function reverse_complement in BioSeq. Your test pattern is a bit misleading, because the reverse complement of ACGT is ACGT (not TGCA) !
You are right! Should be the
reverse_complement()
function. Because all restriction enzymes that I know are palindromic I always get the correct answer in my tests. This is so embarrassing :(Well, I checked on the palindromic feature, I found it mentioned that most but not all restriction enzymes cut at palindromic sites. If your sites were palindromes, then you are probably safe with the data generated. I found this list of non-palindromic enzymes: https://www.neb.com/tools-and-resources/selection-charts/enzymes-with-nonpalindromic-sequences
I think that a lot of such 'semi-errors' harbor in random code written from scratch by anyone. This is my argument against writing any new code for these standard problems if avoidable.
I just updated the github page to use the reverse_complement. I tested the new code with one instance of a non-palindromic enzyme and compared the results with Emboss remap. Thanks for pointing out the error.