Question

How To Use Bwa To Find Restriction Enzyme Coordinates In A Genome

2

Entering edit mode

12.8 years ago

Fidel ★ 2.0k

I want to find the coordinates of all occurrences of the sequence recognized by a restriction enzyme. I know that using EMBOSS I may do this, but this task seems perfectly fitted for short-read sequence alignment software. However, I didn't find any reference for this.

I used bwa for the task and quickly obtained some results. However, to be on the safe side I will like to ask is someone has done something similar or has some advice, perhaps I am stretching the use of bwa.

I tried the following:

echo -e ">DpnII\bGATC" > DnpII.fa
bwa aln -N -n 0 -o 0 -e 0 -l 4 -k 0 dm3 DpnII.fa > DpnII.sai
bwa samse -n 100000000 dm3 DpnII.sai DpnII.fa > DpnII.sam

The results seem to match the right sites, also the number of sites (489570 for Drosophila) that I obtain are close to what I expect.

Thanks.

bwa enzyme • 5.2k views

ADD COMMENT • link 7.9 years ago by Fidel ★ 2.0k

1

Entering edit mode

12.8 years ago

Michael 55k

Finding these sites is an exact matching problem, so anything that can do exact matching should work. However there are some established tools, that you could use as well. I would at least compare the results to what remap delivers.

GBrowse2 contains a plug-in for computing and visualizing restriction enzyme sites. It is activated in the standard install but unfortunately I didn't see it in flybase. You would therefore have to download the genome sequence and put it in a local instance.

EMBOSS has some tools too like remap. There is also a web interface to remap here.

ADD COMMENT • link 12.8 years ago by Michael 55k

score 2 · Accepted Answer · 2017-01-10

2

Entering edit mode

7.9 years ago

Fidel ★ 2.0k

Here is an update for people landing on this question: The bwa method that I proposed does not work properly. I don't remember exactly why it was wrong but I stop using it. I think it was returning regions with NNNs.

My solution was to use Biopython to search for a pattern. This is quite fast and accurate. Here is the code that finds and sorts the results:

https://github.com/maxplanck-ie/HiCExplorer/blob/master/hicexplorer/findRestSite.py

[update] Based on the comments by Michael Dondrup I fixed an error in the code.

ADD COMMENT • link 7.9 years ago by Fidel ★ 2.0k

0

Entering edit mode

Why don't you simply use an existing solution like remap? The prime advantage being: remap knows all the common restriction enzyme sequences already. And it is well tested. The Gbrowse plugin on the other hand is very comfortable to use when looking at a certain region of the genome.

ADD REPLY • link 7.9 years ago by Michael 55k

0

Entering edit mode

At the time I needed something that I could add to our set of tools to process and analyze Hi-C data. Having Emboss or GBrowse2 as dependencies just to find restriction sites was not worth (BWA was already a dependency). The solution using bio-python is so simple that I didn't mind re-implementing it.

ADD REPLY • link 7.9 years ago by Fidel ★ 2.0k

0

Entering edit mode

Did you check the result then, because I think there is an error in the code with the reverse complement?

ADD REPLY • link 7.9 years ago by Michael 55k

0

Entering edit mode

I hope there is no bug. I tested the results and so far the only problem I found was with upper/lower case DNA strings that I initially overlooked. What would be your concern with the reverse complement?

ADD REPLY • link 7.9 years ago by Fidel ★ 2.0k

0

Entering edit mode

rev_compl = str(Seq(pattern, generic_dna).complement())

don't you need to reverse the complement also? I think you should use the reverse_complement function over complement.

ADD REPLY • link 7.9 years ago by Michael 55k

0

Entering edit mode

That line returns the reverse complement of the motif (referred as pattern)

pattern = "ACGT" 
str(Seq(pattern, generic_dna).complement())
'TGCA'

Then, the motif and reverse complement of the motif are searched on the DNA sequence on the fasta file. Finally, the resulting bed file is sorted.

The motif can be a regexp (e.g. AC.GT) and the reverse complement function of the motif only changes the instances of the letters.

ADD REPLY • link 7.9 years ago by Fidel ★ 2.0k

0

Entering edit mode

I don't think so, why else is there a function reverse_complement in BioSeq. Your test pattern is a bit misleading, because the reverse complement of ACGT is ACGT (not TGCA) !

ADD REPLY • link 7.9 years ago by Michael 55k

1

Entering edit mode

You are right! Should be the reverse_complement() function. Because all restriction enzymes that I know are palindromic I always get the correct answer in my tests. This is so embarrassing :(

ADD REPLY • link 7.9 years ago by Fidel ★ 2.0k

1

Entering edit mode

Well, I checked on the palindromic feature, I found it mentioned that most but not all restriction enzymes cut at palindromic sites. If your sites were palindromes, then you are probably safe with the data generated. I found this list of non-palindromic enzymes: https://www.neb.com/tools-and-resources/selection-charts/enzymes-with-nonpalindromic-sequences

I think that a lot of such 'semi-errors' harbor in random code written from scratch by anyone. This is my argument against writing any new code for these standard problems if avoidable.

ADD REPLY • link 7.9 years ago by Michael 55k

1

Entering edit mode

I just updated the github page to use the reverse_complement. I tested the new code with one instance of a non-palindromic enzyme and compared the results with Emboss remap. Thanks for pointing out the error.

ADD REPLY • link 7.9 years ago by Fidel ★ 2.0k