Dear all
I need to create a bed file with all restriction sites of a genome in FASTA. The bed table need to have in parallel the result for three different enzymes and it is necessary to put the strand of site.
An example of an extract table is:
chr18 10786 10790 CCGC 1000 +
chr18 10790 10794 CCGC 1000 +
chr18 10822 10826 CCGC 1000 +
chr18 10828 10832 CCGC 1000 +
chr18 10830 10834 GCGG 1000 -
chr18 10947 10951 GCGG 1000 -
chr18 10976 10980 GCGG 1000 -
chr18 11005 11009 GCGG 1000 -
chr18 11034 11038 GCGG 1000 -
chr18 11063 11067 GCGG 1000 -
chr18 11098 11102 GCGG 1000 -
chr18 11133 11137 GCGG 1000 -
chr18 11255 11259 GCGG 1000 -
chr18 11399 11403 CCGC 1000 +
chr18 11430 11434 CCGC 1000 +
chr18 11738 11742 GCGG 1000 -
chr18 11760 11764 CCGC 1000 +
I am using a script written by Ryan Dale for CGAT modules but it generate an incomplete table for my species.
scaffold_1 571 575
scaffold_1 2276 2280
scaffold_1 2610 2614
scaffold_1 2632 2636
scaffold_1 2639 2643
scaffold_1 2670 2674
scaffold_1 2773 2777
scaffold_1 2860 2864
scaffold_1 4544 4548
scaffold_1 5764 5768
scaffold_1 7321 7325
scaffold_1 7603 7607
How could complete the script (see below) to generate the table I that I wish?
Best and I hope your suggestions
#!/usr/bin/python
usage ="""
Makes a BED file of the cut sites of a specified restriction enzyme.
Example usage:
# Get BED file of DpnI sites in dm3.fa
python restriction-finder.py --fasta dm3.fa --enzyme DpnI --bed DpnI-sites.bed
# can pipe to BedTools to get, e.g, sites in genes::
python restriction-finder.py --fasta myfasta.fa --enzyme DpnI | intersectBed -a stdin -b genes.bed > DpnI-in-genes.bed
Created 13 Aug 2010 by Ryan Dale"""
try:
from Bio import SeqIO
from Bio import Restriction
except ImportError:
sys.stderr.write("\nPlease install BioPython to use this script <http://biopython.org/wiki/Biopython>\n")
import optparse
import sys
import os
op = optparse.OptionParser(usage=usage)
op.add_option('--fasta', help='Required FASTA file containing sequences to search')
op.add_option('--enzyme', help='Required enzyme name, case sensitive (e.g., DpnI or EcoRI)')
op.add_option('--bed',help='BED file to create. If not specified, output will print to stdout.')
options,args = op.parse_args()
# Input error checking...
def err(s):
op.print_help()
sys.stderr.write('\n***ERROR: %s***\n\n'%s)
sys.exit(1)
# Hack to import just the enzyme you want from the Restriction module
if options.enzyme is None:
err('Please specify an enzyme with --enzyme')
if options.fasta is None:
err('Please specify a FASTA file with --fasta')
try:
exec('from Bio.Restriction import %s as restr' % options.enzyme)
except ImportError:
err('No restriction enzyme "%s" found!' % options.enzyme)
if not os.path.exists(options.fasta):
err('FASTA file %s not found'%options.fasta)
if options.bed is None:
fout = sys.stdout
else:
fout = open(options.bed,'w')
# Let BioPython do the work...
parser = SeqIO.parse(options.fasta,'fasta')
for chrom in parser:
sys.stderr.writechrom.name+'\n')
hits = restr.search(chrom.seq)
for hit in hits:
values = [chrom.name,
str(hit),
str(hit+4)]
fout.write('\t'.join(values)+'\n')
fout.flush()
if options.bed is not None:
fout.close()
I think I've fixed all of your code formatting. You can denote something as code be beginning a line with 4 spaces. Particularly with python, it's tough to follow something otherwise.
Have you read the relevant section of the biopython cookbook?
the "strand of site" for a palindromic sequence ?
Not all restriction enzymes are palindromic :)
that's why I said, "palindromic" ;-)
All my enzymes are not palindromic; HpaII AciI HinGI
I see. OK.
see also In silico genome digestion ?/
My problem is here in the script:
I don't know how to incorporate the enzyme name and strand of site in the bed table.
You'll want to create a RestrictionBatch. When you search with that, everytime you run the search you'll get a dict returned, so you'll know which enzyme is doing the cutting. For the strand, you'll have to manually look as I don't think the biopython methods return that (they might assume palindromic sequences, I don't know).