Genomic Restriction Finder
1
1
Entering edit mode
11.0 years ago
rhasbunz ▴ 30

Dear all

I need to create a bed file with all restriction sites of a genome in FASTA. The bed table need to have in parallel the result for three different enzymes and it is necessary to put the strand of site.

An example of an extract table is:

chr18 10786 10790 CCGC 1000 +
chr18 10790 10794 CCGC 1000 +
chr18 10822 10826 CCGC 1000 +
chr18 10828 10832 CCGC 1000 +
chr18 10830 10834 GCGG 1000 -
chr18 10947 10951 GCGG 1000 -
chr18 10976 10980 GCGG 1000 -
chr18 11005 11009 GCGG 1000 -
chr18 11034 11038 GCGG 1000 -
chr18 11063 11067 GCGG 1000 -
chr18 11098 11102 GCGG 1000 -
chr18 11133 11137 GCGG 1000 -
chr18 11255 11259 GCGG 1000 -
chr18 11399 11403 CCGC 1000 +
chr18 11430 11434 CCGC 1000 +
chr18 11738 11742 GCGG 1000 -
chr18 11760 11764 CCGC 1000 +

I am using a script written by Ryan Dale for CGAT modules but it generate an incomplete table for my species.

scaffold_1    571    575
scaffold_1    2276    2280
scaffold_1    2610    2614
scaffold_1    2632    2636
scaffold_1    2639    2643
scaffold_1    2670    2674
scaffold_1    2773    2777
scaffold_1    2860    2864
scaffold_1    4544    4548
scaffold_1    5764    5768
scaffold_1    7321    7325
scaffold_1    7603    7607

How could complete the script (see below) to generate the table I that I wish?

Best and I hope your suggestions


#!/usr/bin/python
usage ="""

Makes a BED file of the cut sites of a specified restriction enzyme.

Example usage:

# Get BED file of DpnI sites in dm3.fa
python restriction-finder.py --fasta dm3.fa --enzyme DpnI --bed DpnI-sites.bed

# can pipe to BedTools to get, e.g, sites in genes::
python restriction-finder.py --fasta myfasta.fa --enzyme DpnI | intersectBed -a stdin -b genes.bed > DpnI-in-genes.bed


Created 13 Aug 2010 by Ryan Dale"""
try:
    from Bio import SeqIO
    from Bio import Restriction
except ImportError:
    sys.stderr.write("\nPlease install BioPython to use this script <http://biopython.org/wiki/Biopython>\n")
import optparse
import sys
import os

op = optparse.OptionParser(usage=usage)
op.add_option('--fasta', help='Required FASTA file containing sequences to search')
op.add_option('--enzyme', help='Required enzyme name, case sensitive (e.g., DpnI or EcoRI)')
op.add_option('--bed',help='BED file to create. If not specified, output will print to stdout.')
options,args = op.parse_args()

# Input error checking...
def err(s):
    op.print_help()
    sys.stderr.write('\n***ERROR: %s***\n\n'%s)
    sys.exit(1)
# Hack to import just the enzyme you want from the Restriction module
if options.enzyme is None:
    err('Please specify an enzyme with --enzyme')
if options.fasta is None:
    err('Please specify a FASTA file with --fasta')
try:
    exec('from Bio.Restriction import %s as restr' % options.enzyme)
except ImportError:
    err('No restriction enzyme "%s" found!' % options.enzyme)

if not os.path.exists(options.fasta):
    err('FASTA file %s not found'%options.fasta)

if options.bed is None:
    fout = sys.stdout
else:
    fout = open(options.bed,'w')


# Let BioPython do the work...
parser = SeqIO.parse(options.fasta,'fasta')
for chrom in parser:
    sys.stderr.writechrom.name+'\n')
    hits = restr.search(chrom.seq)
    for hit in hits:
        values = [chrom.name,
                  str(hit),
                  str(hit+4)]
        fout.write('\t'.join(values)+'\n')
        fout.flush()
if options.bed is not None:
    fout.close()
fasta • 11k views
ADD COMMENT
0
Entering edit mode

I think I've fixed all of your code formatting. You can denote something as code be beginning a line with 4 spaces. Particularly with python, it's tough to follow something otherwise.

ADD REPLY
0
Entering edit mode

Have you read the relevant section of the biopython cookbook?

ADD REPLY
0
Entering edit mode

the "strand of site" for a palindromic sequence ?

ADD REPLY
0
Entering edit mode

Not all restriction enzymes are palindromic :)

ADD REPLY
0
Entering edit mode

that's why I said, "palindromic" ;-)

ADD REPLY
0
Entering edit mode

All my enzymes are not palindromic; HpaII AciI HinGI

ADD REPLY
0
Entering edit mode

I see. OK.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

My problem is here in the script:

for chrom in parser:
    sys.stderr.writechrom.name+'\n')
    hits = restr.search(chrom.seq)
    for hit in hits:
        values = [chrom.name, str(hit), str(hit+4)]
        fout.write('\t'.join(values)+'\n')
        fout.flush()

I don't know how to incorporate the enzyme name and strand of site in the bed table.

ADD REPLY
0
Entering edit mode

You'll want to create a RestrictionBatch. When you search with that, everytime you run the search you'll get a dict returned, so you'll know which enzyme is doing the cutting. For the strand, you'll have to manually look as I don't think the biopython methods return that (they might assume palindromic sequences, I don't know).

ADD REPLY
2
Entering edit mode
11.0 years ago

I wrote a tool to digest a genome: https://github.com/lindenb/jvarkit/wiki/Biostar86480

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr3.fa.gz" |\
gunzip -c  |\
java -jar dist/biostar86480.jar -E AarI -E EcoRI 



chr3    60645    60651    GAATTC    1000    +    EcoRI    G^AATTC
chr3    60953    60959    GAATTC    1000    +    EcoRI    G^AATTC
chr3    68165    68172    GCAGGTG    1000    -    AarI    CACCTGC(4/8)
chr3    70263    70269    GAATTC    1000    +    EcoRI    G^AATTC
chr3    70945    70952    GCAGGTG    1000    -    AarI    CACCTGC(4/8)
chr3    71140    71146    GAATTC    1000    +    EcoRI    G^AATTC
chr3    72264    72270    GAATTC    1000    +    EcoRI    G^AATTC
chr3    74150    74156    GAATTC    1000    +    EcoRI    G^AATTC
chr3    75063    75069    GAATTC    1000    +    EcoRI    G^AATTC
chr3    78438    78444    GAATTC    1000    +    EcoRI    G^AATTC
chr3    81052    81059    CACCTGC    1000    +    AarI    CACCTGC(4/8)
chr3    84498    84504    GAATTC    1000    +    EcoRI    G^AATTC
chr3    84546    84552    GAATTC    1000    +    EcoRI    G^AATTC
chr3    84780    84787    CACCTGC    1000    +    AarI    CACCTGC(4/8)
chr3    87771    87777    GAATTC    1000    +    EcoRI    G^AATTC
chr3    95344    95351    GCAGGTG    1000    -    AarI    CACCTGC(4/8)
chr3    96358    96364    GAATTC    1000    +    EcoRI    G^AATTC
chr3    96734    96740    GAATTC    1000    +    EcoRI    G^AATTC
chr3    105956    105962    GAATTC    1000    +    EcoRI    G^AATTC
chr3    107451    107457    GAATTC    1000    +    EcoRI    G^AATTC
ADD COMMENT
0
Entering edit mode

Hi Pierre, thanks for the answer. I use your instruction but the result was the next:

curl -s "ftp://ftp.jgi-psf.org/pub/compgen/phytozome/v9.0/Egrandis/assembly/Egrandis_201.fa.gz" |\

gunzip -c |\

java -jar dist/biostar86480.jar -E HpaII AciI Hin6I

Unable to access jarfile dist/biostar86480.jar

ADD REPLY
0
Entering edit mode

did you compile the tool ?

ADD REPLY
0
Entering edit mode

Hi Pierre, no doubt that the development tool is perfect for my purpose but I can't install jvarkit. Given my inexperience I don't understand the instruction in https://github.com/lindenb/jvarkit/wiki/Compilation#edit-buildproperties Can you help me?

Rodrigo

$ ant biostar86480 Buildfile: /Users/rhasbun/jvarkit/build.xml

biostar86480: [echo] if the next javac statement fails, check that the 'build.properties' has been properly configured [javac] Compiling 1 source file to /Users/rhasbun/jvarkit/tmp

BUILD FAILED /Users/rhasbun/jvarkit/build.xml:974: The following error occurred while executing this line: /Users/rhasbun/jvarkit/build.xml:213: The following error occurred while executing this line: /Users/rhasbun/jvarkit/build.xml:153: /commun/data/packages/picard-tools-1.100 does not exist.

ADD REPLY
0
Entering edit mode

download and extract the picard library from https://sourceforge.net/projects/picard/files/picard-tools/1.100/ with a text editor open build.properties and set the path in picard.dir=/path/to/the/directory/containing/picard-100

ADD REPLY
0
Entering edit mode

Perfect Pierre, It's work!!! Only one question: How can incorporate one enzyme? For Hin6I the error message is: Cannot find enzyme null in RE list

Best

ADD REPLY
0
Entering edit mode

You're rright, it's missing in src/main/java/com/github/lindenb/jvarkit/util/bio/Rebase.java . Add it in the code and recompile or use HhaI (cuts GCG^C)

ADD REPLY
0
Entering edit mode

Fantastic...it's perfect....thanks so much Pierre

ADD REPLY
0
Entering edit mode

Hi Pierre, it's possible to use this tool for detect all CpG sites of a genome in FASTA? The output formate (bed table) would be the same

ADD REPLY
0
Entering edit mode

ask this as a new question please.

ADD REPLY
0
Entering edit mode

Hi Pierre,

I would like to apply the tool for a local reference genome stored in my MAC, How can make it?

Best

Rodrigo

ADD REPLY
0
Entering edit mode

java -jar dist/biostar86480.jar -E AarI -E EcoRI < your.genome.fa

ADD REPLY
0
Entering edit mode

Hi Pierre, I am glad to find your Biostar86480. It looks great to get the enzyme position. But I am new for bioinformatics and my serve is very old. I got error: SSL connect error while accessing https://github.com/lindenb/jvarkit.git/info/refs fatal: HTTP request failed when I put this: git clone "https://github.com/lindenb/jvarkit.git". Where can I download or could you send me your software by email? Thanks.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Hi pierre,

Thank you very much. I got it now. If I have any question, I will ask you again.

ADD REPLY
0
Entering edit mode

Hi Pierre, thank you for your tool that is very useful for me. But still one question, can i output the results to a file and count the number of the sites automaticly for each enzyme?

ADD REPLY
0
Entering edit mode

"can i output the results to a file" : see http://www.tldp.org/LDP/abs/html/io-redirection.html

"count the number of the sites automaticly for each enzyme" : http://www.linuxjournal.com/article/7396

ADD REPLY
0
Entering edit mode

Sorry, I am the beginner of linux, thank you for your help. Additional no related question is OK?

I have analyzed the HpaII and MboI sites in all promoter sequence of refgenes. I get the results as follows:

For example,

NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    109    113    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    176    180    CCGG    1000    +    HpaII    C^CGG
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    204    208    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    831    835    CCGG    1000    +    HpaII    C^CGG
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    980    984    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    1047    1051    CCGG    1000    +    HpaII    C^CGG
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    1358    1362    GATC    1000    +    MboI    ^GATC
NM_028778_up_2000_chr1_134210715_f chr1:134210715-134212714    1911    1915    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    717    721    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1361    1365    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1791    1795    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1860    1864    CCGG    1000    +    HpaII    C^CGG
NM_027671_up_2000_chr1_9289812_r chr1:9289812-9291811    1954    1958    CCGG    1000    +    HpaII    C^CGG

So next I want to pick the genes in which the seqences cantain both HpaII and MspI sites, if possible set the at least number for each site as parameter, then print out the name of the genes (Line1 of the results) and the number for each site, how should I do, could you help me , thank you very much!

ADD REPLY
0
Entering edit mode

Hi Pierre, Thanks a lot for this awesome tool. I am not sure if commenting on this old thread is good, but I had a quick question about the tool. I wanted to cut hg18 using 2 enzymes whose name I do not know (it's provided by a company) but I do know that the restriction sites for the 2 are - GATC^, and GANT^C (N=A/T/G/C). Do you think there is a way to do this using this tool? Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 743 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6