I have a sequence-specific enzyme for cutting DNA. The enzyme recognizes a fixed 6 bp sequence (e.g. ACTAGT) and cuts both strands of the DNA at a specific location (e.g. A|CTAGT).
I was wondering if there is a way to find the locations of all possible cutting sites in the genome. That is, I'm looking for the location of every occurrence of the sequence "ACTAGT" (perfect match only) along the entire genome.
I have the FASTA files for each chromosome (chr1.fa, chr2.fa, etc.) from the hg19 database.
I considered using Bowtie2 with the -a option, but after reading the manual I think the program was not designed for this purpose, and the authors warn that "it could be extremely slow". Is there a (possibly lightweight) program that was designed specifically for this?
Thank you
A regex in Perl/Python should work fine; BioPerl and Biopython also include functions for this.
Something like this:
and run:
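As a minimal sketch of the regex approach (not the answerer's original snippet), the following uses only the Python standard library. The names `read_fasta` and `find_sites` and the script name `find_sites.py` are hypothetical. It assumes each chromosome fits in memory as one string, and it matches case-insensitively because the UCSC hg19 FASTA files soft-mask repeats in lowercase. ACTAGT is its own reverse complement, so scanning one strand finds the sites on both.

```python
import re
import sys


def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                header = line[1:].split()
                name, chunks = (header[0] if header else ""), []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_sites(sequence, site="ACTAGT", cut_offset=1):
    """Yield (start, cut) for every exact occurrence of `site`.

    Coordinates are 0-based. `cut_offset=1` corresponds to A|CTAGT:
    the cut falls between sequence[start] and sequence[start + 1].
    The lookahead makes the scan report overlapping occurrences too,
    which matters for other recognition sequences.
    """
    pattern = re.compile("(?=" + re.escape(site) + ")", re.IGNORECASE)
    for match in pattern.finditer(sequence):
        start = match.start()
        yield start, start + cut_offset


if __name__ == "__main__":
    site = "ACTAGT"
    # Each command-line argument is a FASTA file, e.g. chr1.fa chr2.fa
    for path in sys.argv[1:]:
        for chrom, seq in read_fasta(path):
            for start, cut in find_sites(seq, site):
                # Tab-separated: chromosome, site start, site end, cut position
                print(chrom, start, start + len(site), cut, sep="\t")
```

Saved as `find_sites.py`, it could be run as `python find_sites.py chr1.fa chr2.fa > sites.tsv`. The start and end columns are 0-based and half-open, so the first three columns can be read as BED intervals.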
In case this is a homework question, for extra marks, you might want to consider how polymorphisms could affect the result.
Even more embarrassingly, it is not. :D
By polymorphism, do you mean SNPs in the genome? Or in the enzyme?
Don't forget to ask how position and GC-content mutation frequency affect the sites too.