5.7 years ago
PaSua
Hello everyone, Python newbie here.
I'm currently working with transcription factor binding sites, and I have several binding-site sequences whose positions in the genome I don't know. For instance, I have the sequence "TGTAAACCTTTTCA", which belongs to NC_009004.1, and I was wondering if there is a way to find its position (start and end) using Biopython or any other approach.
Sorry if this is a simple or naive question, but I've been trying to solve it by myself by checking videos, books and cookbooks, and so far I've got nothing.
Thank you in advance.
If your genome is not too long, you can try the pairwise alignment module from Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
Otherwise you'll have to align this sequence using dedicated alignment software such as BWA or Bowtie2.
Edit: Lactococcus lactis, 2.5M bases.
You can give Biopython a try and check the running time.
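For a genome of this size, an exact string search may already be enough to get start and end coordinates. The following is only a minimal sketch using plain Python strings, not the pairwise2 approach itself; the function names `reverse_complement` and `find_exact` are made up for illustration, and it assumes the genome has already been read into a single string (with Biopython that could be done with `str(SeqIO.read(path, "fasta").seq)`, where `path` is whatever FASTA file you downloaded for NC_009004.1).

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def find_exact(genome, site):
    """Find exact occurrences of `site` on both strands of `genome`.

    Returns a sorted list of (start, end, strand) tuples with 1-based,
    inclusive coordinates on the forward strand. A palindromic site is
    reported once per strand.
    """
    genome = genome.upper()
    site = site.upper()
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        pos = genome.find(pattern)
        while pos != -1:
            hits.append((pos + 1, pos + len(pattern), strand))
            # Continue one base further so overlapping matches are found too.
            pos = genome.find(pattern, pos + 1)
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence standing in for the real genome (an assumption for the demo).
    toy_genome = "GGG" + "TGTAAACCTTTTCA" + "CCC" + "TGAAAAGGTTTACA" + "AA"
    for start, end, strand in find_exact(toy_genome, "TGTAAACCTTTTCA"):
        print(start, end, strand)
```

Searching the reverse complement as well matters here because a binding site can lie on either strand; the coordinates are always given relative to the forward strand.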
Transcription factors bind to motifs, which can vary somewhat in nucleotide composition. It is better to use a dedicated tool such as fimo from the MEME suite for this. Your pattern-matching approach would require 100% sequence identity, which is simply not how transcription factor binding works.
BLAST would be the obvious choice, but expect lots of hits, so you'll have to do some filtering a posteriori.
(You needn't use BLAST via python, but you could if you wanted to).
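To illustrate the point above that binding sites tolerate some variation, here is a rough sketch of a search that allows a few mismatches. It is a crude stand-in, not what fimo or BLAST do (fimo scores a position weight matrix rather than counting mismatches); the function name `find_with_mismatches` and the `max_mismatches` parameter are made up for this example, and only the forward strand is scanned.

```python
def find_with_mismatches(genome, site, max_mismatches=1):
    """Scan the forward strand for windows within a Hamming distance of `site`.

    Returns a list of (start, end, mismatches) tuples with 1-based,
    inclusive coordinates. Insertions and deletions are not considered.
    """
    genome = genome.upper()
    site = site.upper()
    k = len(site)
    hits = []
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        # Count positions where the window differs from the query site.
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((start + 1, start + k, mismatches))
    return hits


if __name__ == "__main__":
    # Toy sequence (an assumption for the demo): one exact copy of the site
    # and one copy with a single substitution.
    toy_genome = "AAAA" + "TGTAAACCTTTTCA" + "GG" + "TGTAAACGTTTTCA" + "TT"
    for hit in find_with_mismatches(toy_genome, "TGTAAACCTTTTCA", max_mismatches=1):
        print(hit)
```

A plain Python loop like this will be slow on a 2.5M-base genome and reports many spurious hits as `max_mismatches` grows, which is one reason a dedicated motif scanner is the better choice for real data.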
@PaSua, you can use seqkit locate.
Example input:
output: