Find binding site start and end positions in a genome using Biopython
5.7 years ago
PaSua • 0

Hello everyone, Python newbie here.

I'm currently working with transcription factor binding sites, and I have several binding site sequences whose positions in the genome I don't know. For instance, I have the sequence "TGTAAACCTTTTCA", which belongs to NC_009004.1, and I was wondering whether there is a way to find its position (start and end) using Biopython or any other approach.

Sorry if this is a simple or naive question, but I've been trying to solve it by myself with videos, books and cookbooks, and so far I've got nothing.

Thank you in advance.

python biopython binding sites position • 2.0k views

If your genome is not too long, you can try the pairwise alignment module from Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Otherwise you'll have to align the sequence with a dedicated aligner such as BWA or Bowtie2.

Edit: the genome is Lactococcus lactis, about 2.5 Mb.

You can give Biopython a try and check the running time.
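If an exact match is all you need, a plain string search may already be enough before reaching for an aligner. Below is a minimal sketch, not a definitive implementation: it assumes the genome is available as one Python string (for example `str(SeqIO.read("genome.fasta", "fasta").seq)` with Biopython, where `genome.fasta` is a hypothetical file name), and the helper names `reverse_complement` and `find_sites` are made up for illustration. It reports 1-based, inclusive coordinates on both strands.

```python
def reverse_complement(seq):
    """Return the reverse complement of a plain A/C/G/T string."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_sites(genome, site):
    """Find every exact occurrence of `site` on both strands.

    Returns a sorted list of (start, end, strand) tuples with
    1-based, inclusive coordinates on the forward strand.
    """
    genome = genome.upper()
    site = site.upper()
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        pos = genome.find(pattern)
        while pos != -1:
            hits.append((pos + 1, pos + len(pattern), strand))
            # Step by one base so overlapping occurrences are also found.
            pos = genome.find(pattern, pos + 1)
    return sorted(hits)


# Toy sequence (an assumption, not real genome data): one forward-strand
# copy of the site and one reverse-strand copy.
site = "TGTAAACCTTTTCA"
toy_genome = "ACGT" + site + "GG" + reverse_complement(site)
print(find_sites(toy_genome, site))
# [(5, 18, '+'), (21, 34, '-')]
```

A palindromic site would be reported once per strand at the same coordinates, so deduplicate if that matters for your data.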


Transcription factors bind to motifs that can vary somewhat in nucleotide composition, so it is better to use a dedicated tool such as FIMO from the MEME Suite. Your pattern-matching approach would require 100% sequence identity, which is simply not how transcription factor binding works.
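To get a feel for how much variation there is around a known site, one simple stand-in is to scan the genome and accept windows that differ from the site by at most a few bases. The sketch below is only that: it is not what FIMO does (FIMO scores windows against a position weight matrix), the function name `find_with_mismatches` and the `max_mismatches` parameter are made up for illustration, and it scans the forward strand only.

```python
def find_with_mismatches(genome, site, max_mismatches=1):
    """Scan the forward strand for windows within a Hamming distance of `site`.

    Returns (start, end, matched_sequence, mismatches) tuples with
    1-based, inclusive coordinates.
    """
    genome = genome.upper()
    site = site.upper()
    k = len(site)
    hits = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        # Count positions where the window and the site disagree.
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((i + 1, i + k, window, mismatches))
    return hits


# Toy sequence (an assumption, not real data): one exact copy of the site
# and one copy carrying a single substitution (C -> G at position 8).
toy_genome = "CC" + "TGTAAACCTTTTCA" + "GG" + "TGTAAACGTTTTCA" + "CC"
for hit in find_with_mismatches(toy_genome, "TGTAAACCTTTTCA", max_mismatches=1):
    print(hit)
# (3, 16, 'TGTAAACCTTTTCA', 0)
# (19, 32, 'TGTAAACGTTTTCA', 1)
```

This loop compares every window base by base, so on a 2.5 Mb genome it will be noticeably slower than an exact search. To cover the reverse strand, run it again with the reverse complement of the site.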


BLAST would be the obvious choice, but expect lots of hits so you'll have to do some filtering a posteriori.

(You needn't use BLAST via python, but you could if you wanted to).


@PaSua, you can use seqkit locate.

Example input:

$ cat test.fa 
>a
TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA
>b
ATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA

output:

$ seqkit locate -idp "TGTAAACCTTTTCA" test.fa

seqID   patternName pattern strand  start   end matched
a   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   1   14  TGTAAACCTTTTCA
a   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   26  39  TGTAAACCTTTTCA
a   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   49  62  TGTAAACCTTTTCA
b   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   12  25  TGTAAACCTTTTCA
b   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   36  49  TGTAAACCTTTTCA