Hi everyone, I have been working with tomato (s.lycopersicum) with its wild relative(s.pennellii). So below is a pair of ortholog protein I got previously using some alignment method.
XP_004243382.1--> protein LOC107025503 [Solanum pennellii]
XP_015081778.1-->protein LOC101253299 [Solanum lycopersicum]
you can easily search them on ncbi, so what I have to do is to analyse th promoter sequences of the proteins of all the orthologous proteins. I used the biopython script below to extract the promoter sequnce of all the gene in every chromosome genbank file for both species. genbank Chromosome files can be downloaded through the ftp site of ncbi
ftp://ftp.ncbi.nlm.nih.gov/genomes/Solanum_pennellii/
below is the biopython script
def prom_extract_multi(in_gbk = "in.gbk", prom_len = 1000, file_out = "prom_out.fna"):
from Bio import Seq
from Bio.Seq import Seq
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
prom_out = ""
GBrecord = next(SeqIO.parse(in_gbk, "genbank"))
for feature in GBrecord.features:
if feature.type =="gene":
my_start = feature.location._start.position # Identifies the start position of the gene on the sense strand (5' to 3' irrespective of actual coding strand).
my_end = feature.location._end.position # Identifies the end position of the gene on the sense strand (5' to 3' irrespective of actual coding strand).
start_1000 = my_start - prom_len
end_1000 = my_end + prom_len
if feature.strand == -1:
feat_loc = str(feature.location)
my_prom = GBrecord[my_end:end_1000].reverse_complement()
prom_out += "> Promoter rev_comp" + "___" + feature.qualifiers['db_xref'][0] + "___" + feat_loc + "\n"
prom_out += my_prom.seq.__str__() + "\n\n"
elif feature.strand == 1:
feat_loc = str(feature.location)
my_prom = GBrecord[start_1000:my_start]
prom_out += "> Promoter" + "___" + feature.qualifiers['db_xref'][0] + "___" + feat_loc + "\n"
prom_out += my_prom.seq.__str__()+"\n\n"
file=open(file_out, 'w')
file.write(prom_out)
file.close()
so the output file of the promoter fasta file looks like something like this
Promoter___GeneID:101264245___209673:210580 TCCTAAACGTCGTTTTTAATTGTTTGATTTTTTGTTAAAAAATTAATTTGTGT........
So i tried to get my promoter sequence for the example pair shown above first, I relate the geneID from the output to the protein id(XP_ 004243382.1) of my example pair to get the promoter sequence for both species, however I was only able to get the one s.pennellii, did I miss something here?? I searched the chromosome genbank file and found something could be causing this,
http://www.ncbi.nlm.nih.gov/gene/?term=NW_004194338%20
the link shows “See also 69 discontinued or replaced items.” could this missed/replaced items are the part of my promoter sequence??
Also, can anyone please suggest what I can use to analyse promoter sequences, I tried promoterwise by ebi by failed to install on my laptop, I am thinking just to use blastn , I hope I have explained my question clear enough
thank you very much
When you create the output file handler with the open function, use the option "a" instead of "w" to append the result to the output instead of overwritten it.
There's a bug in your FASTA output: you have a space between ">" and "Promoter_..." which can break many tools which will interpret this as a zero length identifier.
In this line of code prom_out += "> Promoter" + "___" + feature.qualifiers['db_xref'][0] + "___" + feat_loc + "\n"
I am getting KeyError: 'db_xref'
Any suggestions please
Me too premraj.preeti. If you found a solution for this weird error, can you please share? i have Biopython version 1.71 Thanks.