I am trying to compare the synteny of a homologous gene of a set of embl files. I wrote the following script in Biopython which does not work when the protein is at the beginning/end of the plasmid: http://www.ebi.ac.uk/ena/data/view/CP001187&display=text I only to display the region of interest and not the whole gene. For example I would like to get something like this:
-> -> -> => -> -> ->
which works fine when the protein/gen => is in the middle of the genome but if it in the end of the genome I get
-> -> -> -> -> -> =>
And the length of arrows become very short since I am plotting the whole genome.
If the gene is at the beginning of the genome I obtain
=> -> -> -> -> -> ->
So it does not take into account the circular nature of the plasmid. program output
from Bio.SeqIO import parse
from Bio.Graphics.GenomeDiagram import Diagram
n=1 # name of each track
cds=[] # index of all cds
pro=[] # index of target genes
# protein ids to look for
id0=set( 'ACK98665.1 ACK98642.1 ACK98513.1'.split() )
gdd=Diagram()
p='protein_id'
for rec in parse('CP001187.embl', 'embl'):
for j,i in enumerate(rec.features):
if i.type=='CDS':
# index of all cds
cds.append(j)
# index of protein ids
if p in i.qualifiers and i.qualifiers[p][0] in id0:
pro.append(j)
print pro, cds
for j in pro: # for each protein
gdfs=gdd.new_track(2, start=0,greytrack=True).new_set()
for i in range(-3,4):
idx=cds.index(j)+i
if idx>=len(cds): # locus of cds
locus = cds[idx - len(cds) ]
else:
locus = cds[idx]
gdfs.add_feature(rec.features[locus],sigil="ARROW")
n+=1
gdd.draw( circular=0, format='linear',fragments=1,)
gdd.write('test'+str(n)+'.pdf', 'PDF')
Showing a (hand drawn sketch) of the desired output would help. What do you mean by does not work? I suspect this is a bug in your code in the complicated list indexing (what is all this i/j stuff meant to achieve?). Why not add all the CDS features to the track immediately within the for loop over
rec.features
? Also there is an indentation problem with the loopfor i in pro:
where you might be trying to make a new track for each protein?I updated the question, the code and the output. I only want to show the CDS +/-3 of the genes of interest. So I first make the index of CDS and genes of interest. Copy paste from gedit seem to mess up the indentions.
No typos in the title also helps :) Fixed systeny -> synteny