[solved]Extract region between specific genes from genbank file
0
0
Entering edit mode
8.8 years ago
dago ★ 2.8k

I guess my problem come from my poor programming skills, because logically it is quite straightforward.

I have a genbank file, containing few CDS and few other information on the Pfam domain identified in the encoded proteins. I would like to extract the intergenic region between specific genes.

For example:

Considering the for genes: cg_23, cg_24, cg_25, cg_26 (indicate with the respective locus tag) I want to have the intergenic region between cdg_23 and cg_24, cg_24 and cg_25, cg_25 ad cg_26.

I looked many other similar posts (e.g. upstream and downstream regions extraction , Extract User Defined Region From An Fasta File), but I cannot really find a way to apply these tools/approaches to my case.

Any advice?

Using biopython I can extract the position of the CDS in my file.

for f in gb_record.features:
     if f.type == "CDS":
             print f.qualifiers["locus_tag"], f.location

['ctg47_159'] [883:1729](+)
['ctg47_160'] [1771:2932](-)
['ctg47_161'] [3001:3718](+)
['ctg47_162'] [3714:4194](+)
['ctg47_163'] [4256:4664](+)
['ctg47_164'] [4787:5657](+)
['ctg47_165'] [5740:7549](+)
['ctg47_166'] [7570:8908](-)

.....

Now from here I should get the region between the genes; e.g. 1729-1771. Any idea how can move forward?

EDIT

Someone else thought about this long ago. Here a really nice script: http://biopython.org/wiki/Intergenic_regions

It works quite good. I only want to modify the output in the way that I get the locus_tag in the output fasta. Any suggestion?

EDIT

I added the following lines before the two 'for' loop that look in the two strands (modifying -1 to 1 for the other strand):

tag = []
    for f in seq_record.features:
        if f.type == "CDS" and f.strand == -1:
          tag.append(f.qualifiers["locus_tag"])

I then modified the id section as follow:

SeqRecord(intergene_seq,id="%s-%s" % (tag[i],tag[i+1]),
                  description="%s %d-%d %s" % (seq_record.name, last_end+1,
                                                        this_start,strand_string)))

And finally I changed the 'intergene_length' to -10 in hte function definition. In this way I can keep the consequentiality of the tag.

It is kind of dodgy but it does the job.

python genome biopython R • 2.7k views
ADD COMMENT
1
Entering edit mode

What is all this spam below?

ADD REPLY
0
Entering edit mode

which organism ?

ADD REPLY
0
Entering edit mode

Newly annotated bacterial genomes produced in my group.

ADD REPLY
0
Entering edit mode

it's not clear to me what's your input (genbank with a "bacterial genomes produced in your group" ?) and what's your problem as your able to extract the locations of the genes with python.

ADD REPLY
0
Entering edit mode

Thanks for the answer. I have a genbank of bacterial genomes. I can extract the the location of the genes, but I do not manage to find a way to use them to extract the intergeneic region between specific genes.

ADD REPLY

Login before adding your answer.

Traffic: 2147 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6