Guys, I've wrote a script to extract sequences between the 23S rRNA and 16S rRNA gene in Python using Biopython. I can't figure out why, even if I've set explicit conditions, it keeps extracting randome sequences into the genome. Here's the code:
for f in glob.glob(cwd+'/genomes/*.gbk'):
gbank = SeqIO.parse(open(f,'r'), "genbank")
with open(output_file, "w") as nfh:
for genome in gbank:
for gene in genome.features:
if gene.type=='rRNA':
if gene.location.strand == -1: # wanted just the reverse strand
if 'product' in gene.qualifiers:
if '16S' in gene.qualifiers['product'][0]:
end = gene.location.nofuzzy_start
elif '23S' in gene.qualifiers['product'][0]:
start = gene.location.nofuzzy_end
if abs(start - end) < 50000:
intergenic_sequence.append(">%s|ITS_full_sequence|%s\n%s" % (genome.description, gene.location.strand, genome.seq[start:end]))
I've set gene.type
to rRNA
and it extract a sequence into a random CDS, nothing to do with ribosomal RNA. This problem happens with several genomes tested (RefSeq records, well annotated).
You could give my code a try: https://github.com/jrjhealey/bioinfo-tools/blob/master/get_spans.py
I've not tested it with rRNA features (or much at all frankly).
It might be worth reading your files in an interactive python session. It's possible the 'key'
rRNA
isn't what's exactly in the genbank file (though I'm clutching at straws there a bit).At a glance, I can't see any obvious issue with your code. Perhaps take a step back and remove the
if '16S'
/if '23S'
conditionals, and just ensure you're getting results you'd expect to get at each stage, before adding in another layer of complexity/filtering.Side note, you dont need to use
open()
withSeqIO.parse
- it handles all that internally.The key
rRNA
seems to work if I want to extract a 16S rRNA gene. Every genome tested was open previously withartemis
where the search by key gives the right output every time. By far, I don't get why other condition values true even if they're explicitly wrong. I'm looking for gene with16S
inside product qualifiers, how could be even possible that apilus
associated CDS was extracted? May this be a bug or something else?Can you provide a link to one of your input genbanks and I'll have a little play with the data myself.
Sure, here's one. Bifidobacterium bifidum PRL2010 (https://www.ncbi.nlm.nih.gov/assembly/GCA_000165905.1)
Another small comment, I assume its elsewhere in your script, but you haven't declared the vector
intergenic_sequence
in that code snippet, so as posted, it would fail based on that alone.intergenic sequence
is a list declared before, not copied just to be short.