Hi everyone, I'm trying to extract a CDS based on the product value. Basically want to extract a certain gene from multiple genbank files. This is the code I have so far:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature
from Bio.SeqRecord import SeqRecord
gb_file = open("fimA_seqs.gb", "r")
for gb_record in SeqIO.parse(gb_file, "genbank"):
# now do something with the record
for feat in gb_record.features:
if feat.type == "CDS":
product = feat.qualifiers['product'][0]
if product == 'Porphyromonas gingivalis major fimbrial subunit protein (FimA)' :
So I want to take the nucleotide sequence from any CDS (feature) with that product (qualifier) label and put them all in the same fasta file. Hopefully that makes sense! Any help would be greatly appreciated!!
So i can't incorporate feature.id in my script and work. Are you saying instead of saying:
I should put:
or: if 'FimA' in feature.id:
I also would like my script to use SeqIO.write to make the fasta file.
My mistake, I misunderstood what you were trying to do. Here's what you'd need to try. I haven't implimented a SeqIO write, but that should be fairly straightforward for you.
Note that that is kind of old syntax for building a string these days, but it works fine for this.
Thank you. I was able to adapt this for my own work. The code works great. Cheers.
So its when I used this code, just changed the product with the string i want to find, it gives me no error, but my output file is completely empty. I'm not sure what is going wrong, the code makes sense to me. With no error message i'm still kind of lost.
It needs to match exactly. You'll need to copy in the relevant section of the genbank and your exact search query for me to be able to help you any further. The code works just fine for me with a dummy genbank and random query
yep my "product =" was incorrect. So if i wanted to make have it search for a gene, so it just looks for:
instead of the exact phrase, how would I go about that?
Really easy. Python has nice tricks when it comes to comparing strings. If you simply change:
to
You'll get everything that contains that string. Bear in mind this will still be case sensitive, and will give you every gene that has that string in it's product line. Since product lines can be quite variable you might get some genes back which aren't FimA, but maybe interact with it and mention it in the product line, that kind of thing.
In the test files I'm using for example (a mitochondrial genome GBK) and a random gene, in this case, NADH, I get 6 results in my fasta file:
Thank you so much for your help! that worked!!
If it's all solved, make sure you accept this as an answer :)