7.6 years ago
chrisgbeldam
I am trying to use Biopython to search through a FASTA file. I can get Biopython to parse and print the entire FASTA file. However, I want to be able to specify a particular UniProt ID in that file and have Biopython find it and return its attributes, e.g. ID, sequence and so on.
This is the code I have to search through the entire fasta file:
from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
    # The values must be passed to % as a single tuple
    print("%s %s %s %i" % (record.id, record, record.seq, len(record)))
Here is what I started writing, but it doesn't seem to work:
from Bio import SeqIO
for record in SeqIO.parse("CD4.fasta", "fasta"):
    for record in [P01730]:
        print("%s %s %s %i" %record.id, record, record.seq, len(record)))
Thanks!
This may be a silly comment, but the script runs and prints nothing. Am I safe to assume that the UniProt ID is not in the FASTA file?
Try to create a small fasta file with id P01730 and then check the script on that file.
Yep, it runs but prints nothing.
Here is the fasta file format:
This is because Biopython sees the id as sp|P01730|CD4_HUMAN, rather than P01730. I fixed it and updated my answer.
Thank you! I didn't realise that's how Biopython sees ids.
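To illustrate the point about ids: for a UniProt-style header, record.id is the whole pipe-separated token, so comparing it directly with P01730 never matches. Below is a minimal sketch of one way to compare only the accession field. The helper names split_uniprot_id and find_by_accession are hypothetical (they are not part of Biopython), and the sketch assumes ids of the form database|accession|entry_name as in the example above.

```python
def split_uniprot_id(record_id):
    """Split 'sp|P01730|CD4_HUMAN' into (database, accession, entry_name).

    Returns None for ids that do not have exactly three pipe-separated
    fields (assumption: UniProt-style ids always have three).
    """
    parts = record_id.split("|")
    if len(parts) != 3:
        return None
    return tuple(parts)


def find_by_accession(fasta_path, accession):
    """Yield records whose accession field equals `accession`."""
    # Imported lazily so split_uniprot_id can be used without Biopython.
    from Bio import SeqIO

    for record in SeqIO.parse(fasta_path, "fasta"):
        fields = split_uniprot_id(record.id)
        # Fall back to the whole id when it is not UniProt-style.
        candidate = fields[1] if fields else record.id
        if candidate == accession:
            yield record


# Example usage (file name taken from the question):
# for record in find_by_accession("CD4.fasta", "P01730"):
#     print(record.id, record.seq, len(record))
```

Comparing a single field keeps the match exact, so a query such as P0173 would not accidentally hit P01730.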
Last question, why does that only return P01731 as opposed to both of them?
Do you mean both sequence records or both ids?
So now, if I want to grab the CD4_HUMAN part as the species, would I do something like this?
species = record.id.split('|')[2] if '|' in record.id else record.species?
Yes. I can also rewrite the code to check whether Biopython's id contains your query id. That way, you wouldn't have to change the code whenever you look for P01731 or CD4_HUMAN. The script would be universal: it would work on any sequences (e.g. from different databases) without changing the code. However, the code would be slower. Do you want me to rewrite this?
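A rough sketch of what such a containment check might look like is given here. This is not the rewritten script from the answer, only an illustration; the function names id_contains, id_has_field and search_fasta are made up for this example. The plain substring test is the "contains" check described above, and the field-based variant is a stricter alternative that assumes pipe-separated ids.

```python
def id_contains(record_id, query):
    """Loose match: True if `query` appears anywhere in the id."""
    return query in record_id


def id_has_field(record_id, query):
    """Stricter match: True only if `query` equals one whole
    pipe-separated field of the id (assumes '|' as the separator)."""
    return query in record_id.split("|")


def search_fasta(fasta_path, query, match=id_contains):
    """Return all records whose id matches `query` under `match`."""
    # Imported lazily so the matching helpers work without Biopython.
    from Bio import SeqIO

    return [
        record
        for record in SeqIO.parse(fasta_path, "fasta")
        if match(record.id, query)
    ]


# Example usage (file name taken from the question):
# for record in search_fasta("CD4.fasta", "CD4_HUMAN"):
#     print(record.id, len(record))
```

One thing to keep in mind with the loose version is that a partial query such as P0173 also matches sp|P01730|CD4_HUMAN; passing match=id_has_field avoids that at the cost of assuming the pipe-separated layout.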