Newbie to bioinformatics here; I'm struggling with code that shouldn't be hard. I have a list of 800+ accession numbers for proteins of interest, and I'm trying to get the corresponding protein sequence for all of them.
I've downloaded the FASTA file from UniProt, and I'm trying to figure out a way to get the sequences into a list using the Biopython module. So far my code looks something like this:
Creating the original list of 800+ accession numbers (this part is fine)
    import openpyxl

    file = openpyxl.load_workbook('substrate_1.xlsx')
    # Subscript access replaces the deprecated get_sheet_by_name()
    Y_100 = file['Supplementary Table 2. Y100Bpa']
    rownumber = Y_100.max_row
    Acc = []
    # Accession numbers sit in column 1, starting at row 3
    for r in range(3, rownumber + 1):
        Acc.append(Y_100.cell(column=1, row=r).value)
Trying (and failing) to parse the FASTA:
    from Bio import SeqIO

    for seq_record in SeqIO.parse("uniprot.fasta.", "fasta"):
        # Is this how I would select only the accession numbers from my original list?
        if seq_record.id in Acc:
            print(seq_record.id, repr(seq_record.seq), len(seq_record))
So far this code doesn't work at all. What am I doing wrong here? I also tried creating a dictionary instead of a list; would that be a better solution?
Thanks in advance from someone lost in the world of bioinformatics
Please use the formatting bar (especially the `code` option) to present your post better. I've done it for you this time.
Just reformatted, first-time poster here, thanks for the tip!
When you say "this part is fine", do you mean:
Have you tried simply using a text file with the IDs alongside any tool from one of the gazillion answers on the site addressing "retrieve sequence by identifier" questions?
There is no error and the list was generated.
I've looked at some other posts, but they don't quite fit my question, or they use different modules or programming languages. I'll keep looking, I guess.
Have you checked manually that any of the accessions from your sheet match the accessions in the FASTA?
Note that Biopython by default only uses the header from `>` to the first space, and you're doing direct string comparisons (though `in` rather than `==` is the better choice, it is still not guaranteed to work).
Can you show us some of the format of your list of accessions and the format of your UniProt FASTA?
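To make the point about IDs concrete, here is a small sketch that mimics, in plain Python, how the ID is cut from a header line. It assumes a UniProt-style header of the form `>sp|P49588|SYAC_HUMAN Alanine...` (the description is left elided) and a hypothetical one-item list `acc_list`; it is only an illustration, not the poster's actual data.

```python
# Assumed UniProt-style header; the description after the space stays elided.
header = ">sp|P49588|SYAC_HUMAN Alanine..."

# Mimic what Biopython's FASTA parser exposes as record.id:
# the text after ">" up to the first whitespace.
record_id = header[1:].split(None, 1)[0]

# Hypothetical stand-in for the list read from the spreadsheet.
acc_list = ["SYAC_HUMAN"]

# List membership compares whole strings, so the full ID never matches.
id_in_list = record_id in acc_list

# Substring tests do match, but they also match partial names.
name_in_id = acc_list[0] in record_id
partial_in_id = "SYAC" in record_id

print(record_id)      # sp|P49588|SYAC_HUMAN
print(id_in_list)     # False
print(name_in_id)     # True
print(partial_in_id)  # True (a false positive)
```

The last line is the reason a substring test is not guaranteed to be safe either: a shorter name that happens to be contained in a longer one will also match.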
Yes, I've checked and the accessions match.
Here's the first line of the list of my accession numbers (cut short for space):
Here's from the FASTA file (I only included SYAC_HUMAN, the first accession number):
PS: actually, maybe using `==` is better than `in`, because I realized some of the accession numbers in the FASTA are the same as the ones in my list with an extra letter/number.
Your code should be `Acc in seq_record.id`, not the other way round. This is because you are checking whether `SYAC_HUMAN` is in `sp|P49588|SYAC_HUMAN Alanine...` and not the other way.
`Acc` seems to be an array, so OP needs to account for that as well.
Please use the code (`101010`) formatting option to differentiate programmatic content from other content.
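Pulling the suggestions in this thread together, one possible approach is to split each record ID on `|` and compare the pieces exactly against a set of wanted names. That avoids both the whole-string mismatch and the "extra letter/number" partial matches mentioned above. This is only a sketch: `select_records`, `Record`, and the demo entries (including the placeholder `X00000|FAKE_HUMAN` and the made-up sequences) are hypothetical names and data, and `Record` merely stands in for Biopython's `SeqRecord` so the example runs without a FASTA file.

```python
from collections import namedtuple

# Stand-in for Bio.SeqRecord.SeqRecord; records from SeqIO.parse()
# expose the same .id and .seq attributes.
Record = namedtuple("Record", ["id", "seq"])


def select_records(records, wanted):
    """Yield (matched_name, record) for records whose ID fields are in `wanted`."""
    wanted = set(wanted)  # set lookup avoids rescanning an 800+ item list
    for record in records:
        # "sp|P49588|SYAC_HUMAN" -> ["sp", "P49588", "SYAC_HUMAN"]
        fields = record.id.split("|")
        # Skip the database tag when pipes are present; else use the bare ID.
        for key in fields[1:] or fields:
            if key in wanted:  # exact comparison, no substring matching
                yield key, record
                break


# Made-up demo data: the sequences are placeholders, not real proteins.
demo_records = [
    Record("sp|P49588|SYAC_HUMAN", "MAAA"),
    Record("sp|X00000|FAKE_HUMAN", "MCCC"),
]

# A dictionary keyed by the matched name gives direct lookup afterwards.
hits = dict(select_records(demo_records, ["SYAC_HUMAN"]))
print(sorted(hits))

# With Biopython and the real file, the call would look roughly like:
#   from Bio import SeqIO
#   hits = dict(select_records(SeqIO.parse(fasta_path, "fasta"), Acc))
```

On the dictionary question from the original post: building `hits` as a dictionary keyed by the matched name makes later lookups by accession straightforward, while the set inside the function keeps the membership test fast.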