I am using Biopython to run a BLASTp on some proteins of interest. For each HSP, I want to take the returned accession and then fetch the corresponding GenBank file, i.e. the record that encodes this protein.
For a hopefully simple example: if I got the Chaperone protein DnaK for E. coli K12 MG1655 as a returned protein, I'd want to be able to backtrack to the gbk file for E. coli K12 MG1655.
Many of the protein accession files do not have a clear "this links you to a GBK file" or... "this is the specific taxid"...
So, my question is: can I do what I am trying to do? Perhaps I don't understand these files as well as I need to, but I had hoped that I could run a BLASTp, see the protein accession hits, and then parse the files for those accessions to retrieve the extra information needed to navigate to the correct gbk file.
Any thoughts on my predicament? And of course, let me know if I need to provide more information or if I am not clear enough, and I will do so as promptly as possible.
And THANK YOU for spending your time helping/reading!
Thank you for the feedback.
So I can use protein accessions (such as WP_000907403.1, EAW7713904.1, etc.) to retrieve GenBank files?
When I try this in Python:
Entrez.efetch(db='nuccore', id=id_list[1], rettype="gbwithparts", retmode="text")
where id_list[1] = ADT73839.1, it returns an HTTPError (Bad Request), which I presume is because it is not accepting the id.
The IDs returned from the BLASTp are a wide range of accessions(?) from different databases(?),
i.e. the BLASTp result gives me different kinds of accession.
Hopefully that clarifies my predicament. Let me know if you (or anyone) need a sample of what I am querying, etc.
Thanks again!
First things first: NCBI has many different databases. Here I've used nuccore and protein, each holding the respective records (nucleotide and protein). The nuccore db does not contain any proteins (protein accession numbers), so you need to give it a nucleotide accession number (in your case, that of the organism from which your protein originates), not the protein accession. (This is why we need elink: to link the databases.)

The posted command pipeline works for ADT73839.1 and retrieves CP002185.1, E. coli.

What you need to do is read up on the Biopython documentation and replicate the pipeline with Biopython.
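As an aside on telling the two kinds of accession apart before choosing a database: the sketch below is a rough heuristic based on the common NCBI accession layouts (RefSeq prefixes such as WP_ or NC_, three letters plus digits for GenBank proteins, one or two letters plus digits for GenBank nucleotides). The function name guess_entrez_db is made up for this illustration, and the patterns are deliberately incomplete: WGS, PDB and UniProt-style identifiers simply come back as "unknown".

```python
import re

# RefSeq prefixes (the part before the underscore); not an exhaustive list.
REFSEQ_PROTEIN = {"AP", "NP", "YP", "XP", "WP"}
REFSEQ_NUCLEOTIDE = {"AC", "NC", "NG", "NT", "NW", "NZ", "NM", "NR", "XM", "XR"}


def guess_entrez_db(accession):
    """Guess whether an accession belongs in 'protein' or 'nuccore'.

    Hypothetical helper: a heuristic only, returns 'unknown' when unsure.
    """
    base = accession.strip().split(".")[0].upper()  # drop the version suffix
    if "_" in base:
        prefix = base.split("_", 1)[0]
        if prefix in REFSEQ_PROTEIN:
            return "protein"
        if prefix in REFSEQ_NUCLEOTIDE:
            return "nuccore"
        return "unknown"
    # GenBank protein: 3 letters followed by 5 or 7 digits
    if re.fullmatch(r"[A-Z]{3}(\d{5}|\d{7})", base):
        return "protein"
    # GenBank nucleotide: 1 letter + 5 digits, or 2 letters + 6 or 8 digits
    if re.fullmatch(r"[A-Z]\d{5}|[A-Z]{2}(\d{6}|\d{8})", base):
        return "nuccore"
    return "unknown"
```

With this, the accessions mentioned in the thread (ADT73839.1, WP_000907403.1, EAW7713904.1) all come out as protein, which is consistent with nuccore rejecting them, while CP002185.1 comes out as nuccore.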
That pipeline has three steps:
1. create the esearch request, process it, and get the UIDs for the "protein" db
2. create the elink request, process it, and get the UIDs for the "nuccore" db
3. create the efetch request and actually download the data

There are some alternatives:
- the NCBI eutils web interface with e.g. the requests library; see this post: Entrez epost + elink returns results out of order with Biopython
- a brand-new Python package for precisely this kind of job: https://pypi.org/project/entrezpy/
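For the Biopython route, the esearch, elink and efetch steps could be sketched roughly as follows. This is a minimal sketch, not the pipeline that was actually posted in this thread: the function names linked_ids and genbank_for_protein are invented for the example, it only follows the first hit at each step, and it assumes network access plus a contact e-mail address that you supply yourself.

```python
def linked_ids(elink_records):
    """Collect every linked UID from a parsed elink result (a list of link sets)."""
    ids = []
    for record in elink_records:
        for link_set in record.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in link_set["Link"])
    return ids


def genbank_for_protein(accession, email, rettype="gb"):
    """Return GenBank text for the first nucleotide record linked to a protein.

    Hypothetical helper; returns None if the search or the link comes back empty.
    """
    # Imported here so that linked_ids() above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address with each request

    # Step 1: esearch, protein accession -> UIDs in the "protein" db
    handle = Entrez.esearch(db="protein", term=accession)
    protein_uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not protein_uids:
        return None

    # Step 2: elink, protein UID -> UIDs in the "nuccore" db
    handle = Entrez.elink(dbfrom="protein", db="nuccore", id=protein_uids[0])
    nuccore_uids = linked_ids(Entrez.read(handle))
    handle.close()
    if not nuccore_uids:
        return None

    # Step 3: efetch, download the nucleotide record as GenBank flat-file text
    handle = Entrez.efetch(db="nuccore", id=nuccore_uids[0],
                           rettype=rettype, retmode="text")
    text = handle.read()
    handle.close()
    return text
```

A call would look like genbank_for_protein("ADT73839.1", email="you@example.org"), with your own address in place of the placeholder. One caveat on the design: RefSeq WP_ accessions are non-redundant proteins that can be annotated on many genomes, so taking only the first linked nucleotide UID may not give the genome you expect; in that case it is worth inspecting the whole list that linked_ids returns.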
P.S.: I'm not sure that there is a gbwithparts format. Try something simple, e.g. gb; it might be what you want, and there is a possibility that gbwithparts would interfere with your query construction.

I was able to construct a pipeline with all of your recommendations with Biopython! Thank you so much for your replies and time spent!
THANKS!