Hey all,
I am interested in getting records particular CDS matches. I have particular BLAST CDS matches that I need to retrieve the Genbank genome assembly ascension number. I have several hundred so I wanted to automate this with Entrez in biopython if possible. My blast results give me a protein ID number, that when searched in NCBI, come up with the metagenome that it is associated with, I was just looking for guidance using biopython to retrieve the Genome assembly GenBank ascension ID. My current script currently isn't even finding the record? which is odd because it shows up when searching NCBI's protein database on the website. I did a search with the below script and had 0 hits.
from Bio import Entrez
Entrez.email = "xxxxx@xxxx.com
handle = Entrez.egquery(term='MAG94467.1')
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
print(row['DbName'], row["Count"])
What is it that you are looking for in the end? You can search the ‘assembly’ database and fetch summary about the assembly but that won't get you the sequence. If you want A specific sequence then you need to figure out which sequence it is you want from the assembly because an assembly can have more than one nucleotide sequence and each of those in turn can have multiple annotated proteins. Perhaps I can help you if you can describe what it is that you are trying to achieve in a little more detail.
So what the end game is, is to append a pandas dataframe with the Genbank Assembly ID (i.e. GCA_XXXX.X). I ran a blast of a concatenated .faa file from thousands of assembled genomes(obtained from a BioProject) through making a local database, and do not have the assembly ID's associated the protein accension numbers anymore. So what I need to do is find the complete Genbank assembly ID 's to add to the data frame and continue form there. The problem is when I search the protein acc ID (i.e. MAG94467.1) I get a contig/WGS acc ID but not the genome assembly ID, although when I search manually on NCBI I see it come up. The problem is I need to automate this as I have alot of different BLAST searches and a lot of different proteins so I am trying to do it through Entrez
Is this final answer to your original question or just additional information? If latter this should either be added to the original post or moved to a comment on top post.
Thanks, I understand this better now. For this purpose, here's what I would do:
Download the entire list of GCA accessions
For each GCA accession, download the feature_table.txt file from NCBI FTP. For example, the feature_table.txt file for the assembly GCA_002691735.1 is here. You should be able to write a bash loop to fetch all of the feature_table.txt files. Either that, or you can go to the BioProject page, click on the 'Assembly' link on the right hand side to list all associated assemblies. Then, click on the blue 'Download Assemblies' button and pick 'Feature table' as your file of choice.
Once you have all of the feature table files, concatenate them, and extract the lines with CDS in column 1, and a protein accession in column 11; column 3 has the GCA accession.
Alright so I am getting the WGS/contig ascencion but not the assembly ID (i.e. starts with GCA_XXXX.X). Any idea on how to retrieve that?
What is it that you are looking for in the end? You can search the ‘assembly’ database and fetch summary about the assembly but that won't get you the sequence. If you want A specific sequence then you need to figure out which sequence it is you want from the assembly because an assembly can have more than one nucleotide sequence and each of those in turn can have multiple annotated proteins. Perhaps I can help you if you can describe what it is that you are trying to achieve in a little more detail.
So what the end game is, is to append a pandas dataframe with the Genbank Assembly ID (i.e. GCA_XXXX.X). I ran a blast of a concatenated .faa file from thousands of assembled genomes(obtained from a BioProject) through making a local database, and do not have the assembly ID's associated the protein accension numbers anymore. So what I need to do is find the complete Genbank assembly ID 's to add to the data frame and continue form there. The problem is when I search the protein acc ID (i.e. MAG94467.1) I get a contig/WGS acc ID but not the genome assembly ID, although when I search manually on NCBI I see it come up. The problem is I need to automate this as I have alot of different BLAST searches and a lot of different proteins so I am trying to do it through Entrez