2.1 years ago
Dario
Hello,
I have a marker protein that is specific to the type of bacteria I am interested in. I would like to know if there is a way to retrieve all the genomes available at NCBI that contain that specific protein.
For instance, when I run a DELTA-BLAST search with my protein of interest, I get a list of bacteria that contain the protein. However, it only shows me the protein sequences. I would like to download the genomes of all those bacteria without having to do it manually.
Thank you very much in advance.
If you are able to get the accession numbers of those bacterial genomes via DELTA-BLAST, then you can easily download the genomes using the tools mentioned here: How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?
If you only have protein accessions then perhaps post a couple of examples. I can then show how to link those to genome accessions.
Thank you very much for your response. To be more specific, I am running BLASTP and DELTA-BLAST using the protein HzsA (GenBank: QII12200.1) as the query. This protein is a unique phylogenetic marker for Anammox bacteria, so I can use it as "bait" to retrieve all the genomes of my bacteria of interest. When I run these searches, I can download full lists like the ones in the image below. From those lists I can take all the accession numbers of that protein, but I would like to use that information to retrieve the genomes/assemblies associated with those proteins. Thank you very much in advance.
Unfortunately, those accession numbers appear to be from the env collection of sequences. So those are not directly queryable, but using the taxID may be an option in EntrezDirect. You can get the accession number for the GenBank assembly and FTP path in column 2.
One option to extract the genome assembly accessions is to use a different database with the Entrez tools. The database that connects the protein accession to the genome accession is the Identical Protein Groups (ipg). If you extract the list of accessions from the last column in your BLAST results, you should be able to use them as input. I'm not very familiar with Entrez tools, but you should be able to combine Entrez with NCBI Datasets to retrieve the genome accessions and download the genomes.
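To illustrate the ipg step, here is a minimal Python sketch for pulling the genome assembly accessions out of an ipg table. It assumes you have already saved a tab-separated ipg report to text (I believe something like `efetch -db protein -id QII12200.1 -format ipg` from EntrezDirect prints one, but please check the EDirect documentation), and that the assembly accession sits in the last column. The function name `assembly_accessions` is mine, not part of any NCBI tool.

```python
def assembly_accessions(ipg_table: str) -> list[str]:
    """Collect unique assembly accessions from the last column of an ipg table.

    Assumption: the input is tab-separated text with one protein record per
    line and the genome assembly accession (GCA_... or GCF_...) in the last
    field.
    """
    accessions = []
    for line in ipg_table.splitlines():
        fields = line.split("\t")
        candidate = fields[-1].strip()
        # Keep only values that look like assembly accessions. This also
        # skips the header line and records with no linked assembly.
        if not candidate.startswith(("GCA_", "GCF_")):
            continue
        # Preserve first-seen order while dropping duplicates.
        if candidate not in accessions:
            accessions.append(candidate)
    return accessions
```

Writing the returned list to a file, one accession per line, gives you something you can hand to a download tool. Records without a linked assembly (which may well be the case for env sequences) are simply dropped.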
So, here's my suggestion: use NCBI datasets to download the genome assemblies. You can cut the last field to extract the list of genome accessions and use that as input to NCBI datasets. If you want, datasets allows you to download not only the genome sequences but also other files, such as protein, RNA, etc., as long as they are available. To do so, you can use the --include flag and add any files you want.
I hope this helps :)
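In case it is useful, here is a rough Python sketch of how the datasets call with --include could be assembled from a file of genome accessions. It only builds the command; it does not run anything. The helper name `datasets_download_command`, the file names, and the default values are made up for illustration, and the flags (`--inputfile`, `--include`, `--filename`) are what I understand recent versions of the datasets command-line tool to accept, so check `datasets download genome accession --help` on your installation.

```python
import shlex


def datasets_download_command(accession_file, include=("genome",),
                              out_zip="ncbi_dataset.zip"):
    """Build the argument list for an NCBI datasets genome download.

    accession_file: text file with one assembly accession per line.
    include: file types to request, e.g. ("genome", "protein", "rna").
    out_zip: name of the zip archive that datasets should write.
    """
    return [
        "datasets", "download", "genome", "accession",
        "--inputfile", accession_file,
        # datasets expects the file types as one comma-separated value.
        "--include", ",".join(include),
        "--filename", out_zip,
    ]


# Hypothetical file names, for illustration only.
cmd = datasets_download_command("assemblies.txt",
                                include=("genome", "protein"),
                                out_zip="anammox_genomes.zip")
print(shlex.join(cmd))
```

You can paste the printed command into a terminal, or pass `cmd` to `subprocess.run(cmd, check=True)` if the datasets tool is installed and on your PATH.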