Question

How to retrieve all the genomes that contain a specific protein?

2.1 years ago

Dario • 0

Hello,

I have a marker protein that is specific to the type of bacteria I am interested in. I would like to know if there is a way to retrieve all the genomes available at NCBI that contain that specific protein.

For instance, when I run DELTA-BLAST with my protein of interest, I get a list of bacteria that contain the protein. However, the results only show the protein sequences. I would like to download the genomes of all those bacteria without having to do it manually.

Thank you very much in advance.

genomes NCBI Retrieve from • 930 views

ADD COMMENT • link updated 2.0 years ago by MirianT_NCBI ▴ 760 • written 2.1 years ago by Dario • 0

If you are able to get the accession numbers of those bacterial genomes via DELTA-BLAST, then you can easily download the genomes using the tools mentioned here: How to download all Pseudomonas aeruginosa Genomes from NCBI Genomes database?

If you only have protein accessions then perhaps post a couple of examples. I can then show how to link those to genome accessions.

ADD REPLY • link 2.1 years ago by GenoMax 147k

Thank you very much for your response. To be more specific, I am running BLASTP and DELTA-BLAST using the protein HzsA (GenBank: QII12200.1) as the query. This protein is a unique phylogenetic marker for anammox bacteria, so I can use it as "bait" to retrieve all the genomes of the bacteria I am interested in. When I run these BLAST searches, I can download full lists like the ones in the image below. From those lists I can take all the accession numbers of the protein, but I would like to use that information to retrieve the genomes/assemblies associated with those proteins. Thank you very much in advance.

[Image: BLAST result lists]

ADD REPLY • link 2.1 years ago by Dario • 0

Unfortunately, those accession numbers appear to come from the env collection of sequences, so they are not directly queryable. Using the taxID may be an option with EntrezDirect. The command below prints the GenBank assembly accession in the first column, followed by the FTP paths.

$  esearch -db assembly -query "174633 [taxid]" | esummary | xtract -pattern DocumentSummary -element Genbank,FtpPath
    GCA_945876185.1 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/945/876/185/GCA_945876185.1_AH-24oct19-121/GCA_945876185.1_AH-24oct19-121_assembly_report.txt    ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/945/876/185/GCA_945876185.1_AH-24oct19-121   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/945/876/185/GCA_945876185.1_AH-24oct19-121/GCA_945876185.1_AH-24oct19-121_assembly_stats.txt
    GCA_945873025.1 ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/945/873/025/GCA_945873025.1_MoH-02may19-341/GCA_945873025.1_MoH-02may19-341_assembly_report.txt  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/945/873/025/GCA_945873025.1_MoH-02may19-341  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/945/873/025/GCA_945873025.1_MoH-02may19-341/GCA_945873025.1_MoH-02may19-341_assembly_stats.txt
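In the output above, the third column is the assembly directory. As a minimal sketch, assuming the usual NCBI layout in which the genomic FASTA file is named after the last component of that directory plus a `_genomic.fna.gz` suffix, a small helper can turn a directory path into a download URL. The function name `ftp_dir_to_fasta_url` is made up for illustration, and switching the scheme to https is also an assumption about what the server offers.

```shell
# Hypothetical helper: build the genomic FASTA URL from an assembly FTP directory.
# Assumption: the file is named <last path component>_genomic.fna.gz.
ftp_dir_to_fasta_url() {
    fa_dir="${1%/}"                 # drop a trailing slash, if any
    fa_base="${fa_dir##*/}"         # e.g. GCA_945876185.1_AH-24oct19-121
    case "$fa_dir" in
        ftp://*) fa_dir="https://${fa_dir#ftp://}" ;;   # assumed https mirror of the same host
    esac
    printf '%s/%s_genomic.fna.gz\n' "$fa_dir" "$fa_base"
}
```

Feeding each directory path from the third column through this function gives a list of URLs that a downloader such as wget or curl could fetch. It is worth checking one URL by hand before running a batch.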

ADD REPLY • link 2.1 years ago by GenoMax 147k

One option to extract the genome assembly accessions is to use a different database with the Entrez tools. The database that connects the protein accession to the genome accession is the Identical Protein Groups (ipg). If you extract the list of accessions from the last column in your BLAST results, you should be able to use them as input. I'm not very familiar with Entrez tools, but you should be able to combine Entrez with NCBI Datasets to retrieve the genome accessions and download the genomes.
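If the BLAST results are saved as a tabular hit table rather than copied from the web page, the accession list can be pulled out with standard tools. This is only a sketch under stated assumptions: the file is tab-separated, comment lines start with `#`, and the subject accession sits in the second column (the usual tabular BLAST layout). The function name `hit_table_to_accessions` is hypothetical. For a descriptions table where the accession is the last column, the `cut` field would need to change accordingly.

```shell
# Hypothetical helper: list unique subject accessions from a BLAST hit table.
# Assumptions: tab-separated fields, subject accession in column 2,
# comment lines start with '#'.
hit_table_to_accessions() {
    grep -v '^#' "$1" | cut -f2 | grep -v '^$' | sort -u
}
```

The output can be redirected to a text file and used as the input list of protein accessions.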

So, here's my suggestion:

Use Entrez to find the genome accessions associated with the protein accessions identified by BLAST. I created a text file (pt_accessions.txt), based on the image you posted, with the following accessions:

WP_099325812.1
5C2V_A
MBM4064177.1
MBE7443926.1
MBV6466156.1
MBV6343041.1
MBI2470069.1

efetch -db ipg -input pt_accessions.txt
Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
5575921 RefSeq  NZ_CP049055.1   2479880 2482309 +       WP_099325812.1  hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis     CSTR1   GCF_011066545.1
5575921 RefSeq  NZ_LT934425.1   3896175 3898604 +       WP_099325812.1  hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis             GCF_900232105.1
5575921 RefSeq  NZ_OCTL01000042.1       10011   12440   +       WP_099325812.1  hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis             GCF_900232175.1
5575921 RefSeq  NZ_OCTL01000053.1       15047   17476   -       WP_099325812.1  hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis             GCF_900232175.1
5575921 RefSeq  NZ_OCTL01000095.1       468216  470645  -       WP_099325812.1  hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis             GCF_900232175.1
5575921 RefSeq  NZ_OCTL01000126.1       6615    9044    -       WP_099325812.1  hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis             GCF_900232175.1
5575921 Swiss-Prot      N/A                             Q1Q0T2.1        Hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis
5575921 INSDC   CT573071.1      577578  580007  +       CAJ73613.1      hypothetical (di heme) protein  Candidatus Kuenenia stuttgartiensis
5575921 INSDC   JABTUX010000001.1       58534   60963   -       MBE7545599.1    hypothetical protein    Planctomycetia bacterium       GCA_015075145.1
5575921 INSDC   JAIOIO010000259.1       150     2579    -       MBZ0193178.1    hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis             GCA_019912215.1
5575921 INSDC   SOES01000106.1  180     2609    -       MCF6153641.1    hypothetical protein    Candidatus Kuenenia stuttgartiensis    GCA_021646445.1
5575921 INSDC   JAHDXI010000151.1       150     2579    -       MCL4728533.1    hypothetical protein    Candidatus Kuenenia stuttgartiensis             GCA_023384235.1
5575921 INSDC   CP049055.1      1126548 1128977 +       QII10648.1      hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis     CSTR1   GCA_011066545.1
5575921 INSDC   CP049055.1      2479880 2482309 +       QII12200.1      hydrazine synthase subunit alpha        Candidatus Kuenenia stuttgartiensis     CSTR1   GCA_011066545.1
5575921 INSDC   LT934425.1      2914171 2916600 +       SOH05200.1      hydrazine synthase subunit A    Candidatus Kuenenia stuttgartiensis             GCA_900232105.1
5575921 INSDC   LT934425.1      3896175 3898604 +       SOH06076.1      hydrazine synthase subunit A    Candidatus Kuenenia stuttgartiensis             GCA_900232105.1
5575921 INSDC   PHFY01000181.1  6303    8732    +       TVL97567.1      hypothetical protein    Candidatus Kuenenia stuttgartiensis    GCA_007618145.1
93927976        PDB     N/A                             5C2V_A          Candidatus Kuenenia stuttgartiensis
93927976        PDB     N/A                             5C2V_D          Candidatus Kuenenia stuttgartiensis
93927976        PDB     N/A                             5C2W_A          Candidatus Kuenenia stuttgartiensis
93927976        PDB     N/A                             5C2W_D          Candidatus Kuenenia stuttgartiensis
369295220       INSDC   JABTVC010000001.1       202503  204935  +       MBE7443926.1    hypothetical protein    Planctomycetia bacterium        GCA_015075005.1
369295220       INSDC   JABTVC010000001.1       1063346 1065778 +       MBE7444714.1    hypothetical protein    Planctomycetia bacterium        GCA_015075005.1
383419830       INSDC   JACPHJ010000026.1       366     2798    +       MBI2470069.1    hypothetical protein    Planctomycetota bacterium               GCA_016188675.1
405823912       INSDC   VGXX01000003.1  54018   56447   +       MBM4064177.1    hypothetical protein    Planctomycetota bacterium      GCA_016873155.1
468843258       INSDC   JABXWD010000423.1       131     2611    -       MBV6343041.1    hypothetical protein    Candidatus Magnetobacterium casensis    MYR-1_YQ        GCA_019173545.1
469091989       INSDC   JABAQX010000086.1       119     2551    -       MBV6466156.1    Hydrazine synthase subunit alpha        Anaerolineales bacterium                GCA_019187435.1
469091989       INSDC   CP091279.1      2470348 2472780 -       UJS19433.1      hypothetical protein    Candidatus Brocadia sp.        GCA_021650915.1
469091989       INSDC   CP091279.1      1413450 1415882 -       UJS21951.1      hypothetical protein    Candidatus Brocadia sp.        GCA_021650915.1

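Use NCBI datasets to download the genome assemblies. You can cut the last field (Assembly) to extract the list of genome accessions and use that as input to NCBI datasets. Rows without an assembly, such as the PDB and Swiss-Prot entries, produce empty values that should be dropped.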

efetch -db ipg -input pt_accessions.txt | sed '1d' | cut -f11 | grep -v '^$' | sort -u > genome_accessions.txt

datasets download genome accession --inputfile genome_accessions.txt
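The IPG table lists some genomes twice, once under a RefSeq accession (GCF_) and once under the matching GenBank accession (GCA_), for example GCF_011066545.1 and GCA_011066545.1. If one copy per genome is enough, the table can be reduced before downloading. The sketch below assumes the efetch output was first saved to a tab-separated file with the header on line 1, and that a GCF_/GCA_ pair shares the same numeric part and version, which may not hold for every assembly. The function name `ipg_to_assemblies` is hypothetical.

```shell
# Hypothetical helper: read a saved IPG table (tab-separated, header on line 1)
# and print one assembly accession per genome, preferring GCF_ over GCA_.
ipg_to_assemblies() {
    awk -F'\t' '
        NR == 1 || $11 == "" { next }   # skip the header and rows without an assembly
        {
            key = substr($11, 5)        # numeric part plus version, e.g. 011066545.1
            if (!(key in best) || $11 ~ /^GCF_/) best[key] = $11
        }
        END { for (k in best) print best[k] }
    ' "$1" | sort
}
```

For example, after saving the table with `efetch -db ipg -input pt_accessions.txt > ipg.tsv`, running `ipg_to_assemblies ipg.tsv > genome_accessions.txt` would give a shorter input list for datasets.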

If you want, datasets can download not only the genome sequences but also other files, such as protein and RNA sequences, as long as they are available. To do so, use the --include flag and list the files you want.

I hope this helps :)

ADD REPLY • link 2.0 years ago by MirianT_NCBI ▴ 760