Question

Any way to retrieve the annotation information of genomes under a BioProject? (NCBI)

19 months ago

v.berriosfarias ▴ 140

Hello! I want to retrieve the annotation data (tRNAs, rRNA genes) from genomes located under this bioproject https://www.ncbi.nlm.nih.gov/bioproject/729490

There are 677 genomes, and by clicking on "Genome-Annotation-Data" for each genome entry (e.g. GCA_029245675.1) I can see the number of tRNAs and rRNA genes. I need a table that reports the number of these genes for each genome in this bioproject. Is there a way to do this, for example with E-utilities? I have been looking for commands to do this but haven't found anything relevant.

Thanks for your time.

NCBI PGAP Bioproject annotation • 1.0k views

score 1 · Answer 1 · 2023-04-10

$ esearch -db assembly -query GCA_029245675 | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1

Find the feature table file in this directory: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1/GCA_029245675.1_ASM2924567v1_feature_table.txt.gz

That file has the detailed information you need.

If you only need a summary, get the feature count file instead: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1/GCA_029245675.1_ASM2924567v1_feature_count.txt.gz

$ more GCA_029245675.1_ASM2924567v1_feature_count.txt
# Feature       Class   Full Assembly   Assembly-unit accession Assembly-unit name      Unique Ids      Placements
CDS     with_protein    GCA_029245675.1 GCA_029245695.1 Primary Assembly        1824    1824
CDS     without_protein GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      7
gene    RNase_P_RNA     GCA_029245675.1 GCA_029245695.1 Primary Assembly        1       1
gene    SRP_RNA GCA_029245675.1 GCA_029245695.1 Primary Assembly        1       1
gene    protein_coding  GCA_029245675.1 GCA_029245695.1 Primary Assembly        1824    1824
gene    pseudogene      GCA_029245675.1 GCA_029245695.1 Primary Assembly        7       7
gene    rRNA    GCA_029245675.1 GCA_029245695.1 Primary Assembly        3       3
gene    tRNA    GCA_029245675.1 GCA_029245695.1 Primary Assembly        49      49
gene    tmRNA   GCA_029245675.1 GCA_029245695.1 Primary Assembly        1       1
ncRNA   RNase_P_RNA     GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      1
ncRNA   SRP_RNA GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      1
rRNA            GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      3
tRNA            GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      49
tmRNA           GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      1
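
To get one row per genome for the whole bioproject, the same steps could be looped over every assembly. The following is only a sketch, not a tested pipeline. The function names `rna_counts` and `bioproject_rna_table` are made up for illustration. It assumes that a plain-text search of the assembly database for the BioProject accession returns the genomes of interest, that every assembly directory contains a `_feature_count.txt.gz` file named like the one above, and that the columns are tab-separated as in the header line shown above.

```shell
# Hypothetical helper: read one feature_count file on stdin and print
#   accession <TAB> rRNA genes <TAB> tRNA genes
# Uses the "gene" rows and the Placements column (7th field).
# If a file lists several assembly units, their rows are summed (assumption).
rna_counts() {
  awk -F'\t' '
    !/^#/ && $1 == "gene" { acc = $3; n[$2] += $7 }
    END { printf "%s\t%d\t%d\n", acc, n["rRNA"], n["tRNA"] }
  '
}

# Hypothetical helper: print the table for every assembly found for a BioProject.
# Needs Entrez Direct (esearch, esummary, xtract), curl and gunzip on the PATH.
bioproject_rna_table() {
  esearch -db assembly -query "$1" | esummary |
    xtract -pattern DocumentSummary -element FtpPath_GenBank |
    while read -r ftp; do
      [ -n "$ftp" ] || continue                 # skip records without a GenBank path
      # ftp://host/dir/NAME -> https://host/dir/NAME/NAME_feature_count.txt.gz
      url="https${ftp#ftp}/${ftp##*/}_feature_count.txt.gz"
      curl -fsSL "$url" | gunzip -c | rna_counts
    done
}
```

Something like `bioproject_rna_table PRJNA729490 > rna_counts.tsv` would then write the table. It is worth checking that the number of rows matches the 677 genomes expected.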

score 1 · Answer 2 · 2023-04-10

You can use NCBI Datasets to search for genomes by BioProject and download the data directly. In this instance, you can download the annotation data in GFF3 format as follows:

datasets download genome accession PRJNA729490 --annotated --include gff3

Once you have the GFF3 files, you can parse them to extract the information you need, including the counts of different feature types included in the annotation.
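
As a rough illustration of the parsing step, the sketch below counts rRNA and tRNA features per genome by looking at the third (feature type) column of each GFF3 file. The function names `gff3_rna_counts` and `gff3_table` are made up for this example. It assumes the downloaded archive has been unzipped and that each genome's annotation sits at `ncbi_dataset/data/<accession>/genomic.gff`, so adjust the path if your layout differs.

```shell
# Hypothetical helper: read one GFF3 file on stdin and print
#   rRNA features <TAB> tRNA features
# Counts rows whose type column (3rd field) is exactly rRNA or tRNA.
gff3_rna_counts() {
  awk -F'\t' '
    !/^#/ { n[$3]++ }
    END { printf "%d\t%d\n", n["rRNA"], n["tRNA"] }
  '
}

# Hypothetical helper: print accession <TAB> rRNA <TAB> tRNA for every genome
# under the unzipped download directory given as $1 (layout is an assumption).
gff3_table() {
  for gff in "$1"/ncbi_dataset/data/*/genomic.gff; do
    [ -f "$gff" ] || continue
    acc=$(basename "$(dirname "$gff")")
    printf '%s\t%s\n' "$acc" "$(gff3_rna_counts < "$gff")"
  done
}
```

After `unzip ncbi_dataset.zip -d prjna729490`, running `gff3_table prjna729490 > rna_counts.tsv` would give one line per genome. These numbers count rRNA and tRNA feature rows, which should agree with the Placements column of the feature count file from the other answer, but it is worth spot-checking a genome or two.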