Get number of available genomes per taxon - NCBI
6.5 years ago
tlorin ▴ 370

Dear all,

I am running tblastn with a protein query against the WGS databases on NCBI to search directly in the genomes of some taxa.

The protein is not present in every genome and I would like to be able to say "Protein X is present in n organisms out of the N in this lineage." (so, to be able to count N, the total number of sequenced genomes per taxon).

I have found two ways that give quite different results.

  1. tblastn the protein on, say "arthropoda", and retrieve the number appearing in the corresponding field in the output page: "wgs (676 databases)"

  2. use this page and retrieve the number. Here, for "Arthropoda", it is 552

Would you know any other command line or online tool to get N? The ideal way would be to use a taxon number as input (6656 for Arthropoda).

Thanks for your help!

ncbi genome • 1.8k views
 tail -n+2 assembly_summary_genbank.txt | datamash -sH  -g 6 count 6 collapse 8

@cpad0112 thank you for your help. What is this line doing exactly? I cannot get the number of available genomes for Arthropoda for instance.


It is doing something similar to what I did below using a different program and eliminating a couple of lines at the beginning of that file.

Did you see my note below?

6.5 years ago
tlorin ▴ 370

I found a way.

Count the number of available genomes for a given taxon (here, arthropods; note the wgs):

w3m -dump 'https://www.ncbi.nlm.nih.gov/nuccore/?term=wgs-master+%5Bprop%5D+AND+arthropoda+%5Borgn%5D' | grep "Items:" | rev | cut -f1 -d" " | rev

Count the number of available transcriptomes for a given taxon (here, arthropods; note the tsa):

w3m -dump 'https://www.ncbi.nlm.nih.gov/nuccore/?term=tsa-master+%5Bprop%5D+AND+arthropoda+%5Borgn%5D' | grep "Items:" | rev | cut -f1 -d" " | rev
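
Scraping the "Items:" line out of the rendered page can break whenever the page layout changes. A possibly sturdier variant is to ask NCBI's E-utilities esearch endpoint for the count directly. The sketch below assumes curl is installed and that the response carries a single Count element, which is what rettype=count is expected to return. The helper names parse_count and ncbi_count are made up for illustration.

```shell
# Extract the first <Count> value from an esearch XML response read on stdin.
parse_count() {
    grep -o '<Count>[0-9]*</Count>' | head -n 1 | sed 's/[^0-9]//g'
}

# Count nuccore records matching an Entrez query (needs network access and curl).
# $1 = Entrez query, e.g. 'wgs-master[prop] AND arthropoda[orgn]'
ncbi_count() {
    curl -sG 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi' \
        --data-urlencode 'db=nuccore' \
        --data-urlencode 'rettype=count' \
        --data-urlencode "term=$1" | parse_count
}
```

Calling ncbi_count 'wgs-master[prop] AND arthropoda[orgn]' should print the same kind of number as the w3m one-liner; replacing wgs-master with tsa-master gives the transcriptome variant.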

These links are for whole genome shotgun sequence records. As such, there is no guarantee that these genomes are complete or usable. You would want to state the source of these "genome" records in your paper when you report X out of Y genomes.


@genomax That's true, thanks for mentioning this. For a list of "complete or usable" genomes, what would you suggest instead?

6.5 years ago
GenoMax 147k

Get the assembly_summary_genbank.txt from here. The following will give you counts of the genomes per taxid:

awk -F '\t' '{print $6}' assembly_summary_genbank.txt | sort | uniq -c > file

Similar files can be found for RefSeq genomes here.

I see 20 genomes for taxid 552 as of this writing (note that 552 is the genome count quoted in the question; the taxid for Arthropoda is 6656). taxid annotations are at species level.
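
For the "complete or usable" part of the question, the same file records an assembly level for every entry. A minimal sketch, assuming the column layout described in the README that accompanies the file (column 11 = version_status, column 12 = assembly_level); count_by_level is a hypothetical helper name:

```shell
# Tally assemblies per assembly_level, keeping only rows marked "latest".
# $1 = path to an assembly summary file (tab-separated)
# Assumed columns: 11 = version_status, 12 = assembly_level.
count_by_level() {
    awk -F '\t' '
        /^#/ { next }                      # skip the header lines
        $11 == "latest" { n[$12]++ }       # ignore replaced or suppressed versions
        END { for (lvl in n) printf "%s\t%d\n", lvl, n[lvl] }
    ' "$1" | sort
}
```

Running count_by_level assembly_summary_genbank.txt prints one line per level, so you can decide which levels you are willing to call "usable" before counting.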


@genomax thanks! If I understand correctly, in this file each line corresponds to one species. How would I count for any taxonomic level (say, "Arthropoda" = taxon ID 6656)?


taxid annotations in that file seem to be provided at genus/species level.


OK, so there is no direct way to get the number of genomes for any given taxon based on this file: it has to be at the species level.
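
The file on its own cannot do this, but it can be combined with nodes.dmp from the NCBI taxonomy dump, which lists the parent of every taxid. The sketch below walks each assembly's taxid (column 6) up the tree and counts the row if the walk reaches the requested taxon. It assumes nodes.dmp has been downloaded and unpacked locally; count_under_taxon is a hypothetical helper name.

```shell
# Count assemblies whose taxid lies at or below a given taxon.
# $1 = target taxid (e.g. 6656), $2 = nodes.dmp, $3 = assembly summary file
count_under_taxon() {
    awk -F '\t' -v target="$1" '
        # First file (nodes.dmp): fields are tax_id, "|", parent tax_id, ...
        NR == FNR { parent[$1] = $3; next }
        # Second file: skip the header lines
        /^#/ { next }
        {
            t = $6
            # Climb towards the root; stop at the target, the root, or an unknown taxid
            while (t != target && (t in parent) && parent[t] != t) t = parent[t]
            if (t == target) n++
        }
        END { print n + 0 }
    ' "$2" "$3"
}
```

For example, count_under_taxon 6656 nodes.dmp assembly_summary_genbank.txt would print the number of assemblies under Arthropoda. This counts assemblies, not species; counting distinct values of column 7 (species_taxid) instead would give the number of species with at least one assembly.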


Using the NCBI Unix utilities (Entrez Direct), the information still seems to be at the same level, if you want to confirm it another way:

esearch -db genome -query genome | esummary | xtract -pattern DocumentSummary -element Organism_Name TaxId > file

Only 20? This does not seem like a lot to me (more than 20 Drosophila species have already been sequenced).
