I have thousands of SRA short reads on a server labeled by their accession IDs only. I need an easy way to get their taxa without searching each ID individually on the SRA database. Does anyone have a way of doing this?
Browser method
Navigate to the NCBI SRA portal and enter your query. Click the Send To link at the top right corner of the results table and download the results to a file in 'RunInfo' format.
You can open this comma-delimited file with Excel or any other spreadsheet program. The table has both the TaxID and ScientificName columns.
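If you would rather stay on the server than open a spreadsheet, the same two columns can be pulled out with awk. This is only a rough sketch: the function name runinfo_taxa and the file name SraRunInfo.csv are my own placeholders, it assumes the header row uses the column names Run, TaxID and ScientificName, and it assumes no field contains an embedded comma (a quoted field with a comma would break the naive splitting).

```shell
# Print Run, TaxID and ScientificName (tab-separated) from a RunInfo CSV on stdin.
# Assumption: no field contains an embedded comma, so splitting on "," is safe.
runinfo_taxa() {
  awk -F',' '
    NR == 1 {
      # Find the wanted columns by header name rather than by fixed position.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    # Skip blank lines and any repeated header rows.
    $col["Run"] != "" && $col["Run"] != "Run" {
      print $col["Run"] "\t" $col["TaxID"] "\t" $col["ScientificName"]
    }
  '
}

# Example use (hypothetical file name):
#   runinfo_taxa < SraRunInfo.csv > run_taxa.tsv
```

Looking the columns up by name means the sketch does not depend on where TaxID happens to sit in the table.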
Command-line method
You can use Entrez Direct for this. If you pipe the esearch command to efetch -format runinfo, you will get the comma-delimited runinfo table that has the TaxID and ScientificName columns. Alternatively, you can extract only a select set of fields as shown below:
$ esearch -db sra -query 'SRP014739' \
| esummary \
| xtract -pattern DocumentSummary -element Study@acc,Sample@acc,Experiment@acc,Run@acc,Organism@taxid,Organism@ScientificName
SRP014739 SRS353575 SRX174474 SRR534566 9606 Homo sapiens
SRP014739 SRS353574 SRX174473 SRR534565 9606 Homo sapiens
SRP014739 SRS353573 SRX174472 SRR534564 9606 Homo sapiens
SRP014739 SRS353572 SRX174471 SRR534563 9606 Homo sapiens
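Since the question involves thousands of accession IDs rather than one study, here is a rough sketch of how the same pipeline could be run over a list of IDs. It assumes you have a text file with one accession per line. The function names join_accessions and lookup_taxa, the file name accessions.txt, and the batch size of 100 are all my own choices, not anything recommended by NCBI. Each summary record describes an experiment, so the output may also list other runs that belong to the same experiment.

```shell
# Join accession IDs (one per line on stdin) into one query: "A OR B OR C".
join_accessions() {
  awk 'NF { printf "%s%s", sep, $1; sep = " OR " } END { if (sep != "") print "" }'
}

# Look up taxa for every accession listed in a file, 100 IDs per request.
# The batch size is an arbitrary assumption; adjust it as needed.
lookup_taxa() {
  accession_file="$1"
  batch_dir="$(mktemp -d)"
  split -l 100 "$accession_file" "$batch_dir/batch_"
  for batch in "$batch_dir"/batch_*; do
    [ -e "$batch" ] || continue   # nothing to do if the input file was empty
    esearch -db sra -query "$(join_accessions < "$batch")" \
      | esummary \
      | xtract -pattern DocumentSummary -element Run@acc,Organism@taxid,Organism@ScientificName
  done
  rm -rf "$batch_dir"
}

# Example use (hypothetical file names):
#   lookup_taxa accessions.txt > accession_taxa.tsv
```

Batching keeps each query string short and avoids sending one request per accession.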
Thank you. I implemented your Entrez Direct approach and it worked fantastically.
Thanks! This is my first time asking a question on BioStars. It wasn't immediately obvious to do that in my browser. Sorry.