Question

Downloading metadata of all insect SRA sequences from NCBI SRA

11 months ago

nitinra ▴ 50

Hello all,

I am planning to screen SRA sequences for specific bacteria, using up to 10 sequences per species (minimum 1) for every insect species that has SRA data on NCBI. So far, to get the SRA accessions, I have been navigating through the NCBI taxonomy, manually checking the sequence metadata to find the species ID and so on, and then downloading each one. Is there a way to speed this up by downloading the SRA metadata for all insects, or even at the order level? With the metadata in hand, I could select the sequences I want and use Batch Entrez to download them. Any help with this would be greatly appreciated!

Thanks!

database NCBI SRA • 648 views

updated 11 months ago by Philipp Bayer 8.8k • written 11 months ago by nitinra ▴ 50

score 3 · Accepted Answer · 2024-08-26

I believe this should work:

esearch -db sra -query "Arthropoda[Organism]" | efetch -format runinfo

This will print rows in comma-delimited format like:

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
DRR577282,2024-08-26 13:38:19,2024-08-26 13:54:36,2049652,22280260442,0,10870,13062,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-34/DRR000/577/DRR577282/DRR577282.1,DRX560715,,WGS,other,GENOMIC,SINGLE,0,0,PACBIO_SMRT,Sequel IIe,DRP011949,,,0,DRS402361,,simple,79782,Cimex lectularius,DRS402361,,,,,,,no,,,,,HIROSHIMA UNIVERSITY,DRA018917,,public,26389700DC628652166EE77F6BBC90B2,725D0A49A4AEBD8933AC764D79FDDE0B
DRR577283,2024-08-26 13:38:19,2024-08-26 14:00:19,1990260,20130137244,0,10114,11641,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos2/sra-pub-run-34/DRR000/577/DRR577283/DRR577283.1,DRX560716,,WGS,other,GENOMIC,SINGLE,0,0,PACBIO_SMRT,Sequel IIe,DRP011951,,,0,DRS402363,,simple,79782,Cimex lectularius,DRS402363,,,,,,,no,,,,,HIROSHIMA UNIVERSITY,DRA018918,,public,77AB85865C455CC534EE518AA439ADB0,F4C6CCCFAF1C8B7C2FAFD5D9432D0466
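Since the header is long, it may help to number the columns before deciding which ones to keep. A minimal sketch, assuming the runinfo output has been piped in or saved to a file; list_columns is a hypothetical helper name, not part of EDirect:

```shell
# list_columns: print each header field of a comma-delimited file read on
# stdin, prefixed with its 1-based column number and a tab.
# (Hypothetical helper for illustration; only the first line is inspected.)
list_columns() {
    head -n 1 | tr ',' '\n' | awk '{ print NR "\t" $0 }'
}

# Example (assumes the efetch output was saved as runinfo.csv):
#   list_columns < runinfo.csv
```

The column numbers it prints can then be passed to something like cut -d',' -f.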

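You can make this faster by being a bit more stringent in the query. For example, as written it will download both RNA-seq and genomic sequences, and perhaps you only want one of them. Note also that Arthropoda is broader than the insects (it also covers arachnids and crustaceans, among others), so "Insecta[Organism]" is a closer match to what you asked for. Then use your favorite tool to pull out only the columns you need!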
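For the "up to 10 sequences per species" part of the question, one possible way to thin the table is sketched below. It assumes the header names shown above (Run, ScientificName, LibrarySource) and that no field contains an embedded comma, since it splits naively on ","; pick_runs_per_species is a hypothetical helper name, and it simply keeps the first runs it encounters for each species rather than choosing them by quality.

```shell
# pick_runs_per_species [N]: read a runinfo CSV on stdin and keep at most
# N runs (default 10) per ScientificName, printing three columns.
# Assumption: fields contain no embedded commas (naive split on ",").
pick_runs_per_species() {
    awk -F',' -v max="${1:-10}" '
        NR == 1 {
            # Locate columns by header name rather than by fixed position
            for (i = 1; i <= NF; i++) col[$i] = i
            print "Run,ScientificName,LibrarySource"
            next
        }
        # Skip blank lines and any repeated header lines in the input
        $1 == "Run" || $1 == "" { next }
        {
            species = $(col["ScientificName"])
            # Count runs seen per species; print only the first max of them
            if (++seen[species] <= max)
                print $(col["Run"]) "," species "," $(col["LibrarySource"])
        }
    '
}

# Example (assumes the efetch output was saved as runinfo.csv):
#   pick_runs_per_species 10 < runinfo.csv > selected.csv
#   tail -n +2 selected.csv | cut -d',' -f1 > accessions.txt
```

The resulting accessions.txt is a plain list of run accessions, one per line, which is the kind of input a batch download expects.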