Question

download all metadata from SRA

4.2 years ago

grant.hovhannisyan ★ 2.6k

From SRA, how would you get the number of DNA-seq samples per year for the top 10 most frequently sequenced species? Alternatively, how can I download all SRA metadata? This source does not contain species information: ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/SRA_Accessions.tab

score 2 · Answer 1 · 2021-03-19

Here is an Entrez Direct (EDirect) query to get you started.
It is likely not a perfect query, and I will think about it some more later. Adjust the date range as needed.

$ esearch -db sra -query "2021/1/1:2021/1/2[Publication Date]"  | elink -target biosample | esummary | xtract -pattern DocumentSummary -element Organism | sort | uniq -c | sort -k1,1nr
 170 Glycine max
 121 Rhodeus ocellatus kurumeus
  83 air metagenome
  62 Culex bitaeniorhynchus
  48 Culex tritaeniorhynchus
  39 Escherichia coli
  37 Kalanchoe laxiflora
  36 Homo sapiens
  35 soil metagenome
  32 Mus musculus
  22 Rhodeus ocellatus ocellatus
  20 feces metagenome
  13 Salmonella enterica subsp. enterica serovar Infantis
   9 Cardamine flexuosa
   8 Salmonella enterica subsp. enterica serovar Kentucky
   7 Arabidopsis thaliana
   7 Campylobacter jejuni
   7 Salmonella enterica subsp. enterica serovar Enteritidis
   6 Zea mays
   4 Salmonella enterica subsp. enterica serovar Typhimurium
   3 Salmonella enterica
   3 Salmonella enterica subsp. enterica
   3 Salmonella enterica subsp. enterica serovar Newport
   2 Salmonella enterica subsp. enterica serovar Agona
   2 Salmonella enterica subsp. enterica serovar Eko
   2 Salmonella enterica subsp. enterica serovar London
   2 Salmonella enterica subsp. enterica serovar Schwarzengrund
   2 Vicia sativa
   2 mixed culture
   1 Abeliophyllum distichum f. lilacinum
   1 Aspergillus aculeatinus
   1 Campylobacter jejuni subsp. jejuni
   1 Fagus sylvatica
   1 Nicotiana
   1 Physalis pubescens
   1 Polygonatum kingianum
   1 Rhus punjabensis var. sinica
   1 Salmonella enterica subsp. enterica serovar 4,[5],12:i:-
   1 Salmonella enterica subsp. enterica serovar Anatum
   1 Salmonella enterica subsp. enterica serovar Brandenburg
   1 Salmonella enterica subsp. enterica serovar Derby
   1 Salmonella enterica subsp. enterica serovar Johannesburg
   1 Salmonella enterica subsp. enterica serovar Senftenberg
   1 Shigella sonnei
   1 freshwater sediment metagenome
   1 riverine metagenome
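
To go from a ranked list like the one above to per-year counts for the ten most frequent species, one option is to collect (year, organism) pairs, for example by running the query once per year, and then tally them. The following is only a minimal sketch in Python: the function name `top_species_by_year` and the example records are made up for illustration and are not part of EDirect or of any real query result.

```python
from collections import Counter, defaultdict


def top_species_by_year(records, n=10):
    """Return {organism: {year: count}} for the n most frequent organisms overall."""
    records = list(records)
    # Rank organisms by their total count across all years.
    totals = Counter(organism for _, organism in records)
    top = [organism for organism, _ in totals.most_common(n)]
    # Tally every organism per year, then keep only the top n.
    per_year = defaultdict(Counter)
    for year, organism in records:
        per_year[organism][year] += 1
    return {organism: dict(sorted(per_year[organism].items())) for organism in top}


# Made-up records purely for illustration; real ones would come from the query output.
example = [
    (2020, "Glycine max"),
    (2021, "Glycine max"),
    (2021, "Glycine max"),
    (2021, "Homo sapiens"),
    (2020, "Mus musculus"),
    (2020, "Homo sapiens"),
]

for organism, counts in top_species_by_year(example, n=2).items():
    print(organism, counts)
```

Ranking on the overall total first and splitting by year afterwards keeps the same ten species in every year, which is usually what a "top 10 per year" plot needs.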

If you are willing to write some code, you can extract a lot more information from a query like this:

$ esearch -db sra -query "2021/1/1:2021/1/2[Publication Date]"  | esummary | head -100

<!DOCTYPE DocumentSummarySet PUBLIC "-//NLM//DTD esummary sra 20130524//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130524/esummary_sra.dtd">

<DocumentSummarySet status="OK">
<DocumentSummary>
<Id>11835626</Id>
    <ExpXml>  <Summary><Title>RNA-Seq of early induced cardiac progenitors (Day-7)</Title><Platform instrument_model="Illumina HiSeq 2500">ILLUMINA</Platform><Statistics total_runs="1" total_spots="36730347" total_bases="11019104100" total_size="4562391529" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA1123873" center_name="University of Cincinnati" contact_name="Jialiang Liang" lab_name="Department of Pathology"/><Experiment acc="SRX9106574" ver="4" status="public" name="RNA-Seq of early induced cardiac progenitors (Day-7)"/><Study acc="SRP282054" name="Activation of endogenous genes by CRISPR enables conversion of mouse fibroblasts into cardiac progenitor cells"/><Organism taxid="10090" ScientificName="Mus musculus"/><Sample acc="SRS7349991" name=""/><Instrument ILLUMINA="Illumina HiSeq 2500"/><Library_descriptor><LIBRARY_NAME>T4</LIBRARY_NAME><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>Oligo-dT</LIBRARY_SELECTION><LIBRARY_LAYOUT>                 <PAIRED/>               </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA662934</Bioproject><Biosample>SAMN16109872</Biosample>  </ExpXml>
    <Runs>                                <Run acc="SRR12623858" total_spots="36730347" total_bases="11019104100" load_done="true" is_public="true" cluster_name="public" static_data_available="true"/>                                </Runs>
    <ExtLinks></ExtLinks>
    <CreateDate>2021/01/01</CreateDate>
    <UpdateDate>2021/02/02</UpdateDate>
</DocumentSummary>
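
As a rough illustration of the "write some code" route, the sketch below parses a record like the one above with Python's standard `xml.etree.ElementTree`. It assumes the XML arrives in the unescaped form shown here and is complete, so the sample is a trimmed copy of the record above with the closing `</DocumentSummarySet>` tag that `head -100` cut off added back. The function name `parse_summaries` and the choice of fields are illustrative, not a fixed schema.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record shown above, plus the closing tag that `head -100` cut off.
SAMPLE = """<DocumentSummarySet status="OK">
<DocumentSummary>
<Id>11835626</Id>
<ExpXml><Experiment acc="SRX9106574" ver="4" status="public"/><Organism taxid="10090" ScientificName="Mus musculus"/><Library_descriptor><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE></Library_descriptor></ExpXml>
<Runs><Run acc="SRR12623858" total_spots="36730347" total_bases="11019104100"/></Runs>
<CreateDate>2021/01/01</CreateDate>
</DocumentSummary>
</DocumentSummarySet>"""


def parse_summaries(xml_text):
    """Yield one dict per DocumentSummary with a few commonly needed fields."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("DocumentSummary"):
        organism = doc.find("ExpXml/Organism")
        experiment = doc.find("ExpXml/Experiment")
        yield {
            "experiment": experiment.get("acc") if experiment is not None else None,
            "organism": organism.get("ScientificName") if organism is not None else None,
            "strategy": doc.findtext("ExpXml/Library_descriptor/LIBRARY_STRATEGY"),
            "source": doc.findtext("ExpXml/Library_descriptor/LIBRARY_SOURCE"),
            # CreateDate looks like 2021/01/01, so the first four characters are the year.
            "year": (doc.findtext("CreateDate") or "")[:4] or None,
            "runs": [run.get("acc") for run in doc.iterfind("Runs/Run")],
        }


rows = list(parse_summaries(SAMPLE))
print(rows[0])
```

With records in this shape, keeping only rows whose `source` is `GENOMIC` would be one way to restrict the counts to DNA libraries before tallying organisms per year.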