NCBI came out with a cloud-based solution for querying SRA metadata. It might be worthwhile for you to look into that. Below I show the top row of what an AWS Athena query returns. Importantly, under the 'attributes' column you'll find many key/value pairs that are searchable within the query.
column                               value
acc                                  SRR12007843
assay_type                           RNA-Seq
center_name                          GEO
consent                              public
experiment                           SRX8541000
sample_name                          GSM4614996
instrument                           Illumina NovaSeq 6000
librarylayout                        SINGLE
libraryselection                     cDNA
librarysource                        TRANSCRIPTOMIC
platform                             ILLUMINA
sample_acc                           SRS6835808
biosample                            SAMN15230281
organism                             Homo sapiens
sra_study                            SRP267176
releasedate                          2020-07-31
bioproject                           PRJNA639275
mbytes                               324
loaddate                             (no value shown)
avgspotlen                           101
mbases                               1111
insertsize                           (no value shown)
library_name                         (no value shown)
biosamplemodel_sam                   (no value shown)
collection_date_sam                  (no value shown)
geo_loc_name_country_calc            (no value shown)
geo_loc_name_country_continent_calc  (no value shown)
geo_loc_name_sam                     (no value shown)
ena_first_public_run                 (no value shown)
ena_last_update_run                  (no value shown)
sample_name_sam                      (no value shown)
datastore_filetype                   [sra, run.zq, fastq]
datastore_provider                   [gs, s3, ncbi]
datastore_region                     [ncbi.public, gs.US, s3.us-east-1]
run_file_version                     1
attributes
    [{k=geo_accession_exp, v=GSM4614996}, {k=bases, v=1111010201},
    {k=bytes, v=340456699}, {k=run_file_create_date, v=2020-06-13T12:18:00.000Z},
    {k=cell_type_sam_ss_dpl37, v=PBMC}, {k=days_post_symptom_onset_sam, v=13},
    {k=disease_state_sam, v=COVID-19}, {k=gender_sam, v=male},
    {k=geographical_location_sam, v=USA: Atlanta, GA}, {k=severity_sam, v=ICU},
    {k=source_name_sam, v=PBMC}, {k=primary_search, v=15230281},
    {k=primary_search, v=639275}, {k=primary_search, v=GSE152418},
    {k=primary_search, v=GSM4614996}, {k=primary_search, v=GSM4614996_r1},
    {k=primary_search, v=PRJEB40771}, {k=primary_search, v=PRJNA639275},
    {k=primary_search, v=SAMN15230281}, {k=primary_search, v=SRP267176},
    {k=primary_search, v=SRR12007843}, {k=primary_search, v=SRS6835808},
    {k=primary_search, v=SRX8541000}]
jattr
    {"geo_accession_exp": ["GSM4614996"], "bases": 1111010201,
    "bytes": 340456699, "run_file_create_date": "2020-06-13T12:18:00.000Z",
    "cell_type_sam_ss_dpl37": ["PBMC"], "days_post_symptom_onset_sam": "13",
    "disease_state_sam": ["COVID-19"], "gender_sam": ["male"],
    "geographical_location_sam": "USA: Atlanta, GA", "severity_sam": "ICU",
    "source_name_sam": ["PBMC"], "primary_search": "15230281"}
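Once a row like this has been exported, the jattr column is the easiest part to work with locally because it is plain JSON. Here is a minimal Python sketch that parses the jattr value from the row above. Note that some values come wrapped in one-element lists and others are bare scalars, so the helper (flatten_jattr, a name I made up for illustration) unwraps the single-element lists. I'm assuming the column arrives as a JSON string, which is how it looks in the output.

```python
import json

# jattr value copied from the example row above (SRR12007843).
JATTR = '''{"geo_accession_exp": ["GSM4614996"], "bases": 1111010201,
"bytes": 340456699, "run_file_create_date": "2020-06-13T12:18:00.000Z",
"cell_type_sam_ss_dpl37": ["PBMC"], "days_post_symptom_onset_sam": "13",
"disease_state_sam": ["COVID-19"], "gender_sam": ["male"],
"geographical_location_sam": "USA: Atlanta, GA", "severity_sam": "ICU",
"source_name_sam": ["PBMC"], "primary_search": "15230281"}'''


def flatten_jattr(raw):
    """Parse a jattr JSON string and unwrap single-element lists.

    Hypothetical helper: longer lists and scalars are left untouched.
    """
    parsed = json.loads(raw)
    flat = {}
    for key, value in parsed.items():
        if isinstance(value, list) and len(value) == 1:
            value = value[0]
        flat[key] = value
    return flat


attrs = flatten_jattr(JATTR)
print(attrs["disease_state_sam"], attrs["severity_sam"], attrs["bases"])
```

After this, attrs behaves like an ordinary dict, so filtering a batch of exported rows on something like disease_state_sam is a one-line comparison.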
Looking for this info leads to https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/. I have not fully explored it, but it seems to require an AWS account and may require payment (a small one, if the docs are right).
This information can also be obtained using EntrezDirect:
Then searching for a specific sample (truncated to save space)
ya but,
entrez != SRA.metadata
...the two contain different data.
NCBI seems to be making it mandatory to use cloud-based resources, at least for some of their datasets. Regarding the SRA metadata specifically, I've exchanged a few emails with NCBI asking them to set up a publicly accessible SQL-like server for it, but they declined. Considering UCSC does this for all of their tables, it doesn't seem like a stretch for NCBI to do it for a few important tables. As far as I can tell, NCBI put all their effort into creating cloud resources and they don't want to go anywhere else. At least they support both Google and AWS. It could be useful to have more people write to them, or even to start a petition. I can't imagine that maintaining a public SQL-like server would be costly for NCBI...
AFAIK NCBI makes all SRA metadata available via its FTP site: https://ftp.ncbi.nih.gov/sra/reports/Metadata/ . If you have the infrastructure and expertise available, then downloading the files and parsing/searching them locally may be the easiest option. EntrezDirect is a suite of command-line tools used to query various NCBI databases.
It would be unfortunate if NCBI chooses to provide/store different data/metadata from different locations.
Seems like (maybe?) that's what has happened. From the example above we can see that the AWS and Entrez queries have a lot of overlap, but each contains data unique to that search option. I checked the FTP site, and that data contains everything in Entrez and the AWS table... interesting (and a shame) that neither of those search options contains everything in the FTP data. The FTP table is a real hassle to work with, though, and you can't really query the data directly without a lot of overhead. I suspect NCBI's longer-term plan is to shift to cloud-based resources.
I realize now that the cloud-based options have several tables that can be queried, and the table I showed is only the 'metadata' table. There is also a 'metadata_json' table, among others, that might return all the info available on the FTP site. In any case, in my experience the cloud-based search features are extremely useful; there's just a small price to pay for access.
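For anyone wondering what a query against the 'metadata' table might look like, here is a rough sketch, not the exact query used for the row shown earlier. It only builds the SQL text in Python. The column names (acc, bioproject, organism, assay_type) come from the output above, but the database name "sra" is an assumption on my part and depends on how the table was set up in your account. The helper names are made up for illustration, and actually running the query (Athena console or a client library) is not shown.

```python
def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_metadata_query(organism, assay_type, limit=10,
                         table='"sra"."metadata"'):
    """Build a SELECT over the metadata table.

    The default table path is an assumption; adjust it to match the
    database and table names registered in your own Athena setup.
    """
    return (
        "SELECT acc, bioproject, organism, assay_type\n"
        f"FROM {table}\n"
        f"WHERE organism = {sql_literal(organism)}\n"
        f"  AND assay_type = {sql_literal(assay_type)}\n"
        f"LIMIT {int(limit)}"
    )


query = build_metadata_query("Homo sapiens", "RNA-Seq", limit=1)
print(query)
```

Keeping a LIMIT and selecting only the columns you need seems sensible here, given that the service may bill for usage.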
Joe, do you have a sense that, in general, the cloud-based datasets are more complete?
AFAIK, this is the most comprehensive way to search SRA data. If you have a specific question I recommend writing to NCBI; they are responsive and helpful. I know there are other ways to find metadata, like Entrez or other interfacing tools, but each tool seems to contain different parts of the data. This cloud-based resource will have everything pertaining to the deposited reads. Also, I know there are other ways to access parts of this data; for example, if you change the SRR in the link below you'll find how the data is held. Maybe there is an equivalent for the metadata...
https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?acc=SRR12007843&accept-alternate-locations=yes
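Swapping the SRR in that link can be scripted. A small Python sketch follows, assuming only the two query parameters visible in the URL above. The function name locate_url is made up for illustration, and the sketch only builds the URL, it does not fetch it, since I haven't checked what the response format is.

```python
from urllib.parse import urlencode

# Endpoint and parameters taken from the link above.
BASE = "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve"


def locate_url(accession):
    """Build the locate URL for a given run accession (e.g. an SRR id)."""
    params = {"acc": accession, "accept-alternate-locations": "yes"}
    return BASE + "?" + urlencode(params)


url = locate_url("SRR12007843")
print(url)
```

Passing a different accession to locate_url gives the corresponding link for that run.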