Question

Secrets to finding metadata on SRA data?

0

2.8 years ago

sovrappensiero ▴ 100

I'm hoping some experts who know the ins-and-outs of using NCBI better than I do can help me with this fairly basic question.

I'm looking at a specific BioProject (Accession PRJDB11398). The abstract says it contains samples from 3 different patient groups: those with periodontitis, those with gingivitis, and healthy people. However, I cannot find any metadata about the runs. Am I missing the secret data source that would allow me to know which of these 252 BioSamples correspond to which disease group?

Second, related question: I can find no publication associated with this data. Am I not looking in the right place? Are there some "tricks" to finding a publication that perhaps did not get linked to its associated BioProject? This data was collected at Tokyo Medical and Dental University; is it possible that there is a non-English publication somewhere that is not indexed in Pubmed and, if so, any tips on how I might find it?

EDIT: I found the paper via a Google search. So yes, a Google search is the trick to finding a paper associated with a BioProject, and no, they are not always properly linked.

Thanks in advance.

metadata sra ncbi biosample bioproject • 3.7k views

ADD COMMENT • link updated 2.7 years ago by Jeremy Leipzig 22k • written 2.8 years ago by sovrappensiero ▴ 100

0

How useful would it be to go from a BioProject accession to an nf-core manifest?

ADD REPLY • link 2.7 years ago by Jeremy Leipzig 22k

0

FWIW I have no idea what an nf-core manifest is or what use it has - perhaps that should be the first thing to do, to demonstrate on a larger scale

ADD REPLY • link 2.7 years ago by Istvan Albert 102k

0

When you hit the green launch button here, it generates a manifest that describes the sample layout for an nf-core run.

ADD REPLY • link 2.7 years ago by Jeremy Leipzig 22k

1

2.8 years ago

Istvan Albert 102k

allow me to plug my cool package bio a bit (https://www.bioinfo.help/) that you could use like so:

bio search PRJDB11398

prints JSON like this:

[
  {
    "run_accession": "DRR280940",
    "sample_accession": "SAMD00288359",
    "first_public": "2021-08-29",
    "country": "",
    "sample_alias": "SAMD00288359",
    "fastq_bytes": "87567",
    "read_count": "731",
    "library_name": "",
    "library_strategy": "RNA-Seq",
    "library_source": "METATRANSCRIPTOMIC",
    "library_layout": "SINGLE",
    "instrument_platform": "ILLUMINA",
    "instrument_model": "Illumina MiSeq",
    "study_title": "Comparison of microbiome of periodontitis, gingivitis, and healthy",
    "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz"
  }
...
]

you can process the json with tools like jq or ask for the output as comma-separated fields like so:

bio search PRJDB11398 --csv

to get:

DRR280939,SAMD00288358,2021-08-29,,SAMD00288358,33669337,288229,,RNA-Seq,METATRANSCRIPTOMIC,SINGLE,ILLUMINA,Illumina MiSeq,"Comparison of microbiome of periodontitis, gingivitis, and healthy",ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280939/DRR280939.fastq.gz
DRR280940,SAMD00288359,2021-08-29,,SAMD00288359,87567,731,,RNA-Seq,METATRANSCRIPTOMIC,SINGLE,ILLUMINA,Illumina MiSeq,"Comparison of microbiome of periodontitis, gingivitis, and healthy",ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz
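The CSV rows above come without a header, and the study title contains commas inside quotes, so splitting on "," by hand would break. Here is a minimal Python sketch that reads such rows with the standard csv module. The column names are an assumption: I took them from the key order of the default JSON output shown earlier, and `parse_rows` is just a name made up for this example.

```python
import csv
import io

# Column names are an assumption: they follow the key order of the default
# JSON output shown earlier in this answer.
FIELDS = [
    "run_accession", "sample_accession", "first_public", "country",
    "sample_alias", "fastq_bytes", "read_count", "library_name",
    "library_strategy", "library_source", "library_layout",
    "instrument_platform", "instrument_model", "study_title", "fastq_ftp",
]


def parse_rows(text):
    """Turn headerless CSV text into a list of dicts keyed by FIELDS."""
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(FIELDS, row)) for row in reader if row]


# One row copied from the output above.
sample = (
    'DRR280940,SAMD00288359,2021-08-29,,SAMD00288359,87567,731,,RNA-Seq,'
    'METATRANSCRIPTOMIC,SINGLE,ILLUMINA,Illumina MiSeq,'
    '"Comparison of microbiome of periodontitis, gingivitis, and healthy",'
    'ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz\n'
)

rows = parse_rows(sample)
for row in rows:
    print(row["run_accession"], row["read_count"], row["study_title"])
```

In practice you would redirect the command output to a file and pass the file contents to `parse_rows` instead of the hard-coded sample.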

The above selects only the most commonly used fields. To actually get all the metadata, use:

bio search PRJDB11398 --all

it prints things like:

 {
        "accession": "SAMD00288359",
        "altitude": "",
        "assembly_quality": "",
        "assembly_software": "",
        "base_count": "105313",
        "binning_software": "",
        "bio_material": "",
        "broker_name": "",
        "cell_line": "",
        "cell_type": "",
        "center_name": "TOKYO_MEDEN",
        "checklist": "",
        "collected_by": "",
        "collection_date": "",
        "collection_date_submitted": "",
        "completeness_score": "",
        "contamination_score": "",
        "country": "",
        "cram_index_aspera": "",
        "cram_index_ftp": "",
        "cram_index_galaxy": "",
        "cultivar": "",
        "culture_collection": "",
        "depth": "",
        "description": "Illumina MiSeq sequencing; Illumina MiSeq sequencing of SAMD00288359",
        "dev_stage": "",
        "ecotype": "",
        "elevation": "",
        "environment_biome": "",
        "environment_feature": "",
        "environment_material": "",
        "environmental_package": "",
        "environmental_sample": "false",
        "experiment_accession": "DRX270523",
        "experiment_alias": "DRX270523",
        "experiment_title": "Illumina MiSeq sequencing; Illumina MiSeq sequencing of SAMD00288359",
        "experimental_factor": "",
        "fastq_aspera": "fasp.sra.ebi.ac.uk:/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz",
        "fastq_bytes": "87567",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz",
        "fastq_galaxy": "ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz",
        "fastq_md5": "6547d173a825a3da2ee89df2b0fa045f",
        "first_created": "2021-08-29",
        "first_public": "2021-08-29",
        "germline": "false",
        "host": "Homo sapiens",
        "host_body_site": "",
        "host_genotype": "",
        "host_gravidity": "",
        "host_growth_conditions": "",
        "host_phenotype": "",
        "host_sex": "",
        "host_status": "",
        "host_tax_id": "9606",
        "identified_by": "",
        "instrument_model": "Illumina MiSeq",
        "instrument_platform": "ILLUMINA",
        "investigation_type": "",
        "isolate": "",
        "isolation_source": "",
        "last_updated": "2021-08-29",
        "lat": "",
        "library_construction_protocol": "",
        "library_layout": "SINGLE",
        "library_name": "",
        "library_selection": "cDNA",
        "library_source": "METATRANSCRIPTOMIC",
        "library_strategy": "RNA-Seq",
        "location": "",
        "lon": "",
        "mating_type": "",
        "nominal_length": "",
        "nominal_sdev": "",
        "parent_study": "",
        "ph": "",
        "project_name": "",
        "protocol_label": "",
        "read_count": "731",
        "run_accession": "DRR280940",
        "run_alias": "DRR280940",
        "salinity": "",
        "sample_accession": "SAMD00288359",
        "sample_alias": "SAMD00288359",
        "sample_capture_status": "",
        "sample_collection": "",
        "sample_description": "sample9P_afterdec_Rup_clean.fq",
        "sample_material": "",
        "sample_title": "sample9P_afterdec_Rup_clean.fq",
        "sampling_campaign": "",
        "sampling_platform": "",
        "sampling_site": "",
        "scientific_name": "human oral metagenome",
        "secondary_sample_accession": "DRS202445",
        "secondary_study_accession": "DRP007622",
        "sequencing_method": "",
        "serotype": "",
        "serovar": "",
        "sex": "",
        "specimen_voucher": "",
        "sra_aspera": "fasp.sra.ebi.ac.uk:/vol1/drr/DRR280/DRR280940",
        "sra_bytes": "158541",
        "sra_ftp": "ftp.sra.ebi.ac.uk/vol1/drr/DRR280/DRR280940",
        "sra_galaxy": "ftp.sra.ebi.ac.uk/vol1/drr/DRR280/DRR280940",
        "sra_md5": "02d7286bc443f8aff5418ca1144dd624",
        "strain": "",
        "study_accession": "PRJDB11398",
        "study_alias": "DRP007622",
        "study_title": "Comparison of microbiome of periodontitis, gingivitis, and healthy",
        "sub_species": "",
        "sub_strain": "",
        "submission_accession": "DRA011737",
        "submission_tool": "",
        "submitted_aspera": "",
        "submitted_bytes": "",
        "submitted_format": "",
        "submitted_ftp": "",
        "submitted_galaxy": "",
        "submitted_host_sex": "",
        "submitted_md5": "",
        "submitted_sex": "",
        "target_gene": "",
        "tax_id": "447426",
        "taxonomic_classification": "",
        "taxonomic_identity_marker": "",
        "temperature": "",
        "tissue_lib": "",
        "tissue_type": "",
        "variety": ""
    }
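Most of the fields in that record are empty, so it can help to print only the ones that carry a value. A minimal Python sketch, assuming the `--all` output is a JSON list of objects like the default output (I have not shown the enclosing brackets above); `non_empty_fields` is a hypothetical helper name, and the embedded record is a shortened excerpt of the one above.

```python
import json


def non_empty_fields(record):
    """Return only the keys whose value is not an empty string."""
    return {key: value for key, value in record.items() if value != ""}


# A shortened excerpt of the record above (most empty keys omitted).
text = """
[
  {
    "accession": "SAMD00288359",
    "collection_date": "",
    "host": "Homo sapiens",
    "host_phenotype": "",
    "host_status": "",
    "run_accession": "DRR280940",
    "sample_title": "sample9P_afterdec_Rup_clean.fq"
  }
]
"""

for record in json.loads(text):
    for key, value in non_empty_fields(record).items():
        print(f"{key}\t{value}")
```

For this record, fields such as host_phenotype and host_status drop out, which makes it easier to see what the submitters actually filled in.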

ADD COMMENT • link 2.8 years ago by Istvan Albert 102k

0

Thanks, Istvan! This is pretty cool. It looks to me like they didn't upload the disease group for each BioSample. This json is quite nice!

ADD REPLY • link 2.8 years ago by sovrappensiero ▴ 100

score 2 · Accepted Answer · 2022-03-03

You can query SRA using Entrez Direct (EDirect) to get information about the runs. Here is an example snippet.

$ esearch -db sra -query "PRJDB11398" | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
DRR280789,2021-08-24 14:16:09,2021-08-24 14:21:18,76047,19783022,0,260,8,,https://sra-download.ncbi.nlm.nih.gov/traces/dra2/DRR/000274/DRR280789,DRX270372,,RNA-Seq,cDNA,METATRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina MiSeq,DRP007622,PRJDB11398,,757640,DRS202294,SAMD00288208,simple,447426,human oral metagenome,SAMD00288208,,,,,,,no,,,,,TOKYO_MEDEN,DRA011737,,public,CC21F307E5AB3AD9A5FC8C111E35DECB,32894C9FA91BD1719B942C8905CDD991
DRR280790,2021-08-24 14:16:09,2021-08-24 14:51:38,14603,3729248,0,255,1,,https://sra-download.ncbi.nlm.nih.gov/traces/dra2/DRR/000274/DRR280790,DRX270373,,RNA-Seq,cDNA,METATRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina MiSeq,DRP007622,PRJDB11398,,757640,DRS202295,SAMD00288209,simple,447426,human oral metagenome,SAMD00288209,,,,,,,no,,,,,TOKYO_MEDEN,DRA011737,,public,8448A0733DDE58ACE33791803DDC0B7B,BD15A6A02DF59D32588860D35EBDDD07
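To check whether any of the phenotype-like columns in the runinfo table are filled in, something like the following Python sketch may help. It assumes the runinfo output has been saved as text; `filled_columns` and the choice of columns to inspect are mine, and the embedded excerpt keeps only a subset of the columns from the two rows above.

```python
import csv
import io

# Columns of the runinfo table that could carry phenotype information.
PHENOTYPE_COLUMNS = ["Sex", "Disease", "Tumor", "Affection_Status", "Body_Site"]


def filled_columns(runinfo_text, columns=PHENOTYPE_COLUMNS):
    """Map each column name to the set of non-empty values found in it."""
    found = {name: set() for name in columns}
    for row in csv.DictReader(io.StringIO(runinfo_text)):
        for name in columns:
            value = (row.get(name) or "").strip()
            if value:
                found[name].add(value)
    return found


# Trimmed excerpt: only a subset of the columns, values taken from above.
excerpt = """Run,BioSample,Sex,Disease,Tumor,Affection_Status,Body_Site,CenterName
DRR280789,SAMD00288208,,,no,,,TOKYO_MEDEN
DRR280790,SAMD00288209,,,no,,,TOKYO_MEDEN
"""

for name, values in filled_columns(excerpt).items():
    print(name, sorted(values) if values else "(empty)")
```

For the two rows shown, the Disease column is empty, which fits the impression that the disease group was not recorded in these fields.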

It is possible that there is no publication associated with the data yet, since I don't see anything when querying PubMed with this accession. This check is probably not foolproof, though.

You can also use the NCBI SRA Run Selector to see some information. Click on the Metadata link to download a slightly different table than the one above.

Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,Bytes,Center Name,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,Experiment,Instrument,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,Sample Name,sample_name,SRA Study
DRR280689,RNA-Seq,260,30555000,PRJDB11398,SAMD00288108,15437534,TOKYO_MEDEN,public,sra,"gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",DRX270272,Illumina MiSeq,,SINGLE,cDNA,METATRANSCRIPTOMIC,human oral metagenome,ILLUMINA,2021-08-24T00:00:00Z,SAMD00288108,sample10G_afterdec_Fp_clean.fq,DRP007622
DRR280690,RNA-Seq,246,8921381,PRJDB11398,SAMD00288109,5551730,TOKYO_MEDEN,public,sra,"gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",DRX270273,Illumina MiSeq,,SINGLE,cDNA,METATRANSCRIPTOMIC,human oral metagenome,ILLUMINA,2021-08-24T00:00:00Z,SAMD00288109,sample10G_afterdec_Fup_clean.fq,DRP007622
DRR280691,RNA-Seq,165,19393221,PRJDB11398,SAMD00288110,12754105,TOKYO_MEDEN,public,sra,"gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",DRX270274,Illumina MiSeq,,SINGLE,cDNA,METATRANSCRIPTOMIC,human oral metagenome,ILLUMINA,2021-08-24T00:00:00Z,SAMD00288110,sample10G_afterdec_Rp_clean.fq,DRP007622
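The sample_name column looks structured (a number, a letter, and a suffix such as Fp or Rup). The letter after the number might encode the patient group, but that is only a guess on my part and is not documented in the metadata, so it should be checked against the paper before relying on it. A minimal Python sketch that splits the names into parts; the pattern is inferred only from the handful of names visible in this thread, and `split_sample_name` is a hypothetical helper.

```python
import re

# Pattern inferred from the few sample names visible in this thread only.
NAME_PATTERN = re.compile(r"^sample(\d+)([A-Za-z]+)_afterdec_([A-Za-z]+)_clean\.fq$")


def split_sample_name(name):
    """Split a sample name into (number, letter code, suffix), or None."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    number, code, suffix = match.groups()
    return int(number), code, suffix


names = [
    "sample10G_afterdec_Fp_clean.fq",
    "sample10G_afterdec_Fup_clean.fq",
    "sample10G_afterdec_Rp_clean.fq",
    "sample9P_afterdec_Rup_clean.fq",
]

for name in names:
    print(name, split_sample_name(name))
```

Names that do not fit the pattern return None, so it is easy to spot samples that follow a different naming scheme.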