Secrets to finding metadata on SRA data?
2.7 years ago

I'm hoping some experts who know the ins-and-outs of using NCBI better than I do can help me with this fairly basic question.

I'm looking at a specific BioProject (Accession PRJDB11398). The abstract says it contains samples from 3 different patient groups: those with periodontitis, those with gingivitis, and healthy people. However, I cannot find any metadata about the runs. Am I missing the secret data source that would allow me to know which of these 252 BioSamples correspond to which disease group?

Second, related question: I can find no publication associated with this data. Am I not looking in the right place? Are there some "tricks" to finding a publication that perhaps did not get linked to its associated BioProject? This data was collected at Tokyo Medical and Dental University; is it possible that there is a non-English publication somewhere that is not indexed in Pubmed and, if so, any tips on how I might find it?

EDIT: I found the paper via a Google search. So yes, a Google search is the trick to finding a paper associated with a BioProject, and no, the two are not always properly linked.

Thanks in advance.

metadata sra ncbi biosample bioproject • 3.6k views

How useful would it be to go from a BioProject accession to an nf-core manifest?


FWIW, I have no idea what an nf-core manifest is or what it is used for. Perhaps demonstrating that on a larger scale should be the first thing to do.


When you hit the green launch button here, it generates a manifest that describes the sample layout for an nf-core run.

2.7 years ago
GenoMax 147k

You can query SRA by using EntrezDirect to get information about the runs. Here is an example snippet.

$ esearch -db sra -query "PRJDB11398" | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
DRR280789,2021-08-24 14:16:09,2021-08-24 14:21:18,76047,19783022,0,260,8,,https://sra-download.ncbi.nlm.nih.gov/traces/dra2/DRR/000274/DRR280789,DRX270372,,RNA-Seq,cDNA,METATRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina MiSeq,DRP007622,PRJDB11398,,757640,DRS202294,SAMD00288208,simple,447426,human oral metagenome,SAMD00288208,,,,,,,no,,,,,TOKYO_MEDEN,DRA011737,,public,CC21F307E5AB3AD9A5FC8C111E35DECB,32894C9FA91BD1719B942C8905CDD991
DRR280790,2021-08-24 14:16:09,2021-08-24 14:51:38,14603,3729248,0,255,1,,https://sra-download.ncbi.nlm.nih.gov/traces/dra2/DRR/000274/DRR280790,DRX270373,,RNA-Seq,cDNA,METATRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,Illumina MiSeq,DRP007622,PRJDB11398,,757640,DRS202295,SAMD00288209,simple,447426,human oral metagenome,SAMD00288209,,,,,,,no,,,,,TOKYO_MEDEN,DRA011737,,public,8448A0733DDE58ACE33791803DDC0B7B,BD15A6A02DF59D32588860D35EBDDD07

It is possible that there is no publication associated with the data yet, since I don't see anything when querying PubMed with this accession. That check is probably not foolproof, though.


You can also use the NCBI SRA Run Selector to see some information. Click on the Metadata link to download a slightly different table than the one above.

Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,Bytes,Center Name,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,Experiment,Instrument,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,Sample Name,sample_name,SRA Study
DRR280689,RNA-Seq,260,30555000,PRJDB11398,SAMD00288108,15437534,TOKYO_MEDEN,public,sra,"gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",DRX270272,Illumina MiSeq,,SINGLE,cDNA,METATRANSCRIPTOMIC,human oral metagenome,ILLUMINA,2021-08-24T00:00:00Z,SAMD00288108,sample10G_afterdec_Fp_clean.fq,DRP007622
DRR280690,RNA-Seq,246,8921381,PRJDB11398,SAMD00288109,5551730,TOKYO_MEDEN,public,sra,"gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",DRX270273,Illumina MiSeq,,SINGLE,cDNA,METATRANSCRIPTOMIC,human oral metagenome,ILLUMINA,2021-08-24T00:00:00Z,SAMD00288109,sample10G_afterdec_Fup_clean.fq,DRP007622
DRR280691,RNA-Seq,165,19393221,PRJDB11398,SAMD00288110,12754105,TOKYO_MEDEN,public,sra,"gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",DRX270274,Illumina MiSeq,,SINGLE,cDNA,METATRANSCRIPTOMIC,human oral metagenome,ILLUMINA,2021-08-24T00:00:00Z,SAMD00288110,sample10G_afterdec_Rp_clean.fq,DRP007622
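A minimal sketch of reading the downloaded Run Selector table in Python, assuming it was saved as CSV text. The rows are copied from the table above with only four columns kept for brevity, and the variable and function names are illustrative. The csv module is used because some fields (for example DATASTORE provider) contain commas inside quotes, which a plain split on commas would break.

```python
import csv
import io

# Trimmed copy of the Run Selector rows shown above (only four columns kept).
RUN_SELECTOR_CSV = '''Run,BioSample,DATASTORE provider,sample_name
DRR280689,SAMD00288108,"gs,ncbi,s3",sample10G_afterdec_Fp_clean.fq
DRR280690,SAMD00288109,"gs,ncbi,s3",sample10G_afterdec_Fup_clean.fq
DRR280691,SAMD00288110,"gs,ncbi,s3",sample10G_afterdec_Rp_clean.fq
'''


def run_to_sample_name(csv_text):
    """Map each run accession to its submitter-supplied sample_name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Run"]: row["sample_name"] for row in reader}


names = run_to_sample_name(RUN_SELECTOR_CSV)
for run, name in names.items():
    print(run, name)
```

With the full downloaded file, the same function could be called on the file's contents instead of the embedded string.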

Thanks, GenoMax! I found this Metadata link...I don't see information about the patient's status/disease group. I think this must mean they didn't upload it. I just wanted to be sure I wasn't missing something obvious.


If that is sensitive information then you are not likely to get it from a public source.


I figured it out! The samples have labels like 10G, 10P, and 10H, which correspond to the different disease states (G, P, H) in a single patient's mouth (here, patient number 10). Each set of forward and reverse reads has different run IDs (e.g. for patient 10: DRR280689 and DRR280691 are the F and R reads, and DRR280690 is unpaired reads...). A little bit odd.
Thank you, again!
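The labelling scheme described above can be sketched as a small parser. This is only an illustration under stated assumptions: the mapping of G, P, and H to gingivitis, periodontitis, and healthy is inferred from the project abstract, and reading Fp/Rp as paired and Fup/Rup as unpaired reads is inferred from the run IDs mentioned in this thread. The function name and the regular expression are hypothetical and were checked only against the sample names that appear in this thread.

```python
import re

# Assumed meaning of the single-letter codes, inferred from the study abstract.
GROUPS = {"G": "gingivitis", "P": "periodontitis", "H": "healthy"}

# Assumed label layout: sample<patient><group>_afterdec_<F|R><p|up>_clean.fq
LABEL_RE = re.compile(r"^sample(\d+)([GPH])_afterdec_([FR])(u?)p_clean\.fq$")


def parse_label(label):
    """Split a sample label into patient, group, read direction and pairing."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised sample label: {label!r}")
    patient, group, direction, unpaired = match.groups()
    return {
        "patient": int(patient),
        "group": GROUPS[group],
        "direction": direction,
        "paired": unpaired == "",
    }


print(parse_label("sample10G_afterdec_Fp_clean.fq"))
print(parse_label("sample9P_afterdec_Rup_clean.fq"))
```

Raising an error on labels that do not match keeps unexpected naming variants from being silently assigned to the wrong group.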


That is odd. Not sure why they would submit these separately.


Great, thanks for sharing what you found.

I can't help but shake my head when I learn that, even though there are myriad metadata fields, the most important information, the one that identifies the sample's origin, had to be guessed and parsed out of a name.


Also found the paper via Google search. So, to answer my own question, yes the linking between Pubmed ID and BioProject is a bit...flimsy.

2.7 years ago

allow me to plug my cool package bio a bit (https://www.bioinfo.help/) that you could use like so:

bio search PRJDB11398

prints JSON like this:

[
  {
    "run_accession": "DRR280940",
    "sample_accession": "SAMD00288359",
    "first_public": "2021-08-29",
    "country": "",
    "sample_alias": "SAMD00288359",
    "fastq_bytes": "87567",
    "read_count": "731",
    "library_name": "",
    "library_strategy": "RNA-Seq",
    "library_source": "METATRANSCRIPTOMIC",
    "library_layout": "SINGLE",
    "instrument_platform": "ILLUMINA",
    "instrument_model": "Illumina MiSeq",
    "study_title": "Comparison of microbiome of periodontitis, gingivitis, and healthy",
    "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz"
  }
...
]

you can process the json with tools like jq or ask for the output as comma-separated fields like so:

bio search PRJDB11398 --csv

to get:

DRR280939,SAMD00288358,2021-08-29,,SAMD00288358,33669337,288229,,RNA-Seq,METATRANSCRIPTOMIC,SINGLE,ILLUMINA,Illumina MiSeq,"Comparison of microbiome of periodontitis, gingivitis, and healthy",ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280939/DRR280939.fastq.gz
DRR280940,SAMD00288359,2021-08-29,,SAMD00288359,87567,731,,RNA-Seq,METATRANSCRIPTOMIC,SINGLE,ILLUMINA,Illumina MiSeq,"Comparison of microbiome of periodontitis, gingivitis, and healthy",ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz

Now, the above selects only the most commonly used fields. To actually get all the metadata, use:

bio search PRJDB11398 --all

it prints things like:

 {
        "accession": "SAMD00288359",
        "altitude": "",
        "assembly_quality": "",
        "assembly_software": "",
        "base_count": "105313",
        "binning_software": "",
        "bio_material": "",
        "broker_name": "",
        "cell_line": "",
        "cell_type": "",
        "center_name": "TOKYO_MEDEN",
        "checklist": "",
        "collected_by": "",
        "collection_date": "",
        "collection_date_submitted": "",
        "completeness_score": "",
        "contamination_score": "",
        "country": "",
        "cram_index_aspera": "",
        "cram_index_ftp": "",
        "cram_index_galaxy": "",
        "cultivar": "",
        "culture_collection": "",
        "depth": "",
        "description": "Illumina MiSeq sequencing; Illumina MiSeq sequencing of SAMD00288359",
        "dev_stage": "",
        "ecotype": "",
        "elevation": "",
        "environment_biome": "",
        "environment_feature": "",
        "environment_material": "",
        "environmental_package": "",
        "environmental_sample": "false",
        "experiment_accession": "DRX270523",
        "experiment_alias": "DRX270523",
        "experiment_title": "Illumina MiSeq sequencing; Illumina MiSeq sequencing of SAMD00288359",
        "experimental_factor": "",
        "fastq_aspera": "fasp.sra.ebi.ac.uk:/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz",
        "fastq_bytes": "87567",
        "fastq_ftp": "ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz",
        "fastq_galaxy": "ftp.sra.ebi.ac.uk/vol1/fastq/DRR280/DRR280940/DRR280940.fastq.gz",
        "fastq_md5": "6547d173a825a3da2ee89df2b0fa045f",
        "first_created": "2021-08-29",
        "first_public": "2021-08-29",
        "germline": "false",
        "host": "Homo sapiens",
        "host_body_site": "",
        "host_genotype": "",
        "host_gravidity": "",
        "host_growth_conditions": "",
        "host_phenotype": "",
        "host_sex": "",
        "host_status": "",
        "host_tax_id": "9606",
        "identified_by": "",
        "instrument_model": "Illumina MiSeq",
        "instrument_platform": "ILLUMINA",
        "investigation_type": "",
        "isolate": "",
        "isolation_source": "",
        "last_updated": "2021-08-29",
        "lat": "",
        "library_construction_protocol": "",
        "library_layout": "SINGLE",
        "library_name": "",
        "library_selection": "cDNA",
        "library_source": "METATRANSCRIPTOMIC",
        "library_strategy": "RNA-Seq",
        "location": "",
        "lon": "",
        "mating_type": "",
        "nominal_length": "",
        "nominal_sdev": "",
        "parent_study": "",
        "ph": "",
        "project_name": "",
        "protocol_label": "",
        "read_count": "731",
        "run_accession": "DRR280940",
        "run_alias": "DRR280940",
        "salinity": "",
        "sample_accession": "SAMD00288359",
        "sample_alias": "SAMD00288359",
        "sample_capture_status": "",
        "sample_collection": "",
        "sample_description": "sample9P_afterdec_Rup_clean.fq",
        "sample_material": "",
        "sample_title": "sample9P_afterdec_Rup_clean.fq",
        "sampling_campaign": "",
        "sampling_platform": "",
        "sampling_site": "",
        "scientific_name": "human oral metagenome",
        "secondary_sample_accession": "DRS202445",
        "secondary_study_accession": "DRP007622",
        "sequencing_method": "",
        "serotype": "",
        "serovar": "",
        "sex": "",
        "specimen_voucher": "",
        "sra_aspera": "fasp.sra.ebi.ac.uk:/vol1/drr/DRR280/DRR280940",
        "sra_bytes": "158541",
        "sra_ftp": "ftp.sra.ebi.ac.uk/vol1/drr/DRR280/DRR280940",
        "sra_galaxy": "ftp.sra.ebi.ac.uk/vol1/drr/DRR280/DRR280940",
        "sra_md5": "02d7286bc443f8aff5418ca1144dd624",
        "strain": "",
        "study_accession": "PRJDB11398",
        "study_alias": "DRP007622",
        "study_title": "Comparison of microbiome of periodontitis, gingivitis, and healthy",
        "sub_species": "",
        "sub_strain": "",
        "submission_accession": "DRA011737",
        "submission_tool": "",
        "submitted_aspera": "",
        "submitted_bytes": "",
        "submitted_format": "",
        "submitted_ftp": "",
        "submitted_galaxy": "",
        "submitted_host_sex": "",
        "submitted_md5": "",
        "submitted_sex": "",
        "target_gene": "",
        "tax_id": "447426",
        "taxonomic_classification": "",
        "taxonomic_identity_marker": "",
        "temperature": "",
        "tissue_lib": "",
        "tissue_type": "",
        "variety": ""
    }
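Since most fields in the record above are empty, a quick way to see what the submitters actually provided is to drop the empty strings. Here is a minimal Python sketch, assuming the JSON has been captured as text. The embedded record is an abridged copy of the one above (most empty fields omitted), and the function name is illustrative.

```python
import json

# Abridged copy of the record shown above; most empty fields are omitted.
RECORD_JSON = '''
{
    "accession": "SAMD00288359",
    "altitude": "",
    "center_name": "TOKYO_MEDEN",
    "collection_date": "",
    "host": "Homo sapiens",
    "host_phenotype": "",
    "host_status": "",
    "run_accession": "DRR280940",
    "sample_title": "sample9P_afterdec_Rup_clean.fq"
}
'''


def non_empty_fields(record):
    """Return only the keys whose values are not empty strings."""
    return {key: value for key, value in record.items() if value != ""}


filled = non_empty_fields(json.loads(RECORD_JSON))
for key, value in filled.items():
    print(f"{key}: {value}")
```

On this record the disease-related fields such as host_status and host_phenotype drop out, which matches the observation in the thread that the group is only recoverable from the sample title.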

Thanks, Istvan! This is pretty cool. It looks to me like they didn't upload the disease group for each BioSample. This json is quite nice!
