Missing columns in meta table from SRA Selector
19 months ago
tnocs • 0

I'm trying to fetch metadata from the SRA Run Selector:

https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA253315&o=acc_s%3Aa

using the Linux command line and a project ID. I can do it with this line:

esearch -db sra -query PRJNA253315 | efetch -format runinfo > file.csv

But it doesn't give me all the columns; the "Antibody" and "TREATMENT" columns aren't there, for example. I know there is a way to specify exactly which columns I want, but I don't want to do that. I just want the exact table that I would get if I clicked the "Metadata" button on the website. How can I do this on the command line? Do esearch and efetch offer ways of doing this?
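To see exactly which columns differ, one can compare the header of the runinfo CSV with the header of the table downloaded via the "Metadata" button. A minimal sketch, assuming both tables are available as comma-separated text; the function name `missing_columns` and the tiny inline tables are hypothetical stand-ins, not real output:

```python
import csv
import io


def missing_columns(reference_csv: str, candidate_csv: str) -> list:
    """Return columns in reference_csv's header that candidate_csv lacks.

    Comparison is case-insensitive; the order follows the reference header.
    """
    ref_header = next(csv.reader(io.StringIO(reference_csv)))
    cand_header = next(csv.reader(io.StringIO(candidate_csv)))
    present = {name.strip().lower() for name in cand_header}
    return [name for name in ref_header if name.strip().lower() not in present]


# Hypothetical, heavily shortened stand-ins for the two tables.
run_selector = "Run,Antibody,TREATMENT\nSRR1448774,H3,DMSO\n"
runinfo = "Run,spots,bases\nSRR1448774,59803030,3049954530\n"

print(missing_columns(run_selector, runinfo))
```

With real files, the two strings would be read from the downloaded table and from `file.csv` instead.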

SRA esearch efetch • 982 views

Does esearch and efetch offer ways of doing this?

No, there does not seem to be one. The information provided in the SRA Run Selector is not identical to that provided by EntrezDirect.

19 months ago

Unfortunately, there is no enforced standard for what metadata must make it into the SRA. It is very frustrating, actually, and it makes reproducing any analysis needlessly complicated.

You can look at which fields EBI provides; sometimes EBI exposes more fields than SRA:

pip install bio

then look at the metadata that way:

bio search PRJNA253315 --all | more

prints things like:

[
    {
        "accession": "SAMN02870079",
        "altitude": "",
        "assembly_quality": "",
        "assembly_software": "",
        "base_count": "3049954530",
        "binning_software": "",
        "bio_material": "",
        "broker_name": "",
        "cell_line": "IMR90",
        "cell_type": "",
        "center_name": "GEO",
        "checklist": "",
        "collected_by": "",
        "collection_date": "",
        "collection_date_submitted": "",
        "completeness_score": "",
        "contamination_score": "",
        "country": "",
        "cram_index_aspera": "",
        "cram_index_ftp": "",
        "cram_index_galaxy": "",
        "cultivar": "",
        "culture_collection": "",
        "depth": "",
        "description": "Illumina HiSeq 2000 sequencing; GSM1418957: H3 ChIP (DMSO); Homo sapiens; ChIP-Seq",
        "dev_stage": "",
        "ecotype": "",
        "elevation": "",
        "environment_biome": "",
        "environment_feature": "",
        "environment_material": "",
        "environmental_package": "",
        "environmental_sample": "false",
        "experiment_accession": "SRX620734",
        "experiment_alias": "GSM1418957",
        "experiment_title": "Illumina HiSeq 2000 sequencing; GSM1418957: H3 ChIP (DMSO); Homo sapiens; ChIP-Seq",
        "experimental_factor": "",
        "fastq_aspera": "fasp.sra.ebi.ac.uk:/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz",
        "fastq_bytes": "2838144326",
        "fastq_galaxy": "ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz",
        "fastq_md5": "2ac617b0b8670c9d4a9bc15213f68c4f",
        "first_created": "2015-06-05",
        "first_public": "2015-06-05",
        "germline": "false",
        "host": "",
        "host_body_site": "",
        "host_genotype": "",
        "host_gravidity": "",
        "host_growth_conditions": "",
        "host_phenotype": "",
        "host_sex": "",
        "host_status": "",
        "host_tax_id": "",
        "identified_by": "",
        "instrument_model": "Illumina HiSeq 2000",
        "instrument_platform": "ILLUMINA",
        "investigation_type": "",
        "isolate": "",
        "isolation_source": "",
        "last_updated": "2019-11-16",
        "lat": "",
        "library_construction_protocol": "For ChIP-seq, cells were crosslinked with formaldehyde (1% final) for 10min at room temperature, and harvested for sonication.  Nuclei were extracted and chromatin was sheared to an average size of 200bp using a Diagenode Bioruptor. For RNA-seq, cells were harvested and PolyA+ RNA was isolated using the NEBNext Ultra RNA-seq Isolation Module. For ATAC-seq, cells were harvested, nuclei were prepped,and transposase was added for 30 minutes at 30C. Sequencing libraries for ChIP-seq were constructd using the NEBNext Ultra kit as per manufacturer's recommended instructions Sequencing libraries for ATAC-seq were constructed using custom Nextera-compatible primers, from Nextera-adapted DNA fragments",
        "library_layout": "SINGLE",
        "library_name": "",
        "library_selection": "ChIP",
        "library_source": "GENOMIC",
        "library_strategy": "ChIP-Seq",
        "location": "",
        "lon": "",
        "mating_type": "",
        "nominal_length": "",
        "nominal_sdev": "",
        "parent_study": "PRJNA9558",
        "ph": "",
        "project_name": "",
        "protocol_label": "",
        "read_count": "59803030",
        "run_accession": "SRR1448774",
        "run_alias": "GSM1418957_r1",
        "salinity": "",
        "sample_accession": "SAMN02870079",
        "sample_alias": "GSM1418957",
        "sample_capture_status": "",
        "sample_collection": "",
        "sample_description": "H3 ChIP (DMSO)",
        "sample_material": "",
        "sample_title": "H3 ChIP (DMSO)",
        "sampling_campaign": "",
        "sampling_platform": "",
        "sampling_site": "",
        "scientific_name": "Homo sapiens",
        "secondary_sample_accession": "SRS645140",
        "secondary_study_accession": "SRP043510",
        "sequencing_method": "",
        "serotype": "",
        "serovar": "",
        "sex": "",
        "specimen_voucher": "",
        "sra_aspera": "fasp.sra.ebi.ac.uk:/vol1/srr/SRR144/004/SRR1448774",
        "sra_bytes": "1994732885",
        "sra_ftp": "ftp.sra.ebi.ac.uk/vol1/srr/SRR144/004/SRR1448774",
        "sra_galaxy": "ftp.sra.ebi.ac.uk/vol1/srr/SRR144/004/SRR1448774",
        "sra_md5": "e3920e0a35006ada4a8738af2c7bfcf7",
        "strain": "",
        "study_accession": "PRJNA253315",
        "study_alias": "GSE58740",
        "study_title": "Chromatin dynamics of p53 binding sites in IMR90",
        "sub_species": "",
        "sub_strain": "",
        "submission_accession": "SRA172049",
        "submission_tool": "",
        "submitted_aspera": "",
        "submitted_bytes": "",
        "submitted_format": "",
        "submitted_ftp": "",
        "submitted_galaxy": "",
        "submitted_host_sex": "",
        "submitted_md5": "",
        "submitted_sex": "",
        "target_gene": "",
        "tax_id": "9606",
        "taxonomic_classification": "",
        "taxonomic_identity_marker": "",
        "temperature": "",
        "tissue_lib": "",
        "tissue_type": "",
        "variety": "",
        "fastq_url": [
            "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz"
        ],
        "info": "3 GB file; 60 million reads; 3.0 billion sequenced bases"
    },
  [...]        
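Most fields in a record like the one above are empty strings. A minimal sketch for keeping only the populated ones, assuming the output of the `bio search` command was saved as JSON and has the list-of-objects shape shown above; the helper name `drop_empty` is hypothetical, and the inline record is a shortened copy of the one above:

```python
import json


def drop_empty(record: dict) -> dict:
    """Keep only fields whose value is not an empty string or empty list."""
    return {key: value for key, value in record.items() if value not in ("", [])}


# Shortened copy of the record printed above.
raw = """
[
    {
        "accession": "SAMN02870079",
        "altitude": "",
        "cell_line": "IMR90",
        "cell_type": "",
        "run_accession": "SRR1448774",
        "sample_title": "H3 ChIP (DMSO)"
    }
]
"""

for record in json.loads(raw):
    print(json.dumps(drop_empty(record), indent=4))
```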

It is pretty nuts, actually; look at all the fields that are not filled in. Sometimes you can parse out the missing information from other fields.
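For this project, for example, the antibody and treatment appear to be encoded in `sample_title` ("H3 ChIP (DMSO)"). A minimal sketch of parsing them out, assuming every ChIP title follows the `<antibody> ChIP (<treatment>)` pattern seen in the one record above; that pattern is an assumption that may not hold for other samples or projects, and `parse_title` is a hypothetical helper:

```python
import re

# Assumed pattern, inferred from the single title "H3 ChIP (DMSO)".
TITLE_PATTERN = re.compile(r"^(?P<antibody>.+?)\s+ChIP\s+\((?P<treatment>[^)]+)\)$")


def parse_title(sample_title: str):
    """Return (antibody, treatment), or None if the title does not match."""
    match = TITLE_PATTERN.match(sample_title.strip())
    if match is None:
        return None
    return match.group("antibody"), match.group("treatment")


print(parse_title("H3 ChIP (DMSO)"))
```

Returning `None` for titles that do not match keeps unexpected formats visible instead of silently guessing a value.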


To be fair, it is the submitter's responsibility to fill the information in. If NCBI made all fields mandatory, the submission process would become more difficult than it is now. As it is, people struggle with SRA submissions.


I understand that, and I dislike the current submission process because it asks so many useless questions. The goal is not to make every field mandatory; the goal is to ask the questions that are relevant.

If someone needs to connect a sample to a treatment, how do they do it automatically?

This should be a question on the form to make the submitter think about it and explain it in words - and not a mandatory field on the form.

19 months ago
zhousun21 ▴ 40

For a lot of SRA submissions, there is no antibody or treatment data associated with the organism or experiment. So, nothing for the submitter to enter in those fields.
