Question

different ways of downloading SRA metadata

2.8 years ago

Mathias ▴ 90

Hi all

I'm a little confused about where all data is stored and how to retrieve the different pieces for a particular GEO study (GSE113957). I've already retrieved the fastq files using sratools, and I'm looking at retrieving sample metadata now. I've also taken a look on biostars already, but there seem to be a couple of methods that get suggested.

One option is to retrieve the metadata through the Run Selector in the browser.

But I'd like to do it programmatically, or at least be able to download it on our server. That leaves several more options:

Use the Run info CGI
E-utilities URL call
E-utilities command line (Entrez Direct?)

I haven't tried the E-utilities yet, since I've got a metadata file using the Run info CGI:

wget -O ./SRP144355_info.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP144355'

But the file I've retrieved this way contains more fields, and different ones, than the file retrieved from the Run Selector.
Could someone point out what the difference is, or whether there is a preferred method?

SRA GEO • 2.7k views

updated 5 weeks ago by Wayne ★ 2.1k • written 2.8 years ago by Mathias ▴ 90

Hi. When I run the command you mentioned above (repeated below), the resulting file is empty.

wget -O ./SRP144355_info.csv 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP144355'

And when I open the same URL in a web browser, it shows "HTTP ERROR 400". Do you know how to solve it?

ADD REPLY • link 22 months ago by claracen2021 • 0

GenoMax · Answer 1 · 2022-05-11

You should be able to get this information from SRA using EntrezDirect (there are 143 samples; two are shown here as examples):

$ esearch -db sra -query PRJNA454681 | efetch -format runinfo 
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR7093892,2018-11-17 11:42:03,2018-05-02 14:26:33,22292412,1671930900,0,75,565,,https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR7093892/SRR7093892,SRX4022539,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,NextSeq 500,SRP144355,PRJNA454681,3,454681,SRS3243030,SAMN09011827,simple,9606,Homo sapiens,GSM3124643,,,,,,,no,,,,,GEO,SRA698774,,public,A04A18FF048292A7C08F44610FF9644F,9D194CE3DBD0D7663327F15C40DA1110
SRR7093893,2018-11-17 11:42:03,2018-05-02 14:23:32,11074462,830584650,0,75,281,,https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR7093893/SRR7093893,SRX4022540,,RNA-Seq,cDNA,TRANSCRIPTOMIC,SINGLE,0,0,ILLUMINA,NextSeq 500,SRP144355,PRJNA454681,3,454681,SRS3243029,SAMN09011826,simple,9606,Homo sapiens,GSM3124644,,,,,,,no,,,,,GEO,SRA698774,,public,2D1372BD93EBE81264A845C294738123,1A74D57F233EB2B791D317ADED0C404F
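
To work with this output programmatically, the CSV can be parsed with Python's csv module. The sketch below is only an illustration and not part of the original answer: it uses an abridged copy of the two rows above (seven of the columns), and the function name runs_to_samples is made up for this example.

```python
import csv
import io

# Two rows abridged from the runinfo output above; a real file has many more columns.
runinfo_text = """Run,Experiment,SRAStudy,BioProject,Sample,BioSample,SampleName
SRR7093892,SRX4022539,SRP144355,PRJNA454681,SRS3243030,SAMN09011827,GSM3124643
SRR7093893,SRX4022540,SRP144355,PRJNA454681,SRS3243029,SAMN09011826,GSM3124644
"""

def runs_to_samples(handle):
    """Map each run accession to its SRA sample, BioSample and GEO sample name."""
    reader = csv.DictReader(handle)
    return {
        row["Run"]: (row["Sample"], row["BioSample"], row["SampleName"])
        for row in reader
        if row.get("Run")  # skip blank or malformed lines
    }

mapping = runs_to_samples(io.StringIO(runinfo_text))
for run, (srs, biosample, gsm) in mapping.items():
    print(run, srs, biosample, gsm)
```

With a file saved from efetch, passing open("runinfo.csv", newline="") instead of the StringIO object should work the same way (the file name is an assumption).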

score 0 · Answer 2 · 2025-01-31

ffq offers metadata retrieval from the SRA.

See Gálvez-Merchán, Á., et al. (2023) 'Metadata retrieval from sequence databases with ffq'

Installing ffq gets you both a command line tool and a Python module. (See the note below: on some systems the command line tool isn't easy to access, so a Python equivalent is provided in the example.) From the paper, it seems to use the NCBI Entrez programming utilities under the hood.

Specific example with ffq paralleling GenoMax's example

Let's assume you are doing this in Jupyter. You can do that without installing anything on your machine, or even logging in, by running the following in a temporary Jupyter session started from here by pressing 'launch binder':

%pip install ffq pyjq

We'll use pyjq later, but install it now. Restart the kernel after that using Kernel > Restart Kernel....

Run the following using ffq:

!ffq SRP144355 -o SRP144355.txt

(Leave off the exclamation point to run that in a terminal.)

You'll get something like the following. The output is truncated here with ... because, as GenoMax points out in his post, there are 143 samples.

For stderr:

[2025-01-31 20:12:29,901]    INFO Parsing Study SRP144355
[2025-01-31 20:12:30,107]    INFO Getting Sample for SRP144355
/srv/conda/envs/notebook/lib/python3.10/site-packages/ffq/utils.py:1082: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  return BeautifulSoup(
[2025-01-31 20:14:51,501] WARNING There are 143 samples for SRP144355
[2025-01-31 20:14:51,501]    INFO Parsing sample SRS3242949
[2025-01-31 20:14:51,691] WARNING Failed to parse sample information from ENA XML. Falling back to ENA search...
[2025-01-31 20:14:51,821]    INFO Getting Experiment for SRS3242949
[2025-01-31 20:14:51,821]    INFO Parsing Experiment SRX4022459
...

And as the output, you'll get:

{
    "SRP144355": {
        "accession": "SRP144355",
        "title": "Predicting age from the transcriptome of human dermal fibroblasts",
        "abstract": "There is a marked heterogeneity in human lifespan and health outcomes for people of the same chronological age. 
...

Next we need to get that JSON into Python and parse it.

There are a couple of ways to get the JSON from ffq into a variable, which we'll call data here.
If you are using a Python kernel in Jupyter, you can skip the intermediate file entirely by running this in a cell:

%%capture out
import sys
from ffq.main import main
sys.argv = ['ffq','SRP144355']
main()

(Note that this option also works in cloud environments like Anaconda Cloud, where running ffq in a terminal or with an exclamation point in a cell may not work because pip doesn't easily install the command line interface everywhere.)

In subsequent cells you can access the output with out.stdout, such as:

data_str = out.stdout

(The cell magic %%capture that Jupyter provides is a nice convenience, but it may confuse those unfamiliar with it because it mixes elements of shell and Python, so I am specifically pointing it out.)

Then parse the stdout string into a Python object named data with:

import json

data = json.loads(out.stdout)
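
Outside Jupyter there is no %%capture, but the standard library's contextlib.redirect_stdout can play a similar role. The following is a minimal sketch of that technique, not something taken from the ffq documentation: fake_main is a hypothetical stand-in for ffq's main(), on the assumption (implied by the %%capture approach above) that main() prints its JSON to standard output.

```python
import contextlib
import io
import json

def fake_main():
    # Hypothetical stand-in for ffq's main(): prints a JSON document to stdout.
    print(json.dumps({"SRP144355": {"accession": "SRP144355"}}))

buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    fake_main()  # everything printed inside this block lands in `buffer`

# Parse the captured text exactly as with out.stdout above.
captured = json.loads(buffer.getvalue())
print(captured["SRP144355"]["accession"])
```

In a real script you would set sys.argv and call ffq's main() inside the with block in place of fake_main().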

If you instead went with the command !ffq SRP144355 -o SRP144355.txt, read the saved file back in with:

import json

with open("SRP144355.txt") as f:
    data = json.load(f)

Whichever way you went, you should have data at the end of those steps. (Running type(data) will give you dict, because the parsed JSON is a Python dictionary.)

You can parse it with pyjq or with just plain Python. In this example we'll take advantage of the JSON structure and use its keys.

We can extract all the SRS accessions and SRR accessions with the following:

import json
import pyjq

# Assuming your data is loaded as:
# with open("SRP144355.txt") as f:
#     data = json.load(f)

# Get all SRS accessions
srs_query = '.[] | .samples | keys[]'
# Alternative query that gets the accessions from the full objects:
# srs_query = '.[] | .samples | .[] | .accession'

# Get all SRR accessions
srr_query = '.[] | .samples | .[] | .experiments | .[] | .runs | keys[]'
# Alternative query that gets the accessions from the full objects:
# srr_query = '.[] | .samples | .[] | .experiments | .[] | .runs | .[] | .accession'

def extract_accessions(data):
    # Extract both types of accessions
    srs_accessions = pyjq.all(srs_query, data)
    srr_accessions = pyjq.all(srr_query, data)

    print("SRS Accessions:", srs_accessions)
    print("\nSRR Accessions:", srr_accessions)

    # Print counts
    print(f"\nFound {len(srs_accessions)} SRS accessions")
    print(f"Found {len(srr_accessions)} SRR accessions")

    return srs_accessions, srr_accessions


srs_accessions, srr_accessions = extract_accessions(data)

That will give you 143 of each.
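
For the plain Python route, the same accessions can be collected by walking the nested dictionaries directly. This is a sketch under assumptions: demo_data is a hypothetical miniature that keeps only the study, samples, experiments and runs nesting that the pyjq queries rely on, and extract_accessions_plain is a name invented for this example.

```python
# Hypothetical miniature of the ffq output, keeping only the nesting used by
# the pyjq queries: study -> samples -> experiments -> runs.
demo_data = {
    "SRP144355": {
        "accession": "SRP144355",
        "samples": {
            "SRS3243030": {
                "accession": "SRS3243030",
                "experiments": {
                    "SRX4022539": {
                        "accession": "SRX4022539",
                        "runs": {"SRR7093892": {"accession": "SRR7093892"}},
                    }
                },
            }
        },
    }
}

def extract_accessions_plain(data):
    """Collect SRS and SRR accessions by walking the nested dictionaries."""
    srs, srr = [], []
    for study in data.values():
        for srs_acc, sample in study.get("samples", {}).items():
            srs.append(srs_acc)
            for experiment in sample.get("experiments", {}).values():
                # Extending a list with a dict adds the dict's keys (the SRR accessions).
                srr.extend(experiment.get("runs", {}))
    return srs, srr

srs_plain, srr_plain = extract_accessions_plain(demo_data)
print(srs_plain, srr_plain)
```

Calling extract_accessions_plain(data) on the full ffq output should give the same lists as the pyjq version, without the extra dependency.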

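Now, as GenoMax did for two examples, we can run ffq on the two samples behind the runs SRR7093892 & SRR7093893, which are SRS3243030 and SRS3243029 in his runinfo output:

import sys
from ffq.main import main

for acc in srs_accessions:
    # Only the samples behind runs SRR7093892 and SRR7093893
    if acc in ('SRS3243030', 'SRS3243029'):
        sys.argv = ['ffq', acc]
        main()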

That gives the result that starts out like so:

{
    "SRS3243030": {
        "accession": "SRS3243030",
        "title": "98_17yr_Male_Caucasian",
        "organism": "Homo sapiens",
        "attributes": {
            "INSDC secondary accession": "SRS3243030",
            "NCBI submission package": "Generic.1.0",
            "disease": "Normal",
            "ethnicity": "Caucasian",
            "organism": "Homo sapiens",
            "Sex": "male",
            "cell id": "GM07753",
            "age": "17",
            "source_name": "Skin; Unspecified",
            "BioSampleModel": "Generic",
            "ENA-FIRST-PUBLIC": "2022-03-29",
            "ENA-LAST-UPDATE": "2022-03-29"
        },
        "experiments": {
            "SRX4022539": {
                "accession": "SRX4022539",
                "title": "GSM3124643: 98_17yr_Male_Caucasian; Homo sapiens; RNA-Seq",
                "platform": "ILLUMINA",
                "instrument": "NextSeq 500",
                "runs": {
                    "SRR7093892": {
                        "accession": "SRR7093892",
                        "experiment": "SRX4022539",
...

If you wanted to do that for all the SRS accessions, just delete the conditional, like so:

for acc in srs_accessions:
    sys.argv = ['ffq',acc]
    main()

Adapt the code as you see fit using some Python.

The filtered loop (the one with the conditional) could also be run in a Jupyter cell using the shell escape instead of calling main():

for acc in srs_accessions:
    if acc in ('SRS3243030', 'SRS3243029'):
        !ffq {acc}
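
Since the original question was about sample metadata, the per-sample JSON can then be flattened into a table. The sketch below is one possible approach, not an ffq feature: sample_records is a hypothetical dictionary holding an abridged copy of the SRS3243030 record shown earlier, and sample_rows and rows_to_csv are names invented for this example.

```python
import csv
import io

# Abridged copy of the SRS3243030 record shown earlier; only some attributes kept.
sample_records = {
    "SRS3243030": {
        "accession": "SRS3243030",
        "title": "98_17yr_Male_Caucasian",
        "organism": "Homo sapiens",
        "attributes": {
            "disease": "Normal",
            "ethnicity": "Caucasian",
            "Sex": "male",
            "age": "17",
        },
    }
}

def sample_rows(records):
    """Flatten each sample into one dict: accession and title plus its attributes."""
    rows = []
    for accession, sample in records.items():
        row = {"accession": accession, "title": sample.get("title", "")}
        row.update(sample.get("attributes", {}))
        rows.append(row)
    return rows

def rows_to_csv(rows):
    """Return the rows as CSV text, using the union of all keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = sample_rows(sample_records)
print(rows_to_csv(rows))
```

Taking the union of keys for the header means samples with differing attribute sets still fit in one table, with empty cells where an attribute is missing.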