Problem extracting phenotypic data from GEO
8 months ago

Hi. I'm performing a WGCNA analysis using the GSE152418 dataset. This dataset has 34 samples, but I'm getting phenotypic data for only 6 of them. I've run the same code on other GEO datasets and I get only 6 samples for those as well. How can I resolve this?

Here's my result and code:

Result

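library(GEOquery)
library(Biobase)  # provides pData() and phenoData()

# Get the metadata
geo_id <- "GSE152418"
gse <- getGEO(geo_id, GSEMatrix = TRUE)
pheno_data <- pData(phenoData(gse[[1]]))
# Preview the first rows of the phenotype table
a <- head(pheno_data)
View(a)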

Thank you so much.
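
A note on the six rows, judging only from the code above: `head()` returns the first six rows of a data frame by default, so `a` would have six rows no matter how many samples `pheno_data` holds. Below is a minimal base-R sketch using a made-up 34-row table; the column names and values are placeholders, not the real GEO fields.

```r
# Stand-in for pheno_data: 34 hypothetical samples (not the real GEO table)
pheno_data <- data.frame(
  geo_accession = sprintf("GSM%07d", 1:34),
  gender = rep(c("male", "female"), length.out = 34),
  stringsAsFactors = FALSE
)

preview <- head(pheno_data)  # default n = 6
nrow(preview)                # 6
nrow(pheno_data)             # 34

# To inspect every sample, look at the full table rather than the preview,
# e.g. View(pheno_data) in an interactive session
table(pheno_data$gender)
```

With the real object, checking `nrow(pheno_data)` before calling `head()` should show how many samples `getGEO()` actually returned for the series.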

WGCNA GEO • 1.2k views
8 months ago
noodle ▴ 590

NCBI has released a cloud-based solution for querying metadata, and it might be worthwhile for you to look into it. Below is the top row returned by the following AWS Athena query. Importantly, the 'attributes' column holds many key/value pairs that are searchable within the query.

SELECT *
FROM metadata
WHERE bioproject = 'PRJNA639275' 

Top row of the result (row 1), shown as column: value pairs (columns with no value in this row are listed at the end):

acc: SRR12007843
assay_type: RNA-Seq
center_name: GEO
consent: public
experiment: SRX8541000
sample_name: GSM4614996
instrument: Illumina NovaSeq 6000
librarylayout: SINGLE
libraryselection: cDNA
librarysource: TRANSCRIPTOMIC
platform: ILLUMINA
sample_acc: SRS6835808
biosample: SAMN15230281
organism: Homo sapiens
sra_study: SRP267176
releasedate: 2020-07-31
bioproject: PRJNA639275
mbytes: 324
avgspotlen: 101
mbases: 1111
datastore_filetype: [sra, run.zq, fastq]
datastore_provider: [gs, s3, ncbi]
datastore_region: [ncbi.public, gs.US, s3.us-east-1]
attributes: [{k=geo_accession_exp, v=GSM4614996}, {k=bases, v=1111010201}, {k=bytes, v=340456699}, {k=run_file_create_date, v=2020-06-13T12:18:00.000Z}, {k=cell_type_sam_ss_dpl37, v=PBMC}, {k=days_post_symptom_onset_sam, v=13}, {k=disease_state_sam, v=COVID-19}, {k=gender_sam, v=male}, {k=geographical_location_sam, v=USA: Atlanta, GA}, {k=severity_sam, v=ICU}, {k=source_name_sam, v=PBMC}, {k=primary_search, v=15230281}, {k=primary_search, v=639275}, {k=primary_search, v=GSE152418}, {k=primary_search, v=GSM4614996}, {k=primary_search, v=GSM4614996_r1}, {k=primary_search, v=PRJEB40771}, {k=primary_search, v=PRJNA639275}, {k=primary_search, v=SAMN15230281}, {k=primary_search, v=SRP267176}, {k=primary_search, v=SRR12007843}, {k=primary_search, v=SRS6835808}, {k=primary_search, v=SRX8541000}]
jattr: {"geo_accession_exp": ["GSM4614996"], "bases": 1111010201, "bytes": 340456699, "run_file_create_date": "2020-06-13T12:18:00.000Z", "cell_type_sam_ss_dpl37": ["PBMC"], "days_post_symptom_onset_sam": "13", "disease_state_sam": ["COVID-19"], "gender_sam": ["male"], "geographical_location_sam": "USA: Atlanta, GA", "severity_sam": "ICU", "source_name_sam": ["PBMC"], "primary_search": "15230281"}
run_file_version: 1

Columns with no value in this row: loaddate, insertsize, library_name, biosamplemodel_sam, collection_date_sam, geo_loc_name_country_calc, geo_loc_name_country_continent_calc, geo_loc_name_sam, ena_first_public_run, ena_last_update_run, sample_name_sam
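
If a result like this is exported and read into R, the key/value pairs in the 'attributes' column can be pulled apart with base-R regular expressions. This is only a rough sketch: it assumes the column arrives as a single string in the `{k=..., v=...}` format shown above, and `attr_string` is a shortened hand copy of that row rather than real query output.

```r
# Shortened copy of the 'attributes' string from the row above
attr_string <- paste0(
  "[{k=geo_accession_exp, v=GSM4614996}, {k=disease_state_sam, v=COVID-19}, ",
  "{k=gender_sam, v=male}, {k=geographical_location_sam, v=USA: Atlanta, GA}, ",
  "{k=severity_sam, v=ICU}]"
)

# One {k=..., v=...} pair: the key has no comma, the value runs to the closing brace
pattern <- "\\{k=([^,]+), v=([^}]*)\\}"
pairs <- regmatches(attr_string, gregexpr(pattern, attr_string, perl = TRUE))[[1]]

# Split each matched pair into its key and its value
keys <- sub(pattern, "\\1", pairs, perl = TRUE)
values <- sub(pattern, "\\2", pairs, perl = TRUE)
attr_vec <- setNames(values, keys)

attr_vec[["gender_sam"]]                 # "male"
attr_vec[["geographical_location_sam"]]  # "USA: Atlanta, GA"
```

Keys that repeat, such as `primary_search` in the full row, would give duplicate names, and `[[` would then return only the first match.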

Looking for this info seems to lead to https://www.ncbi.nlm.nih.gov/sra/docs/sra-athena/. I have not fully explored this, but it seems to require an AWS account and may require payment (a small one, if the docs are right).

This information can also be obtained using EntrezDirect:

$ esearch -db sra -query PRJNA639275 | efetch -format runinfo
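
The runinfo output is CSV, so it can be redirected to a file and read into R to map run accessions to GEO sample IDs. Here is a minimal sketch, assuming the output includes `Run`, `Experiment` and `SampleName` columns. A trimmed, hand-written stand-in built from accessions that appear in this thread is used in place of the real file.

```r
# Hand-written stand-in for `... | efetch -format runinfo > runinfo.csv`;
# the real output has many more columns and one row per run
runinfo_text <- "Run,Experiment,SampleName
SRR12007843,SRX8541000,GSM4614996
SRR12007881,SRX8541019,GSM4615034"

runinfo <- read.csv(text = runinfo_text, stringsAsFactors = FALSE)

# Named vector for looking up the GEO sample accession of a given run
run_to_gsm <- setNames(runinfo$SampleName, runinfo$Run)
run_to_gsm[["SRR12007881"]]  # "GSM4615034"
```

With a real file, `read.csv("runinfo.csv")` would replace the `text =` argument.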

Then search for a specific sample (output truncated to save space):

$ esearch -db sra -query SRR12007881 | efetch -format native
<?xml version="1.0" encoding="UTF-8"  ?>
<EXPERIMENT_PACKAGE_SET>
<EXPERIMENT_PACKAGE><EXPERIMENT alias="GSM4615034" accession="SRX8541019"><IDENTIFIERS><PRIMARY_ID>SRX8541019</PRIMARY_ID></IDENTIFIERS><TITLE>GSM4615034: S183_263; Homo sapiens; RNA-Seq</TITLE><STUDY_REF accession="SRP267176" refname="GSE152418"><IDENTIFIERS><PRIMARY_ID>SRP267176</PRIMARY_ID></IDENTIFIERS></STUDY_REF><DESIGN><DESIGN_DESCRIPTION/><SAMPLE_DESCRIPTOR accession="SRS6835827"><IDENTIFIERS><PRIMARY_ID>SRS6835827</PRIMARY_ID><EXTERNAL_ID namespace="GEO">GSM4615034</EXTERNAL_ID></IDENTIFIERS></SAMPLE_DESCRIPTOR><LIBRARY_DESCRIPTOR><LIBRARY_STRATEGY>RNA-Seq</LIBRARY_STRATEGY><LIBRARY_SOURCE>TRANSCRIPTOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>cDNA</LIBRARY_SELECTION><LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT><LIBRARY_CONSTRUCTION_PROTOCOL>Total RNA was purified using the Qiagen miRNeasy Mini Kit cDNA was generated using the Clontech SMARTseq v4 Ultra Low Input kit and libraries were prepared for sequencing using the Illumina NexteraXT DNA kit</LIBRARY_CONSTRUCTION_PROTOCOL></LIBRARY_DESCRIPTOR></DESIGN><PLATFORM><ILLUMINA><INSTRUMENT_MODEL>Illumina NovaSeq 6000</INSTRUMENT_MODEL></ILLUMINA></PLATFORM><EXPERIMENT_LINKS><EXPERIMENT_LINK><XREF_LINK><DB>gds</DB><ID>304615034</ID><LABEL>GSM4615034</LABEL></XREF_LINK></EXPERIMENT_LINK></EXPERIMENT_LINKS><EXPERIMENT_ATTRIBUTES><EXPERIMENT_ATTRIBUTE><TAG>GEO Accession</TAG><VALUE>GSM4615034</VALUE></EXPERIMENT_ATTRIBUTE></EXPERIMENT_ATTRIBUTES></EXPERIMENT><SUBMISSION alias="GEO: GSE152418" broker_name="GEO" center_name="GEO" submission_comment="submission brokered by GEO" lab_name="" accession="SRA1086746"><IDENTIFIERS><PRIMARY_ID>SRA1086746</PRIMARY_ID><SUBMITTER_ID namespace="GEO">GEO: GSE152418</SUBMITTER_ID></IDENTIFIERS></SUBMISSION><Organization type="center"><Name abbr="GEO">NCBI</Name><Contact email="geo-group@ncbi.nlm.nih.gov"><Name><First>Geo</First><Last>Curators</Last></Name></Contact></Organization><STUDY center_name="GEO" alias="GSE152418" 
accession="SRP267176"><IDENTIFIERS><PRIMARY_ID>SRP267176</PRIMARY_ID><EXTERNAL_ID namespace="BioProject" label="primary">PRJNA639275</EXTERNAL_ID><EXTERNAL_ID namespace="GEO">GSE152418</EXTERNAL_ID></IDENTIFIERS><DESCRIPTOR><STUDY_TITLE>Systems biological assessment of immunity to severe and mild COVID-19 infections</STUDY_TITLE><STUDY_TYPE existing_study_type="Transcriptome Analysis"/><STUDY_ABSTRACT>The recent emergence of COVID-19 presents a major global crisis. Profound knowledge gaps remain about the interaction between the virus and the immune system. Here, we used a systems biology approach to analyze immune responses in 76 COVID-19 patients and 69 age and sex- matched controls, from Hong Kong and Atlanta. Mass cytometry revealed prolonged plasmablast and effector T cell responses, reduced myeloid expression of HLA-DR and inhibition of mTOR signaling in plasmacytoid DCs (pDCs) during infection. Production of pro-inflammatory cytokines  plasma levels of inflammatory mediators, including EN-RAGE, TNFSF14, and Oncostatin-M, which correlated with disease severity, and increased bacterial DNA and endotoxin in plasma in  and reduced HLA-DR and CD86 but enhanced EN-RAGE expression in myeloid cells in severe  transient expression of IFN stimulated genes in moderate infections, consistent with transcriptomic analysis of bulk PBMCs, that correlated with transient and low levels of plasma  COVID-19. Overall design: RNAseq analysis of PBMCs in a group of 17 COVID-19 subjects and 17 healthy controls</STUDY_ABSTRACT><CENTER_PROJECT_NAME>GSE152418</CENTER_PROJECT_NAME></DESCRIPTOR><STUDY_LINKS><STUDY_LINK><XREF_LINK><DB>pubmed</DB><ID>32788292</ID></XREF_LINK></STUDY_LINK></STUDY_LINKS><STUDY_ATTRIBUTES><STUDY_ATTRIBUTE><TAG>parent_bioproject</TAG><VALUE>PRJEB40771</VALUE></STUDY_ATTRIBUTE></STUDY_ATTRIBUTES></STUDY><SAMPLE alias="GSM4615034" accession="SRS6835827">

"This information can also be obtained using EntrezDirect:"

Yeah, but Entrez != SRA metadata; the two hold different data.

NCBI seems to be making it mandatory to use cloud-based resources, at least for some of their datasets. Regarding the SRA metadata specifically, I've exchanged a few emails with NCBI asking them to set up a publicly accessible SQL-like server for it, but they declined. Considering that UCSC does this for all of their tables, it doesn't seem like a stretch for NCBI to do it for a few important tables. As far as I can tell, NCBI has put all its effort into creating cloud resources and doesn't want to go anywhere else. At least they have both Google and AWS. It could be useful to have more people write to them, or even to create a petition. I can't imagine that maintaining a public SQL-like server would be costly for NCBI...

AFAIK NCBI makes all SRA metadata available via its FTP site: https://ftp.ncbi.nih.gov/sra/reports/Metadata/. If you have the infrastructure and expertise available, then downloading the files and parsing/searching them locally may be the easiest option. EntrezDirect is a suite of command-line tools used to query various NCBI databases.

It would be unfortunate if NCBI chooses to provide/store different data/metadata from different locations.

"It would be unfortunate if NCBI chooses to provide/store different data/metadata from different locations."

Seems like (maybe?) that's what has happened. From the above example we can see that the AWS and Entrez queries overlap a lot, but each contains data unique to that search option. I checked the FTP site, and that data contains everything in Entrez and in the AWS table. It's interesting (and a shame) that neither of those search options contains everything in the FTP data. The FTP table is a real hassle to work with though, and you can't really query the data itself directly without a lot of overhead. I suspect NCBI's longer-term plan is to shift to cloud-based resources.

$ tar -zxvf NCBI_SRA_Metadata_Full_20240321.tar.gz SRA1086746
SRA1086746/
SRA1086746/SRA1086746.study.xml
SRA1086746/SRA1086746.experiment.xml
SRA1086746/SRA1086746.sample.xml
SRA1086746/SRA1086746.submission.xml
SRA1086746/SRA1086746.run.xml

I now realize that the cloud-based options have several tables that can be queried, and the table I showed is only the 'metadata' table. There is a 'metadata_json' table, among others, that might return all the info available at the FTP site. In any case, in my experience the cloud-based search features are extremely useful; there's just a small price to pay for access.

joe, do you have a sense that the cloud-based datasets are, in general, more complete?

AFAIK, this is the most comprehensive way to search SRA data. If you have a specific question I recommend writing to NCBI; they are responsive and helpful. I know there are other ways to find metadata, like Entrez or other interfacing tools, but each tool seems to contain different parts of the data. This cloud-based resource will have everything pertaining to the deposited reads. Also, I know there are other ways to access parts of this data. For example, if you change the SRR in the link below you'll see how the data is held. Maybe there is an equivalent for the metadata...

https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?acc=SRR12007843&accept-alternate-locations=yes

8 months ago
GenoMax 147k

You can download the metadata for all samples here: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=GSE152418%20&o=acc_s%3Aa

Thank you. I'm sorry for any confusion. I need the phenotypic data.

My sincere apologies if I am the one who is confused, but I think the URL provided by GenoMax gives what you seek.

In the link provided there is a column "disease_state"; is this not the information you are looking for? There are 50 samples provided (perhaps not all were analyzed in the manuscript or some such, leaving 34?). Please clarify and we will help.

best, VAL

I needed the gender information, and I've got it now. Thank you.
