Finding and downloading many Gene Expression Matrices from GEO?
6.2 years ago
breckuh ▴ 30

We are setting up a pipeline that takes in single-cell Gene Expression Matrices (GEMs), runs them through a series of preprocessing steps, and then trains several machine learning models to generate classifiers for the label(s) on those datasets.

We'd like to build a collection of 1k datasets to test our pipeline against (~1% of GEO's GSE collection--the number could vary depending on submitted scRNAseq experiments with labels).

We are using the Bioconductor packages GEOquery and GEOmetadb. So far it's hard to figure out which GSEs have GEMs. Some do, some don't. Some just have links to GSMs. I wonder if I'm doing something dumb, or if most GSEs don't include GEMs?

Maybe someone with more experience using GEO could have some advice?

RNA-Seq next-gen rna-seq • 1.5k views
6.2 years ago

Sounds like a neat project.

Note that the Series Matrix Files, which virtually always contain the normalised expression data, may not always be listed on the GEO accession homepage; however, they can still be downloaded via GEOquery.

The easiest approach would be to compile your list of GEO datasets of interest and then download each one via:

library(Biobase)
library(GEOquery)
gset <- getGEO("GSE31432", GSEMatrix = TRUE, getGPL = FALSE)

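For the vast majority of datasets, the normalised expression data can then be readily accessed with:

# getGEO() returns one ExpressionSet per platform; if there are several, select by GPL ID
if (length(gset) > 1) idx <- grep("GPL6947", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
expr <- exprs(gset)  # expression matrix, features in rows and samples in columns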
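
For a long list of accessions, the same call can be wrapped so that one failing series does not stop the whole run. This is only a sketch: `fetch_many()` and its `fetch` argument are names made up for illustration, and the default assumes GEOquery is installed.

```r
# Hypothetical helper: try each accession in turn and keep going on errors.
# `fetch` defaults to GEOquery::getGEO() but can be swapped out, e.g. for testing.
fetch_many <- function(accessions,
                       fetch = function(acc) {
                         GEOquery::getGEO(acc, GSEMatrix = TRUE, getGPL = FALSE)
                       }) {
  results <- lapply(accessions, function(acc) {
    tryCatch(fetch(acc), error = function(e) {
      message("Skipping ", acc, ": ", conditionMessage(e))
      NULL
    })
  })
  names(results) <- accessions
  # Drop the accessions that failed (NULL entries)
  results[!vapply(results, is.null, logical(1))]
}
```

Calling `fetch_many(c("GSE31432"))` would return a named list with one entry per accession that downloaded successfully; with the default `fetch`, each entry is itself the list of ExpressionSets that `getGEO()` returns.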

You should prepare a list of accession IDs and aim to match them on various attributes in order to reduce bias in your analysis. There are many answers on Biostars about accessing GEO data. Seán has worked a lot on these, but he's now more active on the Bioconductor forum.
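
One way to draw up a first candidate list is to query a local copy of the GEOmetadb SQLite file. The sketch below rests on several assumptions: the file has already been downloaded (for example with `GEOmetadb::getSQLiteFile()`), the DBI and RSQLite packages are installed, and the `gse` table has the `type`, `title`, `summary` and `supplementary_file` columns. `candidate_query()` and `find_candidates()` are hypothetical names.

```r
# Hypothetical helper: build a query for sequencing-based series whose
# title or summary mentions a keyword. Table and column names are assumed
# from the GEOmetadb schema and should be checked against your copy.
candidate_query <- function(keyword) {
  kw <- gsub("'", "''", keyword, fixed = TRUE)  # escape single quotes for SQL
  paste0(
    "SELECT gse, title, supplementary_file FROM gse ",
    "WHERE type LIKE '%high throughput sequencing%' ",
    "AND (title LIKE '%", kw, "%' OR summary LIKE '%", kw, "%')"
  )
}

# Run the query against a local GEOmetadb.sqlite file.
find_candidates <- function(sqlite_path, keyword) {
  con <- DBI::dbConnect(RSQLite::SQLite(), sqlite_path)
  on.exit(DBI::dbDisconnect(con))
  DBI::dbGetQuery(con, candidate_query(keyword))
}
```

Something like `find_candidates("GEOmetadb.sqlite", "single cell")` would then give a data frame of accessions to filter further by organism, platform, or whatever attributes you want to match on.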

Finally, I'm a skeptic about machine learning (ML). Whatever predictive power your algorithm eventually achieves, I guarantee you that I could do better via non-ML based algorithms :)

Kevin


Thanks Kevin! What I found was that for single cell experiments, the common practice was to store the matrices in the supplementary files section of a GSE entry. I was able to download and extract many thousands of matrices. More cleaning and careful mapping work to do, but looking forward to getting some results soon, and training some DL models to best other algorithms :).
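
For anyone following the same route, a rough sketch of the supplementary-file approach. The helper names are made up, the file-name pattern is just a heuristic rather than any GEO convention, and `list_suppl()` assumes a GEOquery version whose `getGEOSuppFiles()` accepts `fetch_files = FALSE`.

```r
# Build the GEO FTP directory that holds a series' supplementary files.
# GEO groups series into folders named by replacing the last three digits
# of the accession with "nnn" (e.g. GSE12345 lives under GSE12nnn).
geo_suppl_url <- function(acc) {
  stopifnot(length(acc) == 1, grepl("^GSE[0-9]+$", acc))
  stub <- if (nchar(acc) > 6) sub("[0-9]{3}$", "nnn", acc) else "GSEnnn"
  sprintf("https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/", stub, acc)
}

# List the supplementary files of one series without downloading them.
# Assumes GEOquery is installed.
list_suppl <- function(acc) {
  GEOquery::getGEOSuppFiles(acc, fetch_files = FALSE)
}

# Heuristic (an assumption): file names that look like count or
# expression matrices.
looks_like_matrix <- function(fnames) {
  grepl("(count|matrix|expr|umi|tpm|fpkm).*\\.(txt|tsv|csv|mtx|h5)(\\.gz)?$",
        fnames, ignore.case = TRUE)
}
```

The idea would be to call `list_suppl()` per accession, keep the rows whose file names pass `looks_like_matrix()`, and download only those. Plenty of series use other naming schemes, so the pattern will miss some matrices and needs adjusting by hand.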


Great, hope that it all goes well!
