We are setting up a pipeline that takes single-cell gene expression matrices (GEMs), runs them through a series of preprocessing steps, and then trains various machine learning models to build classifiers for the label(s) on those datasets.
We'd like to build a collection of about 1k datasets to test our pipeline against (~1% of GEO's GSE collection; the number could vary depending on how many submitted scRNA-seq experiments come with labels).
We are using the Bioconductor packages GEOquery and GEOmetadb. So far it has been hard to figure out which GSEs include GEMs: some do, some don't, and some only have links to GSMs. I wonder if I'm doing something dumb, or if most GSEs simply don't include GEMs?
Maybe someone with more experience using GEO could have some advice?
Thanks Kevin! What I found is that for single-cell experiments, the common practice is to store the matrices in the supplementary files section of a GSE entry. I was able to download and extract many thousands of matrices. There is more cleaning and careful mapping work to do, but I'm looking forward to getting some results soon, and to training some DL models to best other algorithms :).
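For anyone landing here later, here is a minimal Python sketch of the screening idea, not the exact code used for the download above. It builds the supplementary-files directory URL for a GSE accession on the GEO FTP site and applies a rough filename heuristic to guess which supplementary files are expression matrices. The function names, the hint words, and the extension lists are all assumptions made for illustration; GEO does not standardize supplementary file names, so expect both false positives and misses.

```python
import re

# Assumed filename hints for matrix-like tables; GEO uploads are not
# standardized, so this list is a guess and will need tuning.
MATRIX_HINTS = re.compile(r"matrix|counts?|umi|tpm|fpkm|expression")

# Formats that are almost always matrices vs. plain tables that may be anything.
MATRIX_FORMATS = (".mtx", ".h5", ".h5ad", ".loom")
TABLE_FORMATS = (".csv", ".tsv", ".txt")
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".zip")


def suppl_dir_url(gse: str) -> str:
    """Return the GEO FTP supplementary directory for a series accession.

    GEO groups series into folders where the last three digits are replaced
    by 'nnn', e.g. GSE123456 lives under GSE123nnn and GSE99 under GSEnnn.
    """
    match = re.fullmatch(r"GSE(\d+)", gse)
    if match is None:
        raise ValueError(f"not a GSE accession: {gse!r}")
    stub = "GSE" + match.group(1)[:-3] + "nnn"
    return f"https://ftp.ncbi.nlm.nih.gov/geo/series/{stub}/{gse}/suppl/"


def looks_like_matrix(filename: str) -> bool:
    """Heuristically guess whether a supplementary file is an expression matrix."""
    name = filename.lower()
    # Strip one layer of compression so the real format is visible.
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    if name.endswith(MATRIX_FORMATS):
        return True
    if name.endswith(TABLE_FORMATS):
        # Plain tables only count when the name hints at expression values.
        return MATRIX_HINTS.search(name) is not None
    # Archives such as *_RAW.tar must be unpacked before they can be judged.
    return False


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for f in ["GSE123456_counts.csv.gz", "GSE123456_barcodes.tsv.gz", "GSE123456_RAW.tar"]:
        print(f, looks_like_matrix(f))
    print(suppl_dir_url("GSE123456"))
```

The heuristic deliberately returns False for archives like `*_RAW.tar`: those often hold per-sample matrices, but you only know after unpacking them and running the same check on the contents.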
Great, hope that it all goes well!