Beauty: ARCHS4 by A. Lachmann is a project that gives easy access to many gene expression datasets from the GEO database (Nature Communications paper, 2018).
Pain: it is easy to extract a dataset by GSE ID, but it is not clear to me (or to some of my colleagues) how to tell whether it is a single-cell or a bulk expression dataset.
Is there any way to do this automatically?
Maybe there is a list of GSE IDs of single-cell datasets somewhere on the internet?
(Then I could check against that list and take from ARCHS4 only the IDs that are in it.)
Or maybe there is an easy way to parse the GEO web page for a GSE ID, with some fields indicating whether the data are single-cell or bulk?
Or some other trick?
There is no automated way to the best of my knowledge, at least no NCBI built-in function, but others may prove me wrong. Can you link a relevant accession? Then we can try to point out some relevant fields that may help.
Thank you for your remark! We are looking at the ARCHS4 collection: it has about 300 datasets with more than 100 samples for human and about the same number for mouse. We want to benchmark some of our algorithms on single-cell datasets ONLY, and it is not clear how to distinguish single-cell from bulk without much pain. Going through 300+300 datasets manually is rather unpleasant :) Some info on these datasets can be found e.g. here (scroll up a few lines from the linked cell): https://www.kaggle.com/alexandervc/archs4-extract-datasets-by-gse-and-show-info?scriptVersionId=68318866&cellId=16 (that is the info included in the ARCHS4 data).
PS: an additional problem is that some datasets, like https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917, contain BOTH single-cell (about 365 records) and bulk (about 49 records) samples, and ARCHS4 stores both the single-cell and the bulk data.
Using EntrezDirect you can download information about the bioproject for this accession.
gets you
Which of these are single-cell and which are bulk, so I can check? I don't see a distinguishing feature in this output. One example of a single-cell sample is fine.
At https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE85917 we can see:
............
GSM2287748 P24_H9b3s_095
GSM2287749 P24_H9b3s_096
GSM2287750 P48_H1_bulk_249
GSM2287751 P48_H1_bulk_250
.....
So the last 48 GSMs are bulk and the others are single-cell, which matches the paper.
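
For this particular series, the split by sample title could be automated with something like the rough sketch below. It rests on two assumptions: that GEO's acc.cgi returns SOFT-formatted text when called with targ=gsm and form=text, and that bulk samples have the word "bulk" in their title (true for the GSE85917 titles listed above, but certainly not a general rule). The function names are made up for illustration, and samples without "bulk" in the title are only presumed to be single-cell.

```python
import re
import urllib.request

# Assumption: acc.cgi returns SOFT text for these query parameters.
GEO_URL = ("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
           "?acc={gse}&targ=gsm&form=text&view=brief")

# Assumption: bulk samples are marked with "bulk" in the sample title.
BULK_PATTERN = re.compile(r"bulk", re.IGNORECASE)


def fetch_soft(gse):
    """Download brief SOFT records for all samples of a series (needs network)."""
    with urllib.request.urlopen(GEO_URL.format(gse=gse)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_sample_titles(soft_text):
    """Return {GSM accession: sample title} parsed from SOFT text."""
    titles = {}
    current = None
    for line in soft_text.splitlines():
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "^SAMPLE":
            current = value
        elif key == "!Sample_title" and current is not None:
            titles[current] = value
    return titles


def split_by_title(titles):
    """Split accessions into (bulk, other) by a keyword match on the title."""
    bulk = sorted(g for g, t in titles.items() if BULK_PATTERN.search(t))
    other = sorted(g for g, t in titles.items() if not BULK_PATTERN.search(t))
    return bulk, other


if __name__ == "__main__":
    bulk, other = split_by_title(parse_sample_titles(fetch_soft("GSE85917")))
    print(len(bulk), "samples with 'bulk' in the title;", len(other), "others")
```

For other series the keyword will likely differ or be missing altogether, so the titles (and probably the characteristics and library strategy fields) still need a manual look before trusting the split.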