I need to retrieve and download a large number of fastq files from the GEO or SRA database. Is there a way to determine whether it is 10x data before running cellranger?
You can search GEO for "10x genomics" AND lung, which returns 2853 datasets. I would double-check the extract protocol and data processing fields in the series matrix file to make sure the data are actually 10x, and then use the SRA Run Selector to get the run accessions for downloading. You can also check CellXGene (110 lung datasets) or Single Cell Portal (150 lung datasets), since both list the specific 10x assay and let you explore and download the clustered datasets.
There is a very big difference between 2853 and 150. The question seems to be: given the 2853 GEO accessions, is there a way to flag them as "contains 10x data" versus "references 10x in a field but contains no 10x data"?
I don't know how, btw. I'd probably cop out and use an llm as a first pass :p
Actually, a better search is 10x[Description] AND Lung (3990 datasets). You can also use the advanced search builder: select Description, type 10X, and click "show index list" to look for specific assays, but there are far too many badly formatted values to find the right ones that way. It's much easier to download the series matrix files and search the first two columns of each file to figure out whether the data are 10x GEX, VDJ, ATAC, Multiome, FLEX, Spatial, CNV or something else, since all of these require different cellranger pipelines.
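To make the series matrix search concrete, here is a minimal Python sketch. It reads the first two columns of a series matrix file (field name and first value) and looks for 10x-related keywords in the protocol and processing fields. The keyword patterns and the function names are my own assumptions, not an official vocabulary, so expect to tune them against a handful of files you have checked by hand.

```python
import gzip
import re

# Series matrix fields that usually describe the library prep and processing.
FIELDS = (
    "!Sample_extract_protocol_ch1",
    "!Sample_data_processing",
    "!Sample_library_strategy",
    "!Series_summary",
    "!Series_overall_design",
)

# Hypothetical keyword patterns (assumptions); extend them as needed.
ASSAY_PATTERNS = {
    "GEX": r"single cell 3|single cell 5|cell ?ranger count",
    "VDJ": r"v\(d\)j|vdj",
    "ATAC": r"atac",
    "Multiome": r"multiome",
    "FLEX": r"\bflex\b|fixed rna",
    "Spatial": r"visium|spatial",
    "CNV": r"\bcnv\b|copy number",
}
TENX = re.compile(r"10x|chromium|cell ?ranger", re.I)


def classify_series_matrix_text(lines):
    """Return (mentions_10x, assay_labels) from series matrix lines."""
    parts = []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Column 1 is the field name, column 2 the first value.
        if cols[0] in FIELDS and len(cols) > 1:
            parts.append(cols[1].strip('"'))
    text = " ".join(parts)
    mentions_10x = bool(TENX.search(text))
    assays = {
        name
        for name, pattern in ASSAY_PATTERNS.items()
        if re.search(pattern, text, re.I)
    }
    return mentions_10x, assays


def classify_series_matrix_file(path):
    """Same as above, for a gzipped *_series_matrix.txt.gz file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return classify_series_matrix_text(handle)
```

A series that mentions 10x but gets no assay label, or several labels, is probably one to inspect manually rather than trust the keywords.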
Just asking with all due respect, does it make a lot of sense to batch download and process a large number of datasets without actually doing a careful and thoughtful curation of metadata first, to ensure that you're actually working with data that can answer your scientific question?
Yes, so I want to ask if it is possible to pre-filter the data through metadata before downloading it.
Can you provide a few example accessions? It may be possible to do this by looking at the metadata. Fastq data for 10x in SRA can be hit or miss in general.
In fact, I may need any single-cell sequencing data from lung tissue (normal or cancerous), so I picked a few candidate accessions at random: GSE236587, GSE278089, and GSE279114. Since I need specific tissues, searching GEO seems like the better idea, but because my research requires raw data, I have to download the SRA data.
You could just use kallisto / bustools or alevin to process a few hundred thousand reads and see what the results look like (obviously, if it's not the correct technology, then your resulting count matrix will have very few barcodes and very few counts). Those programs are much faster than cellranger and could be a good way to check whether a set of FASTQ files is 10x before diving into running cellranger.
Note that there are many versions of the 10x protocol.
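As an even cheaper first look than running a full quantifier, you could inspect the first reads of the barcode read directly. This is a different and much cruder technique than kallisto / bustools or alevin, sketched here under a few assumptions: the barcode is the first 16 bases of R1, and you supply your own barcode whitelist as a Python set. For reference, 10x 3' v2 R1 is normally 26 bp (16 bp barcode + 10 bp UMI) and v3 is 28 bp (16 + 12), but files deposited in SRA may have been trimmed or restructured, so treat the result as a hint only. The function names are hypothetical.

```python
import gzip
from collections import Counter
from itertools import islice


def summarize_r1(fastq_lines, whitelist=None, max_reads=200_000, barcode_len=16):
    """Return (read_length_counts, whitelist_fraction) for the first reads.

    whitelist_fraction is None when no whitelist is given or no reads are seen.
    """
    lengths = Counter()
    hits = 0
    total = 0
    # A FASTQ record has 4 lines; the sequence is the 2nd line of each record.
    sequences = islice(fastq_lines, 1, None, 4)
    for seq in islice(sequences, max_reads):
        seq = seq.strip()
        lengths[len(seq)] += 1
        total += 1
        # Assumption: the cell barcode sits at the start of the read.
        if whitelist is not None and seq[:barcode_len] in whitelist:
            hits += 1
    fraction = hits / total if (whitelist is not None and total) else None
    return lengths, fraction


def summarize_r1_file(path, whitelist=None, max_reads=200_000):
    """Same as above, for a gzipped R1 FASTQ file."""
    with gzip.open(path, "rt") as handle:
        return summarize_r1(handle, whitelist, max_reads)
```

If nearly all reads are 26 or 28 bp and a large fraction of barcodes hit the whitelist, the run is a reasonable candidate for cellranger; anything else deserves a closer look at the metadata.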
Sorry for my wording, what I meant to ask is whether it is possible to pre-filter the data by metadata before downloading it. Because if the data is not 10x data, cellranger will report an error before running.
For a different purpose, take a look at https://github.com/Nusob888/fasterqParseR