How to determine whether it is 10x data
8 weeks ago
king • 0

I need to retrieve and download a large number of fastq files from the GEO or SRA database. Is there a way to determine whether it is 10x data before running cellranger?

GEO 10x SRA • 1.2k views

Just asking, with all due respect: does it make sense to batch download and process a large number of datasets without first carefully curating the metadata, to ensure that you're actually working with data that can answer your scientific question?


Yes, so I want to ask if it is possible to pre-filter the data through metadata before downloading it.


Is there a way to determine whether it is 10x data before running cellranger?

Can you provide a few example accessions? It may be possible to do this by looking at the metadata. Fastq data for 10x in SRA can be hit or miss in general.


In fact, I may need any single-cell sequencing data from lung tissue (normal or cancerous), so I picked a few candidate datasets at random: GSE236587, GSE278089, and GSE279114. Since I need specific tissues, searching GEO seems like the better starting point, but my research requires raw data, so I have to download the corresponding SRA data.


You could just use kallisto / bustools or alevin to process a few hundred thousand reads and see what the results look like (obviously, if it's not the correct technology, your resulting count matrix will have very few barcodes and very few counts). Those programs are much faster than cellranger and could be a good way to check whether a set of FASTQ files is 10x before diving into running cellranger.

Note that there are many versions of the 10x protocol.
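As a rough sketch of the "process a few hundred thousand reads" idea, the snippet below copies the first N records of a gzipped FASTQ into a smaller file and tallies the read lengths. The function name head_fastq and the default of 200,000 reads are my own choices for illustration, not part of any tool. The subsample can then be passed to whichever quantifier you prefer. The length tally is only a hint: for 10x 3' gene expression, R1 is usually 26 nt (v2: 16 nt barcode + 10 nt UMI) or 28 nt (v3: 16 nt + 12 nt), but SRA uploads may have been trimmed or may lack the barcode read entirely.

```python
import gzip
from collections import Counter
from itertools import islice


def head_fastq(src, dst, n_reads=200_000):
    """Copy the first n_reads records of a gzipped FASTQ to dst.

    Returns a Counter mapping read length -> number of reads seen.
    Assumes standard 4-line FASTQ records (no wrapped sequences).
    """
    lengths = Counter()
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for _ in range(n_reads):
            record = list(islice(fin, 4))  # header, sequence, '+', qualities
            if len(record) < 4:            # end of file or truncated record
                break
            lengths[len(record[1].rstrip("\n"))] += 1
            fout.writelines(record)
    return lengths
```

Something like head_fastq("SRRxxxx_1.fastq.gz", "sub_1.fastq.gz") on each mate keeps the pairs in sync, since both files are read from the top (the file names here are placeholders).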


Sorry for my wording. What I meant to ask is whether it is possible to pre-filter the data by metadata before downloading it, because if the data is not 10x data, cellranger will report an error before running.


This was built for a different purpose, but take a look at https://github.com/Nusob888/fasterqParseR

8 weeks ago
Chris S. ▴ 340

You can search for "10x genomics" AND lung in GEO, which returns 2853 datasets. I would double-check the extract protocol and data processing steps in the series matrix file to make sure that it's actually 10x, and then use the SRA Run Selector to get the Run accessions for downloading. You can also check CellXGene (110 lung datasets) or Single Cell Portal (150 lung datasets), since both list the specific 10x assay and let you explore and download the clustered datasets.
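If you want to run that kind of GEO search from a script rather than the web page, a minimal sketch using the NCBI E-utilities esearch endpoint against the GEO DataSets database (db=gds) could look like this. The helper names geo_search_url and geo_search are made up for illustration, and the hit count you get back will drift over time, so don't expect exactly the numbers quoted in this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, retmax=100):
    """Build an esearch URL for the GEO DataSets database (db=gds)."""
    params = {"db": "gds", "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS}?{urlencode(params)}"


def geo_search(term, retmax=100):
    """Run the search; return (total hit count, list of GEO DataSets UIDs)."""
    with urlopen(geo_search_url(term, retmax)) as resp:
        result = json.load(resp)["esearchresult"]
    return int(result["count"]), result["idlist"]
```

Calling geo_search('"10x genomics" AND lung') needs network access and returns internal UIDs rather than GSE accessions, so a follow-up esummary call (not shown) would be needed to map them back.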


There is a very big difference between 2853 and 150. The question seems to be: given the 2853 GEO accessions, is there a way to flag them as "contains 10x data" versus "references 10x in a field but contains no 10x data"?

I don't know how, btw. I'd probably cop out and use an LLM as a first pass :p


Thank you for your answer, it's very inspiring.


Actually, a better search is 10x[Description] AND Lung (3990 datasets). You can use the advanced search builder: select Description, type 10X, and then click "Show index list" to look for specific assays, but there are way too many badly formatted values to find the right ones. It's much easier to download the series matrix files and search the first two columns to figure out whether it's 10x GEX, VDJ, ATAC, Multiome, FLEX, Spatial, CNV, or something else, since these all require different cellranger pipelines.
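Here is a rough sketch of what searching the first two columns of a series matrix file could look like. It scans the metadata lines (the ones starting with "!"), checks whether 10x is mentioned at all, and collects assay labels by keyword. The function name and the keyword patterns are my own guesses and are certainly incomplete, so treat the output as a first-pass flag to review by hand, not a classification.

```python
import re

# Hypothetical keyword patterns; extend them as you meet new wordings.
TENX = re.compile(r"10x|chromium|cell\s?ranger", re.I)
ASSAYS = {
    "GEX": re.compile(r"gene expression|single cell [35]'", re.I),
    "VDJ": re.compile(r"v\(d\)j|vdj", re.I),
    "ATAC": re.compile(r"atac", re.I),
    "Multiome": re.compile(r"multiome", re.I),
    "FLEX": re.compile(r"\bflex\b|fixed rna", re.I),
    "Spatial": re.compile(r"visium|spatial", re.I),
    "CNV": re.compile(r"\bcnv\b|copy number", re.I),
}


def classify_series_matrix(lines):
    """Return (mentions_10x, sorted assay labels) for an iterable of lines."""
    mentions_10x = False
    hits = set()
    for line in lines:
        if not line.startswith("!"):  # skip the expression table, if any
            continue
        # Keep only the field name and its first value (first two columns).
        text = "\t".join(line.rstrip("\n").split("\t")[:2])
        if TENX.search(text):
            mentions_10x = True
        for assay, pattern in ASSAYS.items():
            if pattern.search(text):
                hits.add(assay)
    return mentions_10x, sorted(hits)
```

Series matrix files are distributed gzipped, so you would typically pass in gzip.open(path, "rt") as the lines argument.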
