Seeking Advice on Handling Multiple Datasets for Differential Analysis in Transcriptomics
1 day ago
Riley J • 0

Hello everyone,

I am currently studying a research paper in which the author integrated datasets from multiple public databases and their own samples. I located the data using the SRA numbers provided by the author, but I found that the raw data amounts to over 800 GB. The supplementary materials from the author state that all data were processed uniformly in the upstream analysis. As a self-learner, I don't have the resources to handle such a large volume of data; I only have access to my local computer.

Luckily, the author provided GEO accession numbers in the main text, and each GEO dataset includes expression profile data. However, a new challenge has emerged. The combined datasets are inconsistent: some provide raw counts, others provide TPM values, and some provide counts that have already been normalized. As I understand it, differential analysis requires raw counts. So I am wondering whether it is feasible to download from SRA only the datasets that lack raw counts and process them into count data myself. I am concerned that inconsistent processing methods might lead to discrepancies in the results, but this seems to be the best approach I can think of.
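To illustrate why TPM tables cannot simply stand in for raw counts, here is a minimal sketch with made-up numbers. The gene lengths, the counts, and the helper name `counts_to_tpm` are all hypothetical. It shows that TPM rescales every sample to the same total, so the sequencing depth that count-based differential expression methods rely on is no longer recoverable.

```python
def counts_to_tpm(counts, lengths_kb):
    """Convert one sample's raw counts to TPM, given gene lengths in kilobases."""
    # Reads per kilobase for each gene.
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    # Rescale so that the sample sums to one million.
    return [r / total * 1e6 for r in rates]


# Hypothetical toy data: three genes, one sample sequenced at two depths.
lengths_kb = [1.0, 2.0, 4.0]
shallow_counts = [10, 40, 80]
deep_counts = [100, 400, 800]  # same composition, 10x more reads

tpm_shallow = counts_to_tpm(shallow_counts, lengths_kb)
tpm_deep = counts_to_tpm(deep_counts, lengths_kb)

# Both TPM vectors are the same even though the raw counts differ tenfold,
# so the original counts cannot be reconstructed from TPM alone.
print(tpm_shallow)
print(tpm_deep)
```

This is why the datasets that only offer TPM or pre-normalized values are the ones that would need to be re-quantified from the raw reads.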

I would greatly appreciate it if any experts could advise me on whether this method is reasonable. Alternatively, if anyone has better suggestions or alternative approaches, please share them with me. I am eager to learn and improve.

Thank you in advance for your help :)

Data-Integration Differential-Analysis Transcriptomics Data-Normalization • 192 views
1 day ago

What do you mean by "non-raw" data? You're correct in that grabbing arbitrary values on different scales and trying to mash them together is unlikely to go well.

While the raw data may be large, scRNA counts tables are generally more manageable. You could process the raw data in batches to get processed counts appropriate for integration downstream.
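A rough sketch of the batching idea, so that only a few runs occupy the disk at any time. Everything here is an assumption for illustration: the run accessions are placeholders, and `quantify` and `cleanup` are hypothetical callables that you would replace with your own download, quantification, and file-deletion steps.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(runs, batch_size, quantify, cleanup):
    """Quantify runs batch by batch, deleting raw files after each batch.

    `quantify(run)` should return the per-gene counts for one run;
    `cleanup(run)` should remove that run's raw files from disk.
    Only the small counts tables are kept in the returned dict.
    """
    counts = {}
    for batch in batches(runs, batch_size):
        for run in batch:
            counts[run] = quantify(run)
        for run in batch:
            cleanup(run)
    return counts


# Usage with stand-in functions (no real download or quantification happens).
runs = ["SRR0000001", "SRR0000002", "SRR0000003"]  # placeholder accessions
removed = []
result = process_in_batches(
    runs,
    batch_size=2,
    quantify=lambda run: {"geneA": 0, "geneB": 0},  # dummy counts
    cleanup=removed.append,
)
print(sorted(result))
```

The point of the design is that peak disk usage is bounded by the batch size rather than by the full 800 GB, while every run still goes through the same pipeline.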


Thank you for your reply and suggestion. By "non-raw" data I mean expression values that are not raw counts. For example, some expression profiles downloaded from GEO are given as TPM values.

I have reviewed previous posts and now understand that the author conducted a transcriptome meta-analysis. Therefore, I have decided to follow your advice to process the raw data in batches and normalize each dataset individually before conducting the meta-analysis.
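As a note on what the combining step could look like: the thread does not say which meta-analysis method the paper used, so the following is only one common option, a fixed-effect inverse-variance combination of per-dataset log fold changes for a single gene. The function name and the example numbers are hypothetical.

```python
import math


def fixed_effect_meta(effects, std_errors):
    """Combine per-dataset effect sizes with inverse-variance weights.

    Returns the pooled effect, its standard error, the z statistic,
    and a two-sided p-value under a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total
    pooled_se = math.sqrt(1.0 / total)
    z = pooled / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return pooled, pooled_se, z, p_value


# Hypothetical log2 fold changes for one gene from two datasets,
# each analysed separately, with their standard errors.
pooled, pooled_se, z, p_value = fixed_effect_meta([1.0, 2.0], [1.0, 1.0])
print(pooled, pooled_se, z, p_value)
```

Datasets with smaller standard errors get more weight, which is what allows each dataset to be normalized and tested on its own scale before the results are combined.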

I haven't yet learned about transcriptome meta-analysis, so I am currently searching for relevant materials and resources. Your reply has given me the confidence to proceed with my analysis. Thank you again for your help!
