Hi there, I am new to this website and was hoping someone could answer some questions for me. I have two datasets. One is a control dataset and the other is a dataset where patients have a condition. Both are taken from human lungs; however, they come from two entirely separate experiments. I have applied batch correction within each dataset to account for technical differences between the samples. I now want to do a differential expression analysis and also run the cellassign cell typing algorithm, so here are my questions:
1) Should I just use the original raw count matrices from both experiments for differential expression analysis? I have read in a paper that it is best to do this so that you have a more conservative test.
2) Do I need to use batch correction/integration to integrate the two datasets (control/condition) before differential expression analysis between the experiments? For instance, if I wanted to test the expression of a particular gene between the control and condition datasets... do I need to have run integration beforehand to integrate the two datasets?
3) If I run cellassign for cell typing the dataset, should I run the algorithm for the two datasets separately or together? Could I just run it for all of the individual samples from each dataset separately?
I am finding it difficult to sift through all of the papers to find best practices. Any help is appreciated, and thanks for your time!
Just to clarify, is this bulk or single-cell RNA-seq?
Sorry! Should have specified... it is single cell
Integration itself is not meant to facilitate DE testing; it is rather a way of creating a unified clustering landscape, see http://bioconductor.org/books/release/OSCA/integrating-datasets.html#using-corrected-values Chapter 13 of this book nicely describes the integration idea.
Can you describe the experiment a bit more, especially how the two datasets differ (from a technical standpoint) and which groups of cells you want to compare? Do you want to compare between datasets or within datasets, and are both datasets the same (like biological replicates) or is each dataset an experimental condition?
One is a dataset of lung samples from healthy patients and the other is a dataset of lung samples from patients with a condition (completely separate experiments). I believe they both used drop-seq technology. The batch effect between the different samples from the healthy patients was quite pronounced (not sure why) when I created a UMAP plot and colored it by sample ID. I performed integration on each dataset separately (to integrate the samples within each dataset/experiment), which I have read is synonymous with batch effect correction.
I would like to understand:
1) the difference in cell type proportions between the healthy and condition groups
2) the difference in gene expression between the healthy and condition groups
3) the difference in gene expression within specific cell types between the healthy and condition groups
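For the first goal, a minimal sketch of how per-sample cell type proportions could be tabulated once every cell has a label (for example from cellassign). The sample IDs, group names, cell type labels and the `cell_type_proportions` helper are all hypothetical and only there for illustration. Computing the fractions per sample, rather than pooling all cells in a group, keeps the patient as the unit that is later compared between groups.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell annotations: (sample_id, group, cell_type).
# In practice these would come from the cell typing step.
cells = [
    ("H1", "healthy", "AT2"),
    ("H1", "healthy", "AT2"),
    ("H1", "healthy", "macrophage"),
    ("H1", "healthy", "T cell"),
    ("C1", "condition", "AT2"),
    ("C1", "condition", "macrophage"),
    ("C1", "condition", "macrophage"),
    ("C1", "condition", "macrophage"),
]


def cell_type_proportions(cells):
    """Return {sample_id: {cell_type: fraction of that sample's cells}}."""
    counts = defaultdict(Counter)
    for sample_id, _group, cell_type in cells:
        counts[sample_id][cell_type] += 1

    proportions = {}
    for sample_id, type_counts in counts.items():
        total = sum(type_counts.values())
        proportions[sample_id] = {
            cell_type: n / total for cell_type, n in type_counts.items()
        }
    return proportions


proportions = cell_type_proportions(cells)
for sample_id, fractions in proportions.items():
    print(sample_id, fractions)
```

The resulting per-sample fractions can then be compared between the healthy and condition groups with whatever test suits the number of patients available.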
So you're saying above that the original counts should be used for DE testing and the batch-corrected values should be used for visualization/clustering?
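To make that split concrete, here is a rough sketch of one way the original counts could be prepared for DE testing within a cell type: summing the raw counts of all cells that share a sample and a cell type label into one profile per sample (often called pseudobulk). This approach is an assumption on my part and is not something stated earlier in the thread; the gene names, sample IDs, count values and the `pseudobulk` helper are made up for illustration. The batch-corrected values are not used here at all.

```python
from collections import defaultdict

# Illustrative genes and raw (uncorrected) counts; all values are made up.
genes = ["SFTPC", "CD68"]
cells = [
    # (sample_id, cell_type, raw counts in the order of `genes`)
    ("H1", "AT2", [10, 0]),
    ("H1", "AT2", [7, 1]),
    ("H1", "macrophage", [0, 5]),
    ("C1", "AT2", [3, 0]),
    ("C1", "macrophage", [1, 9]),
    ("C1", "macrophage", [0, 6]),
]


def pseudobulk(cells, n_genes):
    """Sum raw counts over all cells sharing a (sample_id, cell_type) pair."""
    sums = defaultdict(lambda: [0] * n_genes)
    for sample_id, cell_type, counts in cells:
        profile = sums[(sample_id, cell_type)]
        for i, count in enumerate(counts):
            profile[i] += count
    return dict(sums)


profiles = pseudobulk(cells, len(genes))
for (sample_id, cell_type), profile in sorted(profiles.items()):
    print(sample_id, cell_type, dict(zip(genes, profile)))
```

Each summed profile can then be treated like one bulk sample for that cell type, so the DE test compares patients between the healthy and condition groups rather than individual cells.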