For some background: we have a mouse model in which we can induce B cell lymphomas, and we have done RNAseq on a number of these tumors and found that they are highly heterogeneous (which matches what is seen in humans). We are interested in doing more experimental studies to identify genes that may be important for the malignant cells, but we don't know which cell lines best model which tumor subtype. My idea was to take RNAseq data from cell lines derived from B cell lymphomas and plot them on a PCA alongside our tumors to see which subtype they are most similar to.
Therefore I pulled RNAseq data for a cell line that was derived from a mouse B cell lymphoma (induced with the same method we use). However, I'm not sure whether, in trying to normalize for the inherent differences between cell lines and tissue (as well as differences in sample prep and sequencing methods), I am over-normalizing the data. If so, is there any way to do this properly? Would pulling a healthy control from their study help?
Also just for reference, to get the counts I pulled the original fastq files from the study I found, and ran them through the same STAR/featureCounts commands I ran my samples through.
Here are the steps I took using DESeq2. The cell line name is "S11E", and I just added it to the front of the rest of my samples.
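library(DESeq2)
library(dplyr)
library(tibble)   # column_to_rownames()
library(ggplot2)  # geom_label()
# Use gene IDs as row names and keep the samples in a fixed order (S11E first)
data <- data %>% column_to_rownames(var = "Geneid") %>%
  dplyr::select(c(S11E, Q01, Q12, T02, BA05, BA07, BA09, Q09, Q13, AAI02, AB05, Q16, Q08, Q18, AAI04, AAM01, AAV01, AG07, AI02, BH11,
                  AAH01, AG01, AG03, AK01, AAE01, AG05, AH01, AY01, AAB03, AAK01, AA02, AC05, AH02, AX02, B02, C05, C09, F04, F08, Q17))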
# Disease status per sample, in the same order as the columns selected above
condition <- c("Lymphoma","Healthy","Healthy","Healthy","Healthy","Healthy","Healthy","Healthy","Healthy",
               "LPD","LPD","LPD","LPD","LPD",
               "Lymphoma","Lymphoma","Lymphoma","Lymphoma","Lymphoma","Lymphoma",
               "Lymphoma","Lymphoma","Lymphoma","Lymphoma",
               "Lymphoma","Lymphoma","Lymphoma","Lymphoma",
               "Lymphoma","Lymphoma","Lymphoma","LPD","LPD","Lymphoma","Lymphoma",
               "LPD","Lymphoma","Healthy","LPD","Lymphoma")
# Sample origin: only S11E is a cell line, everything else is tissue
Origin <- c("Cell_Line","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue",
            "Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue",
            "Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue",
            "Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue","Tissue",
            "Tissue","Tissue","Tissue")
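# One row of sample metadata per column of the count matrix
coldata <- data.frame(row.names = colnames(data), Origin, condition)
dds <- DESeqDataSetFromMatrix(countData = data, colData = coldata, design = ~ Origin + condition)
dds <- DESeq(dds)
# Variance-stabilize, then regress out Origin (cell line vs. tissue) for plotting only
vsdata <- vst(dds, blind = FALSE)
assay(vsdata) <- limma::removeBatchEffect(assay(vsdata), batch = vsdata$Origin)
# alpha is a fixed setting rather than a mapped variable, so it goes outside aes()
plotPCA(vsdata, intgroup = "condition", pcsToUse = c(1,2)) +
  geom_label(aes(label = name), alpha = 0.5)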
Here are the resulting PCA plots before and after running the removeBatchEffect command:
The biggest tip-off/concern I have is that the S11E sample just gets moved to (0,0) on the plot. Any help is greatly appreciated!
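A likely reason for the (0,0) behaviour is that "Cell_Line" is a batch with a single member. Any linear batch adjustment then fits that sample exactly with its own batch coefficient, and subtracting the coefficient leaves the sample at the centroid of everything else. The sketch below reproduces this on simulated data in base R. It is not limma's actual code, only the same kind of intercept-plus-batch model with sum-to-zero coding, which is how removeBatchEffect works as far as I understand it. All names and values (`CellLine1`, `Tissue1` to `Tissue10`, the means and standard deviations) are made up for illustration.

```r
# Toy demonstration on simulated data (not the poster's counts): a batch that
# contains one sample is absorbed entirely by its own batch coefficient.
set.seed(1)
n_genes  <- 200
n_tissue <- 10

# Simulated log-scale expression: genes in rows, samples in columns.
tissue    <- matrix(rnorm(n_genes * n_tissue, mean = 8, sd = 1), nrow = n_genes)
cell_line <- rnorm(n_genes, mean = 10, sd = 2)   # deliberately very different
expr <- cbind(cell_line, tissue)
colnames(expr) <- c("CellLine1", paste0("Tissue", seq_len(n_tissue)))

origin <- factor(c("Cell_Line", rep("Tissue", n_tissue)))

# Fit intercept + batch term per gene (sum-to-zero coding), then subtract
# only the fitted batch term from the data.
contrasts(origin) <- contr.sum(levels(origin))
X    <- model.matrix(~ origin)                       # columns: intercept, batch
beta <- solve(crossprod(X), crossprod(X, t(expr)))   # 2 x n_genes coefficients
adjusted <- expr - t(X[, 2, drop = FALSE] %*% beta[2, , drop = FALSE])

# After adjustment the lone cell line equals the mean of the adjusted tissues.
tissue_centroid <- rowMeans(adjusted[, -1])
max_gap <- max(abs(adjusted[, "CellLine1"] - tissue_centroid))
print(max_gap)   # zero up to floating-point error

# So it sits at the origin of a PCA computed on the adjusted matrix.
pca <- prcomp(t(adjusted), center = TRUE)
print(round(pca$x["CellLine1", 1:2], 6))
```

In other words, with one cell line sample the adjustment cannot separate "cell line vs. tissue" from "what makes S11E biologically distinct", so it removes both. Its position after correction says nothing about which subtype it resembles.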
I would not put a cell line and primary cells into the same analysis, regardless of the normalization and integration method. Much (or most) of a cell line's transcriptome reflects adaptation to culture: the nutrients, the oxygen level, the absence of stromal interactions, all of that. Cell lines are very different from primary cells. In terms of analysis, why not curate some sort of canonical signature for your entity and then check, across several candidate cell lines, which one resembles it most? This is not necessarily robust, because cell lines, as I understand it, come from all sorts of datasets and batches, but it's what I would try. I remember hearing talks from lymphoma people on DLBCL, and they always had something like 20 cell lines in their experiments that behaved grossly differently despite being the same cancer, but at least they shared the core transcriptomic and proteomic features of the disease, so this is what I would go after. Not whole-transcriptome similarity; the cell lines are too different for that.
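As a rough sketch of the signature idea, one option is to score each candidate cell line by how highly the signature genes rank within that sample, so that absolute expression values are never compared across studies. Everything below is an assumption for illustration: the expression matrix is simulated, the 20-gene signature is made up, and `signature_score` is a hypothetical helper, not a function from any package. A real signature would come from your own tumor subtypes, for example from the DESeq2 results.

```r
# Hypothetical inputs: simulated expression for 5 candidate cell lines and a
# made-up 20-gene subtype signature. Replace both with real data.
set.seed(42)
genes      <- paste0("Gene", 1:1000)
cell_lines <- paste0("Line", LETTERS[1:5])
expr <- matrix(rnorm(1000 * 5, mean = 8, sd = 2), nrow = 1000,
               dimnames = list(genes, cell_lines))

signature_up <- genes[1:20]   # genes expected to be high in the subtype
# Give LineC the signature so the sketch has a known answer.
expr[signature_up, "LineC"] <- expr[signature_up, "LineC"] + 6

# Within-sample rank score: turn each sample into percentile ranks, then
# average the percentile ranks of the signature genes.
signature_score <- function(mat, up_genes) {
  up_genes <- intersect(up_genes, rownames(mat))
  pct_rank <- apply(mat, 2, function(x) rank(x) / length(x))
  dimnames(pct_rank) <- dimnames(mat)
  colMeans(pct_rank[up_genes, , drop = FALSE])
}

# Higher score = signature genes sit nearer the top of that sample's ranking.
scores <- sort(signature_score(expr, signature_up), decreasing = TRUE)
print(round(scores, 3))
best_line <- names(scores)[1]
print(best_line)
```

Ranking within each sample sidesteps most library-prep and sequencing-depth differences, but it does not remove culture-induced expression changes, so the signature should avoid genes known to respond to culture conditions.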