Hi all,
I have two matrices of RNA-seq UMI counts, sequenced at two different times (same person, same lab, but different dates).
- Matrix-1 contains samples of: primary breast cancer, metastases (from several different tissues), and normal breast tissue.
- Matrix-2 contains samples of normal tissue from the site of one of the metastases (only one of the metastatic sites).
For now I only have to work on the primary and metastasis samples. However, in the future I might need to do some analyses using the normal samples too.
So first I have to perform normalization, which I am going to do with the vst() function from the R package DESeq2. My question is: should I normalize Matrix-1 and Matrix-2 separately? Or merge the two matrices and normalize everything together? Or drop the normal breast samples from Matrix-1 and normalize the matrix with only the primary and metastasis samples?
I have to work mostly on primary and metastasis, but to avoid going back to the raw counts and normalizing the data again, I wanted to normalize the normal samples too and keep them in a folder.
Thanks for any help/suggestion!
I think it really depends on the questions you want to answer. If you really only want to know gene expression differences between Primary Breast Cancer Tissue and matching Metastatic Cancers from other tissues, you would need only the first two sample groups of your first matrix, so you would simply normalize those two conditions. To be honest, I don't see how this is useful, but it's not my experiment. On the other hand, if your question is: what genes are differentially expressed between Primary and Metastatic cancer? ...then you would put all the data together and look at genes responding differently between Primary Breast Cancer and Normal Breast, and between Metastatic Cancer and Normal Tissue. For this you would need all the data in a single matrix. Having all the data together in a single matrix allows you to capture all the variability of your genes in the space about which you want to ask specific questions.
Either way, "to avoid going back to the raw counts and normalizing the data again" is never a reason to choose a normalization strategy. You will likely analyze the data many times, and the tough work of normalization is simply a function call.
The answer really depends on the questions you want to ask.
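To make the two options above concrete, here is a minimal sketch (Python, standard library only) of selecting samples from the raw counts before normalizing. The gene names, sample names, condition labels and the `subset_counts` helper are all hypothetical, and the normalization itself would still be run on the resulting raw counts, for example with vst() from DESeq2 as mentioned in the question.

```python
# Hypothetical raw UMI counts: gene -> {sample: count}. All names are illustrative.
counts = {
    "GENE_A": {"primary_1": 12, "met_liver_1": 30, "normal_breast_1": 5},
    "GENE_B": {"primary_1": 0, "met_liver_1": 7, "normal_breast_1": 9},
}

# Hypothetical sample annotation: sample -> condition.
conditions = {
    "primary_1": "primary",
    "met_liver_1": "metastasis",
    "normal_breast_1": "normal",
}


def subset_counts(counts, conditions, keep):
    """Return the raw counts restricted to samples whose condition is in `keep`."""
    samples = [s for s, cond in conditions.items() if cond in keep]
    return {gene: {s: row[s] for s in samples} for gene, row in counts.items()}


# Option 1: only primary and metastasis samples go into normalization.
tumor_only = subset_counts(counts, conditions, {"primary", "metastasis"})

# Option 2: every sample goes into normalization together.
all_samples = subset_counts(counts, conditions, {"primary", "metastasis", "normal"})
```

Because both subsets are built from the same raw counts, switching between them later only means re-running the selection and the normalization call.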
Hi Seidel, thank you for your reply!
Let's say my concern was more about having only one expression matrix to work on, so that I can compare the results of several analyses. The expression matrix with only normal samples contains about 5k more genes than the matrix with tumor samples. So, in order to work on the same list of genes, I was considering working on a single matrix. So what you would suggest is to:
Thanks a lot for the help!
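On the gene-list mismatch mentioned above (one matrix having about 5k more genes than the other), here is a minimal sketch in Python of two ways to combine raw count matrices whose gene sets differ. The `merge_counts` helper and the example data are hypothetical. Filling missing genes with 0 is an assumption that only holds if a gene is absent because it had no counts in that matrix; if genes were dropped for another reason, keeping only the shared genes is the safer choice.

```python
def merge_counts(m1, m2, how="inner"):
    """Merge two gene -> {sample: count} matrices with distinct sample names.

    how="inner": keep only genes present in both matrices.
    how="outer": keep every gene and fill missing entries with 0
                 (assumes a missing gene means zero counts).
    """
    if how not in ("inner", "outer"):
        raise ValueError("how must be 'inner' or 'outer'")
    genes1, genes2 = set(m1), set(m2)
    genes = genes1 & genes2 if how == "inner" else genes1 | genes2
    samples1 = sorted({s for row in m1.values() for s in row})
    samples2 = sorted({s for row in m2.values() for s in row})
    merged = {}
    for gene in sorted(genes):
        row = {s: m1.get(gene, {}).get(s, 0) for s in samples1}
        row.update({s: m2.get(gene, {}).get(s, 0) for s in samples2})
        merged[gene] = row
    return merged


# Hypothetical example data; GENE_C exists only in the second matrix.
matrix_1 = {
    "GENE_A": {"primary_1": 12, "met_1": 30},
    "GENE_B": {"primary_1": 0, "met_1": 7},
}
matrix_2 = {
    "GENE_A": {"normal_site_1": 4},
    "GENE_C": {"normal_site_1": 9},
}

shared = merge_counts(matrix_1, matrix_2, how="inner")
everything = merge_counts(matrix_1, matrix_2, how="outer")
```

Either way the merge is done on raw counts, so the merged matrix can be normalized as a whole afterwards.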