Hi,
I have RNA-Seq data from a set of cancer samples, which I analyzed to generate RPKM and TPM values (TPM calculated from RPKM) for all genes. I also downloaded a large set of public cancer samples (multiple cancer types) to use as comparison data, and generated RPKM and TPM for them in the same way.
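For reference, converting RPKM to TPM within one sample amounts to rescaling the RPKM values so that they sum to one million. A minimal sketch of that step, assuming one list of RPKM values per sample (the function name and the three example values are made up for illustration):

```python
def rpkm_to_tpm(rpkm):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm)
    if total <= 0:
        raise ValueError("sample has no expression signal")
    return [value / total * 1e6 for value in rpkm]

# Hypothetical RPKM values for three genes in a single sample.
sample_rpkm = [10.0, 30.0, 60.0]
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

Because the rescaling is done per sample, every sample ends up with the same total, which is the property that makes TPM easier to compare across samples than RPKM.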
What I want to do is find the public sample that best matches each of my samples (in terms of cancer type) by comparing RPKM/TPM profiles across a set of N genes of interest (or genes known to be expressed in each cancer type). I read that using RPKM is a bad idea, so I switched to TPM for this task.
For each of my samples, I take the TPM distribution of the N genes, compare it with the TPM distribution of each public sample, and get a p-value (e.g., from a KS test). But this does not seem to be the best idea.
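One likely reason the KS approach feels unsatisfying: a two-sample KS test compares the overall distributions of values and ignores which gene each value belongs to. The sketch below computes only the KS statistic with NumPy (no p-value; `scipy.stats.ks_2samp` would give one), and the TPM values are invented for illustration:

```python
import numpy as np

def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples of values."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Hypothetical TPMs for four genes; the second sample has the same values
# assigned to different genes (the order is reversed).
tpm_mine = np.array([5.0, 50.0, 500.0, 5000.0])
tpm_public = tpm_mine[::-1]
d = ks_statistic(tpm_mine, tpm_public)
```

Here `d` is 0 even though every gene has a different expression level in the two samples, so a per-gene comparison is probably closer to what is wanted.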
Can anyone point me to other tests that could be useful in my case? Or could one, for example, build a model of the TPM values of all cancer samples of the same type and then compare my samples against each of these models?
Thanks.
Instead of using a KS or other test to find the closest sample, have you tried just computing the correlation and taking the sample with the highest one?
Thank you Devon. I thought correlation was too simple, but I will give it a try. Do you suggest doing any kind of normalization or batch-effect correction first, or performing the correlation directly on the TPMs?
The whole thing is one giant batch effect, so you have no hope of getting around that without using something like RUVSeq (assuming you can find some genes expected not to be DE in cancer). Personally, I'd start with a rank-based correlation of the TPM values and see if anything is particularly close. This is sort of a worst-case scenario with human samples and cancer; I'm surprised whoever collected the patient samples didn't try to get neighboring tissue.
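A minimal sketch of the rank-based correlation idea, assuming each sample is a vector of TPMs over the same N genes in the same order. The sample names and values are hypothetical, and Spearman's rho is computed here as the Pearson correlation of average ranks (`scipy.stats.spearmanr` would be the usual shortcut):

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(1, values.size + 1)
    # Average the ranks within each group of tied values.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return sums[inverse] / counts[inverse]

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def closest_sample(query, references):
    """Return the reference sample with the highest rank correlation."""
    scores = {name: spearman(query, tpm) for name, tpm in references.items()}
    return max(scores, key=scores.get), scores

# Hypothetical TPMs for five genes of interest.
query_tpm = [120.0, 3.0, 45.0, 0.5, 800.0]
public_tpm = {
    "public_A": [200.0, 10.0, 90.0, 1.0, 1500.0],
    "public_B": [1.0, 900.0, 20.0, 300.0, 5.0],
}
best, scores = closest_sample(query_tpm, public_tpm)
```

Ranks are unaffected by any monotone rescaling of a sample, which is why this is a reasonable first pass on TPMs that have not been batch corrected.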
As an aside, you could theoretically try to do signal separation of the (presumably heterogeneous) cancer samples and use the results as matching pairs (you may recall that Katarzyna presented a journal club paper related to that). I've always been highly skeptical of those methods applied to RNAseq data, but perhaps the SVM-based ones are OK at that (as opposed to the ICA-based methods).
FYI, the correlation did not yield the expected results. On the other hand, I tried RUVSeq and it seems to reduce the variability between samples of the same cancer type quite nicely. I'm now trying to use PCA to see where my sample would fall in the PCA plot of each cancer type (or in a plot where all cancer types are PCA'd together).
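One way to do the PCA placement is to fit the principal axes on the public samples only and then project the new sample onto those axes, so the new sample does not influence the axes itself. The sketch below is only an illustration under that assumption: the data are simulated log-scale values rather than real RUVSeq output, and the nearest-centroid step at the end is an added convenience, not something discussed above.

```python
import numpy as np

def fit_pca(reference, n_components=2):
    """Fit PCA on a samples x genes matrix; return gene means and loadings."""
    mean = reference.mean(axis=0)
    # Rows of vt are the principal axes, ordered by explained variance.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(samples, mean, components):
    """Project samples into the PC space defined by the reference data."""
    return (np.atleast_2d(samples) - mean) @ components.T

# Hypothetical log-scale expression: 2 cancer types, 6 samples x 20 genes each.
rng = np.random.default_rng(0)
groups = {
    "type_A": rng.normal(2.0, 0.3, size=(6, 20)),
    "type_B": rng.normal(6.0, 0.3, size=(6, 20)),
}
reference = np.vstack(list(groups.values()))
mean, components = fit_pca(reference)

# Centroid of each cancer type in PC space.
centroids = {name: project(mat, mean, components).mean(axis=0)
             for name, mat in groups.items()}

# A new sample simulated to resemble type_B, placed on the reference axes.
my_sample = rng.normal(6.0, 0.3, size=20)
my_coords = project(my_sample, mean, components)[0]
nearest = min(centroids, key=lambda n: np.linalg.norm(my_coords - centroids[n]))
```

With real data, the rows of `reference` would be the batch-corrected public samples restricted to the genes of interest, and `my_coords` could be drawn on top of the reference PCA plot.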
I remember Katarzyna's talk, but I don't remember in detail how she addressed this problem.
I'd have to look up the paper, but it was using some sort of SVM-based method to do signal separation. You might find some relevant methods if you google "SVM signal separation RNAseq".