Hi Biostars Community,
I have searched the forum, but couldn't find a perfect answer for this.
I have downloaded some RNA-seq data from GEO (GSE112656). They are basically 11 osteoarthritis samples and 11 rheumatoid arthritis (so 22 in total).
I also have 4 samples of RNA-seq data generated from my lab with three technical replicates (so 12 total).
- Drug A - treated
- Drug A - control (untreated)
- Drug B - treated
- Drug B - control (untreated)
I want to use the GEO (GSE112656) and also my lab data to conduct PCA to see which ones are similar and cluster together and which ones are different.
How would I normalize and transform these datasets before following these directions:
https://bioconductor.org/packages/devel/bioc/vignettes/PCAtools/inst/doc/PCAtools.html
I was considering following this post: Which counts to use for RNA-seq heatmap and PCA?
Basically, I am planning to combine all the samples (22 +12 = 32 samples) into one large data frame and then generate log2 transformed TMM followed by throwing that table into a PCA function to visualize the PC_1 by PC_2 table, or maybe PC_1 by PC_2 by PC_3 to seeing how they cluster/how similar or different they are to one another. I am planning TMM because HBC recommends this to compare between samples and within samples. Would all this be best practices?
Please, if you can suggest something better, I would appreciate it.
Thank you Kevin
You always swoop in to save the day : )
I will try this. I just tried the whole procedure using:
design = ~1
. Result were ehh, but that could just be what they are lol... Now I think i will try to get "fancy" with the design and use actual ones for each of the datasets. They aren't too fancy... one dataset is either OA or RA sodesign = ~ cell
, and the drug A and drug B datasets are similar todesign = ~dex
from the DESeq2 manual.And then finally I will try with edgeR just log2 CPM+1 each dataset individually,
cbind
(column-bind), and then slap that table throughPCAtools
.Thank you again. I strive to be as compassionate as you, Kevin, (you've helped so many people on here and other places, and you continue to) : )
I tried to be explicit in everything here so this post would serve as a good resource - please don't think I think you don't know
cbind
,dex
, etc...If one dataset is RA while the other is OA (I have worked and published in both areas), then it is difficult (even impossible) to remove any effect of
batch
due to the fact thatbatch
is confounded withdisease
(OA or RA). If this is the case, I would process each data independently to the VST stage, and then transform these datasets independently to Z-scores. Then, at least, the datasets are comparable.