I have 67 clinical tissue samples that were sequenced for RNA expression. They fall into five main groups. Three of the groups are linked: they are matched triplets of samples taken from different locations in the same people. The other two groups come from two sets of individuals with slightly different clinical information.
To begin a differential splicing analysis, I've been following the DEXSeq vignette: https://bioconductor.org/packages/release/bioc/vignettes/DEXSeq/inst/doc/DEXSeq.html. Before going further, I want to check whether I can treat all samples in the two "normal" groups as a single baseline. My plan is to assess their within-group variability with something like PCA, and to remove any outliers that show up.
However, there seems to be no agreed way of running a PCA on exon count data: whether to use the raw exon feature counts or to scale and centre those that are expressed, whether to log-transform, or whether to normalise by estimated library size. I'm a little confused as to what to do.
Would an approach like that in section 8 of this document: https://www.huber.embl.de/users/klaus/Teaching/DESeq2Predoc2014.html#pca-and-sample-heatmaps, developed for DESeq2 (by the same authors), be a good way forward?
Alternatively, what would be a good way of determining whether the two "normal" groups are similar enough to pool as a baseline?
Yes. PCA is commonly performed on a set of informative genes (e.g. the most variable ones, ranked with rowVars), using the normalized and log2-transformed counts. The row variances should of course be computed on those same normalized, log-transformed counts. Nothing special about exon counts, if you ask me.
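To make those steps concrete, here is a minimal sketch in Python with NumPy. It is not the DEXSeq or DESeq2 implementation: the function names `size_factors` and `pca_top_variable`, the default of 500 features, and the `+ 1` pseudocount are all assumptions chosen for illustration. The normalization is a plain median-of-ratios size factor estimate, and in R you would more likely use the packages' own normalization and a transformation such as rlog or vst instead.

```python
import numpy as np


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column).

    Assumes at least one feature has a nonzero count in every sample;
    otherwise the result is NaN.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep only features observed in all samples so the logs are finite.
    expressed = np.all(counts > 0, axis=1)
    log_counts = np.log(counts[expressed])
    # Per-feature log geometric mean across samples acts as the reference.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


def pca_top_variable(counts, n_top=500, n_components=2):
    """PCA of samples on the n_top most variable features.

    counts: features x samples matrix of raw counts (exons or genes).
    Returns (scores, explained): sample coordinates on the leading
    components and the fraction of variance each one explains.
    """
    counts = np.asarray(counts, dtype=float)
    normalized = counts / size_factors(counts)  # divide each column by its factor
    logged = np.log2(normalized + 1.0)          # pseudocount of 1 is an assumption

    # Rank features by variance of the normalized, log-transformed values.
    row_var = logged.var(axis=1, ddof=1)
    top = np.argsort(row_var)[::-1][:n_top]

    # Samples as rows, centred per feature, then PCA via SVD.
    x = logged[top].T
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]
```

Calling `pca_top_variable(counts)` on a features-by-samples count matrix gives one row of coordinates per sample. Plotting the first two columns, coloured by group, is one way to see whether the two "normal" groups overlap and whether any sample sits far from the rest.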