PCA on RNAseq time series data
3
1
Entering edit mode
5.3 years ago

Hi All, I am investigating my RNAseq data for batch effects and then attempting to remove them. The data file I am trying to use contains raw counts of gene expression values every two days for 2 days (~55,000 rows and 24 columns each representing a different time-point). The RNAseq experiment was done by pooling 3 timepoints per sample with a total of 8 samples. I'm pretty sure the first thing I have to do is a PCA test to look for batch effects and then use something like Limma to remove batch effects. I have looked at tutorials for PCA analysis but am completely lost since I have very limited experience in R. I am taking a R course in a month, but I wanted to try to do this before then. I'm wondering if anyone knows of a simple way I can go about doing this and get around my lack of R knowledge (a shiny app or the like), it would be greatly appreciated. Thanks in advance.

RNA-Seq gene PCA • 2.5k views
ADD COMMENT
0
Entering edit mode

Read http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html for inspiration. There is example code for PCA from gene expression data.

ADD REPLY
0
Entering edit mode
5.3 years ago
alserg ▴ 1000

You can do a PCA plot in our tool Phantasus: https://genome.ifmo.ru/phantasus/ or https://artyomovlab.wustl.edu/phantasus/

Open your dataset (xlsx or txt file) with "Open your own file menu/My computer". Do Tool/Ajust, check "One plus log2" and "quantile normalization". Then Tools/Plots/PCA plot

ADD COMMENT
0
Entering edit mode
5.3 years ago
adam.faranda ▴ 110

In R, I typically use the 'prcomp' function to evaluate principle components in "sample by gene" expression matrices. The object returned by this function can be passed directly to plotting functions. I would recommend the "autoplot" function from the package "ggfortify". Depending on whether you want to plot genes or samples, you may need to transpose the matrix of expression values.

    pca<-prcomp(df, scale=T)
    p<-autoplot(pca, data = ft, colour=groupCol)
    p

In this case "df" is my matrix of expression values, and "ft" is sample metadata that I used to color the points.

ADD COMMENT
0
Entering edit mode

Don't you want to transpose your df given it is a standard expression matrix with samples as columns and rows as genes?

ADD REPLY
0
Entering edit mode

Yes, sorry. The code I posted was a snippet from a larger function -- in order to plot samples, the matrix should be transposed such that rows are samples and columns are genes.

    pca<-prcomp( t(df), scale=T)
    p<-autoplot(pca, data = ft, colour=groupCol)
    p
ADD REPLY
0
Entering edit mode
5.3 years ago

I'm pretty sure the first thing I have to do is a PCA test to look for batch effects and then use something like Limma to remove batch effects.

Note that if you are using software like DESeq or EdgeR to find differentially expressed genes, you do not remove batch effect, you just include batch as an element in the linear model design.

ADD COMMENT
0
Entering edit mode

Hi @swbarnes2 -- I'm working on a similar type of project as the OP; but I still have a lot to learn about mitigating batch effects. If you include the batch covariate as a factor in the model design matrix, is that sufficient? I'm also wondering how you might apply to cluster analysis. Is there a way to obtain "batch corrected" cpm values from edgeR that could then be clustered?

ADD REPLY
0
Entering edit mode

If you plan on using DESeq, read down to Michael Love's post

https://support.bioconductor.org/p/121408/

You can't get a better answer than that.

ADD REPLY

Login before adding your answer.

Traffic: 1848 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6