I have a time course study with 4 time points and 4 biological replicates per time point. My first aim was to find the genes that are differentially expressed (DE) over time.
Please bear in mind that I'm at a very basic level in analyzing RNA-seq data; that's why my questions may seem very simple to you.
I have read a few things about heatmaps. As far as I know, a heatmap is a good tool for exploring the data visually (since the data are high-dimensional) and for clustering genes. I also know that gene expression values need to be normalized to avoid systematic biases, and that there are several useful R packages for drawing heatmaps.
What I don't know is:
- Why do you cluster the genes in heatmaps at all?
- In my case, where I have the normalized read counts of a time course RNA-seq study with 4 biological replicates and 4 time points (4 conditions), how should I draw a heatmap? Should I draw a separate heatmap for each time point (condition), or should I plot all biological replicates and time points at once, which would mean 16 samples side by side?
- The number of DEGs is about 13,000. How should I decide how to cluster them? Should I draw heatmaps only for the DEGs, or for all of the genes?
- I have normalized the counts with methods such as TMM, median, quantile, or total-count normalization. Is it essential to use FPKM or RPKM values instead?
- I have also read that there is no single best distance measure for cluster analysis. Does that mean the choice of method is entirely up to me?
I have read some posts on Biostars that I can refer you to, but none of them helped me find the answers to my questions:
Heatmap Of Rna-Seq Data
Hierarchical Clustering And Heatmap Analysis For Rna-Seq Data
Clustering Data (Rna-Seq) Using R To Produce A Heatmap
and some other similar posts.
Thanks a lot.
****UPDATE****
I just read that I should merge the table of my read counts and the table of FDR-adjusted p-values for each gene into one table, and then draw heatmaps for the most significant genes.
Thank you very much. I'm trying to compare two models, so I haven't changed the default settings for either model; that's why they give me such a high number of DE genes. I want to figure out how closely these two models agree by default, by comparing their results on finding the index genes and counting the genes that overlap between them. But even the most significant genes come to just under 8k. Still too many, huh?
I'm going to plot the heatmap for the top 100 DEGs (ranked by FDR-adjusted p-value, with a 0.05 cutoff) separately for each model, roughly as sketched below.
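A minimal sketch of what I mean, assuming a matrix `norm_counts` of normalized counts (genes in rows, 16 samples in columns) and a results table `res` with a `padj` column of FDR-adjusted p-values (both names are just placeholders of mine):

```r
library(pheatmap)

# keep genes passing the FDR cutoff, then take the 100 smallest adjusted p-values
sig    <- res[!is.na(res$padj) & res$padj < 0.05, ]
top100 <- rownames(sig)[order(sig$padj)][1:100]

# heatmap of log-transformed counts for the top 100 DEGs,
# with all 16 samples (4 time points x 4 replicates) as columns
pheatmap(log2(norm_counts[top100, ] + 1))
```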
To practice, I plotted two heatmaps using two different commands to see what happens, though I still have problems interpreting these maps.
Yeah, 8k seems like a lot. I don't think it's out of the realm of possibility, but it's unlikely; I'm assuming most of them are only marginally DE. I'll typically do a default analysis (pooled-CR, I believe, in DESeq2), then filter by adjusted p-value and fold change, keeping genes with a fold change either > 2 or < 0.5.
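Something like this, roughly (a sketch with DESeq2, assuming `dds` is an already-fitted DESeqDataSet and using the cutoffs above):

```r
library(DESeq2)

res <- results(dds)  # default test results

# adjusted p-value < 0.05 and fold change > 2 or < 0.5
# (equivalent to |log2 fold change| > 1)
keep     <- !is.na(res$padj) & res$padj < 0.05 & abs(res$log2FoldChange) > 1
filtered <- res[keep, ]
nrow(filtered)  # how many DEGs survive the filter
```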
Make sure you're scaling the gene features like in the code example above. Otherwise it may be difficult to see DE in the heatmap, because some genes may be very highly or very lowly expressed and throw off the color scale. For example, say I have 2 DE genes. One has a baseline of 1000 and is increased to 2000; the other has a baseline of 10 and is increased to 20. If you don't scale, the first DE gene will be all red and the second all blue, because the effective color range for the heatmap is 10-2000.
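To make that concrete, here's a toy sketch (made-up numbers matching the 1000 -> 2000 and 10 -> 20 case; exact colors depend on your palette):

```r
library(pheatmap)

# two DE genes, both doubling, but on very different baselines
mat <- rbind(geneA = c(1000, 1000, 2000, 2000),
             geneB = c(10,   10,   20,   20))
colnames(mat) <- c("ctrl1", "ctrl2", "trt1", "trt2")

pheatmap(mat, scale = "none", cluster_cols = FALSE)  # geneA dominates the color range
pheatmap(mat, scale = "row",  cluster_cols = FALSE)  # both doublings become visible
```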
One of the models provides me with log2 fold changes and p-values, but the other one doesn't, so I don't know how to further filter the DEGs by log2 fold change for the second model. It only gives me the DEGs and their PPs (posterior probabilities).
And thanks for mentioning the code above. I took a look at the DESeq manual, and I will scale my normalized read count matrix. The info was really useful. If there is anything else I need to know, please let me know. Thanks!
Hm, that's a thinker. I'll have to search around and see how other people are filtering under a Bayesian framework. Do you have priors for your genes as well? Honestly, I'll have to admit this isn't exactly an area of my expertise.
I was once criticized for asking simple questions (I know I'm not knowledgeable enough), so I didn't dare use my former account because I don't want to be criticized again. I made a new account, but it seems I can't comment more than 5 times a day, so in order to answer your question I had to log in with my former account.
This is a Bayesian approach applied in EBSeq-HMM for time course studies. In their words: "Under a target FDR = 0.05, we call genes with PP(remain constant) < 0.05 as DE genes".
PP: the posterior probability of following each expression path
FDR: I'm not sure, but I guess it means the FDR-adjusted p-values.
Anyway, the output is a matrix with these columns: the DE genes' names, their most likely expression path (like Up-Up-Up), and PP.
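For reference, this is roughly how I ran it, following the EBSeqHMM vignette as best I understood it (variable names are mine, and I may be misremembering details of the interface):

```r
library(EBSeqHMM)

# counts: raw count matrix (genes x 16 samples)
# Conditions: ordered factor of time points, t1 < t2 < t3 < t4
Sizes <- MedianNorm(counts)             # library size factors (from EBSeq)
out   <- EBSeqHMMTest(Data = counts,
                      sizeFactors = Sizes,
                      Conditions = Conditions)

# genes whose PP(remain constant) falls below the target FDR of 0.05
DECalls <- GetDECalls(out, FDR = 0.05)  # gene name, most likely path, PP
head(DECalls)
```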
I'm not sure what you mean by priors. I just know that I also have the prior probabilities of each path.
I think the DEGs with the highest PPs are the most significant ones, because when you ask the model for the head genes, it gives you the ones with the highest PPs.
Sorry to hear about your bad experience. I've always liked Biostars because I think the community is much more reasonable and nice. Imagine asking a simple question on Stack Exchange; jesus, they would murder us all if they could. :)
I'll look around and see if I can't find any filtering criteria for EBSeq-HMM, as I'm always interested in learning new methods.
Yes, I like Biostars. People here are knowledgeable and kind. And you're right about Stack Exchange :)
Thank you again for your help. I appreciate it.
I couldn't find the thread in which you say you were criticized for asking simple questions (which, in my opinion, are not bad questions). You shouldn't be criticized for those; there is no shame in asking simple questions, provided you show that you have put some effort into the topic yourself.
Here on this thread: C: Sending files to Galaxy
Maybe it wasn't really criticism and I was just being sensitive. I will use my real account again.
I think you misinterpreted that comment; I see it as a helpful suggestion.
After scaling the normalized counts, my heatmap has changed significantly.
Yeah, it's just trial and error. I ran into the exact same issue when I started with this type of analysis. Happy to help.
Sorry, I just noticed that if I transpose my matrix of read counts and then scale it, the result differs from when I don't transpose the matrix first. Why did you transpose the matrix? Update: I found the answer: you can scale the matrix either by row or by column. The "pheatmap" package in R can both scale the matrix and draw the heatmap.
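For anyone who finds this later, a quick illustration with a toy matrix (the names are my own):

```r
library(pheatmap)

# scale() standardizes columns, so scaling genes (rows) needs a transpose
mat <- matrix(rnorm(20), nrow = 4,
              dimnames = list(paste0("gene", 1:4), paste0("sample", 1:5)))

by_col <- scale(mat)        # standardizes each sample (column)
by_row <- t(scale(t(mat)))  # standardizes each gene (row)

# pheatmap can do the row scaling internally:
pheatmap(mat, scale = "row")
```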