Hello,
I'm totally new in bioinformatics and have learned by myself with tuto's and guides during the last few months.
My experiment is a RNA-Seq profiling of diurnal cultures : I have 2 strains (WT and mutant) and 3 biological replicates for each, with sampling every 4h for 28h, which makes 8 time points (0 - 4 - 8 - 12 - 16 - 20 - 24 - 28). Thus making a total of 48 sets of data.
I already did the whole process of mapping the reads and counting them to end up with both raw counts and FPKM. I'm using the FPKM log2-fold change values with MapMan. And I have started to use the counts for a Differential Gene Expression analysis in R with DESeq2. My goal is to analyze how the transcriptome (17741 transcripts) is evolving during a day/night cycle but also to identify and analyze the differences between my 2 strains.
I already built the DESeqDataSets and DESeqResults with a complete design (design=~strain + time + strain:time) and with a reduced design : (design=~strain) or (design=~time) But that's where I'm a bit lost, despite reading any manual I could find. I don't get what I will get for results, what I should do with those values, how I can use them for graphics, how to chose which test is more relevant, ...
I probably didn't expose my situation clearly enough for you specialists, but any help would be so much appreciated!
Thanks already! RĂ©mi
I'm not 100% sure I completely understand your question, but it seems like you want to find significantly differentially expressed transcripts from timepoint to timepoint but also wild-type to mutant. Starting with the latter because that's much more straightforward, you can recover differentially expressed genes between each condition at each timepoint (eg at 0 hours, genes X,Y,Z are up/down in mutant compared to control with pvalues ...). Then you can plot the FC by FPKM or FC by -pvalue in an MA plot or Volcano plot respectively. For the timeseries, DE testing is a little trickier because there's a lot of comparisons you can make. A clustering strategy then might be more informative to identify groups of genes that move together through the time series.
Hi Jacob,
Sorry for that late reaction and thanks for your comment! You perfectly understood my situation : i would like to find significantly differentially expressed transcripts over the time (48h in my case) but i also would like to compare wild-type results to mutant. I now get that I can't do both in the same analysis, can I? As you said, the most difficult will be for the timeseries. How should I do for clustering my transcripts?
Here's what I've done so far, maybe it will give you better information than when I try to describe it:
[1] "Intercept" "strain_mutant_vs_complemented" "time_12h_vs_0h" "time_16h_vs_0h"
[5] "time_20h_vs_0h" "time_24h_vs_0h" "time_28h_vs_0h" "time_4h_vs_0h"
[9] "time_8h_vs_0h" "strainmutant.time12h" "strainmutant.time16h" "strainmutant.time20h"
[13] "strainmutant.time24h" "strainmutant.time28h" "strainmutant.time4h" "strainmutant.time8h"
class: DESeqDataSet dim: 6 46 metadata(1): version assays(3): counts mu cooks rownames(6): Cre01.g000017.v5.5 Cre01.g000033.v5.5 ... Cre01.g000150.v5.5 Cre01.g000200.v5.5 rowData names(77): baseMean baseVar ... deviance maxCooks colnames(46): B0_1 B0_2 ... L28_2 L28_3 colData names(5): strain time strain_time phase sizeFactor
out of 17021 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 2203, 13% LFC < 0 (down) : 2320, 14% outliers [1] : 0, 0% low counts [2] : 1976, 12% (mean count < 2) [1] see 'cooksCutoff' argument of ?results [2] see 'independentFiltering' argument of ?results
I can think of a couple ways to do this but I'm not sure which is best. The first is to make the lists of DE genes for your comparisons in the WT data and mutant data. The comparison are up to you: as suggested by Constantine they can be all relative to 0 (0:4, 0:8, and so on) or between timepoints (0:4, 4:8 and so on). I also usually cutoff my DE lists (abs(FC) > 2, p-value < 0.05). So your DE lists for the WT and mutant would be all genes that are DE, in any one of the comparisons and passing the filter. Then, you can compare these two lists with a Venn diagram to see the genes that are DE in both datasets/ unique to one dataset etc.
The second way is to go the clustering route. There's a lot of clustering methods. For example using hierarchical clustering and cutting the dendrogram will usually recover time-profiles well enough. I wrote a tutorial for that here. Another option is using fuzzy means clustering with Mfuzz. I would cluster the data in both datasets independently and then compare cluster assignments. You could use a contingency table and fischers exact test to checkout the overlap.