I have FPKM values for a number of genes. I want to make a heatmap and find differentialy expressed genes. Here I have three questions:
- Should I use FPKM values directly to make a heatmap or is there anything which has to be done to FPKM values before using them to make a heatmap? I am using MeV to make a heatmap.
- How do I find differentially expressed genes here? I read that taking log2 values of FPKM and then setting the cutoff as 2 will give differentially expressed gene. Taking the cutoff as 2 sounds very arbitrary, is there any reason to take 2 as the cutoff?
- There are biological replicates also, each replicate has a specific FPKM value under each treatment. How do I treat these replicates?
Thanks
The "workflow" you describe for finding differentially expressed genes is not appropriate/optimal. A good starting point is the workflow from bioconductor for RNA-seq analysis, although it doesn't start from fpkm values.
Thats another problem. Most tools take count matrices as input.
So you only have FPKMs? No access to raw data?
yes no access to raw data. What best can i do with the FPKM values.
This sounds all to familiar like a class homework exercise.
Dear Kevin,
I'm a bit confused and hoping you can clear things up for me.
I was trying to make a heatmap of log2-transformed FPKM values but the plots do not make sense to me as genes do not change with conditions and biological replicates do not cluster (I can share code and data if you want). Is this because log2-transformed FPKM values are not comparable between samples? https://ibb.co/m5oLwG http://tinypic.com/r/acdfea/9
To give some background I'm making 3 heatmaps, one each for a different cancer cell line. For each of these cell lines they were exposed to multiple treatments: various chemicals (n=5) compared to the control (DMSO) as well as a number of shRNA (n=4) compared to the control (shCTR).
Users like ashish and WouterDeCoster are not being very helpful by suggesting another workflow (e.g. DESeq2). To make my point clear I just started a new job, and the sequencing facility has given me cuffdiff results so that is what i have to work with!
I know DESeq2 (and other methods which use the generalized linear model) are appropriate for more complex experimental designs as they can deal with multiple factors (explanatory variables) – e.g. experiments involving multiple treatments (my experimental design) and time-points (time course analysis). I told my boss this. The sequencing facility in the future will give me STAR alignments so I can do DESeq2 in the future (i.e. next project). However, currently all I have is these FPKM values and I need to make some figures which is holding back publication of a manuscript of one of the post-docs in our lab.
Okay so to clear things up if I feed my FPKM values into Limma it will "fit a linear model to your data and adjusts using empirical bayes" which would allow me to make some heatmaps and see replicates clustering and differential expression between samples?
Many thanks for the help.
Hi moldach,
Yes, I have seen heatmaps like yours in the past - the data distribution does not look optimal, but this is related to a problem with FPKM distribution, generally. As far as I understand, you took the significant genes from CuffDiff and then generated the heatmap from these?
Could you just try 2 more heatmaps with the following code (and note to use unlogged FPKM values):
I will certainly give this a try Kevin, and thank you, but first I need some clarification on where to get
MySignificantGenes
from. Is this found in the gene_exp.diff file output from CuffDiff? If so do I just filter by the column significant?I guess the part that confuses me is that there are a lot of comparisons comparisons going on (e.g. DMSO_1h vs. DMSO_6h, DMSO_1h vs. JQ1_1h, DMSO_1h vs JQ1_6h, DMSO_1h vs. JQMT_1h, DMSO_1h vs. JQMT_6h, DMSO_1h vs. MTX_1h, DMSO_6h vs. JQ1_1h, shCTR vs. shBRD4, shBRD4 vs. shMTHFD1, etc.).
My concern is that since CuffDiff can only do pairwise comparisons perhaps this is somehow not meaningful to look at?
Assuming I do use the gene_exp.diff file (6Gb) this is the following code I've run before I get some weird result:
So when I get to the
heat <- t(scale(t(MyUnLoggedFPKMValues)))
mynrow(heat)
is[1] 55939
. However, I'm getting weird results here:I compared gene names from
sig
and fromheat
to confirm that the genes are present fromsig
inheat
. Not sure why I'm getting this error?Yes, but it would be extremely difficult to extract just a single P value for each gene. That is, you have many conditions and time points and would require a more complex form of modelling, coupled with a very large sample size. Cuffdiff is mainly for comparisons between just two conditions.
In your situation, take the top 10 or 20 genes from each comparison, merge these into a vector and remove any duplicate genes that may have been significant in more than one comparison, and then use this list of significant genes to subset your FPKM matrix and to generate your heatmap (with FPKM values that you'll scale yourself using my code above). You should then see good separation. I would not drift away from the pairwise comparisons that have already been done - it may very well be the best way to analyse the data that you've got.
PS - if not implied in the CuffDiff files, it would be beneficial to find out if the P values have been adjusted by FDR!
Yes this is pretty much what I was thinking :(
Please note that I modified my comment above sometime ~during your answer to show what I've worked through so far on the code.
Your suggestion seems reasonable but a bit tricky for me to implement - I'll give it a try though. Most of the data.frame manipulation I've done to-date has been on small files in R (using dplyr for example) so I'm having a bit of difficulty trying to figure out how to do this in linux.
There are 5 chemical treatments (each with two time periods) and 4 shRNA treatments. So I did the pairwise comparison calculations for how many I need to do for each of the cell lines. PCs=[k(k-1)]/2. So [1413]/2=91
Let me think this out through pseudo-code now: 1)
cut -f 3,5,6,10,14 gene_exp.diff # filter out gene, sample1, sample2, log2(fold-change), and significant columns
2)
awk '$5 ~ /yes/ {print $0}' edit-degs.diff > degs.csv # Filter out only those genes which were significant in any pairwise comparison
3) Create 51 smaller datasets (one for each pairwise comparison) - I'm sure there has to be a more elegant way to do this?
4) Sort each of these datasets by log2(fold-change) (or FDR p-value) and grab the top 20.
5) Merge these into a vector
v1 <- merge(pc1, pc2)
thanv2 <- merge(v1, pc3)
, etc. 51 times or use plyrv1 <- join_all(pc1, pc2, ...., pc51)
Oh yeah and about the pairwise comparisons that have already been done well that's just the thing...I don't know/think they've been done yet. I certainly can't access them from the .html and can't see them in SCRATCH folder (just the cuffdiff_global folder). I guess that's a question I'll have to ask my sequencing facility.
If you already have the Cuffdiff results, then you already have differentially expressed genes.
To clarify things I used FPKM values from CuffLinks output not CuffDiff. Here is an image from the sequencing facility: http://i67.tinypic.com/qpj86u.jpg Specifically I used Genes (Symbols) which is the Gene FPKM file just with
ensembl_gene_names
added.The CuffDiff results seen here http://i65.tinypic.com/2ic0jle.png are from 816 pair-wise comparisons. The first week I started here none of these links were clickable and I didn't have access to the cluster to look at files in SCRATCH. Now that I have cluster access I can only see a cuffdiff_global folder http://i67.tinypic.com/ipun7s.png but no individual pairwise comparison results. Should I be using the genes.fpkm_tracking file instead?