Hello, I want to plot a heatmap for a number of genes, but the genes don't appear to change across samples, i.e. each gene has the same color in every sample (I can only see color differences between groups of genes).
What I do is:
I have RPKMs; this is a small part of the matrix:
Sample08_tumor Sample11_control Sample12_tumor Sample13_control
ENSG00000000938 21.785479 46.963138 13.277114 26.846941
ENSG00000001460 17.751131 13.812688 5.310846 10.325746
ENSG00000001461 96.017482 55.250750 71.696417 90.866568
ENSG00000002016 25.012958 24.862838 30.537363 32.009814
ENSG00000002079 0.000000 0.000000 0.000000 0.000000
ENSG00000002587 3.227478 8.287613 3.983134 3.097724
Sample14_tumor Sample15_control Sample16_tumor
ENSG00000000938 22.563645 25.073894 56.586824
ENSG00000001460 8.122912 4.178982 4.438182
ENSG00000001461 137.186963 121.190490 114.283193
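library(gplots)  # provides heatmap.2
# log2-transform every column in place; assigning into exprs[] keeps the gene IDs as row names
exprs[] <- lapply(exprs, log2)
# log2(0) is -Inf; set those entries to 0
exprs[exprs == -Inf] <- 0
## now cluster:
heatmap.2(as.matrix(exprs), distfun = function(x) dist(x, method = "euclidean"), hclustfun = function(x) hclust(x, method = "average"), tracecol = NA)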
Can anybody tell me why all genes don't change across samples? (I'm seeing different colors in bands, but not squares of colors)
Thank you.
That happens because of the large range of the RPKM values. To resolve this, you can log10-transform the RPKM values and use those for the heatmap.
I'm not sure that just logging the RPKM data is sufficient, or even that such counts should be used. RPKM and FPKM are not suitable for cross-sample comparisons because they don't adjust for differences in library sizes. Logging the data will not take this information into account either.
Pin.Bioinf, may I ask where you obtained this data?
RPKM/FPKM do account for library size differences; the main problem is that they are not very good at accounting for differences in the fragment diversity and the relative expression changes within the fragment pools. TPMs will be more suitable for that. More info
I'm not sure what you mean by that.
Generally, you may want to consider using scale = "row" with your command. As a side note, I tend to find the pheatmap package a lot more pleasant to interact with when optimizing a heatmap visualization. It offers all the functionalities of heatmap.2, but with more sensible default settings and more intuitive parameter names.
I'm reading the "More info" link, and there's a statement there that I'm unable to understand: how does dividing by a million (a constant number) "normalize for sequencing depth" (a variable between experiments)?
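For what it's worth, here is a rough sketch of what the scale = "row" suggestion above does: each gene is converted to z-scores across samples, so the colors show each gene's pattern across samples instead of its absolute level. The values are copied from the matrix in the question. The pseudocount of 1 and the filtering of constant genes are my own assumptions, not something the original poster did.

```r
# Toy matrix: four genes x four samples, values copied from the question
rpkm <- matrix(
  c(21.785479, 46.963138, 13.277114, 26.846941,
    17.751131, 13.812688,  5.310846, 10.325746,
    96.017482, 55.250750, 71.696417, 90.866568,
     0.000000,  0.000000,  0.000000,  0.000000),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("ENSG00000000938", "ENSG00000001460",
      "ENSG00000001461", "ENSG00000002079"),
    c("Sample08_tumor", "Sample11_control",
      "Sample12_tumor", "Sample13_control")))

# Assumption: a pseudocount of 1 avoids log2(0) = -Inf
log_rpkm <- log2(rpkm + 1)

# Genes with zero variance would give NaN when divided by their sd, so drop them
keep <- apply(log_rpkm, 1, sd) > 0
log_rpkm <- log_rpkm[keep, , drop = FALSE]

# Row z-scores: scale() works on columns, hence the double transpose
z <- t(scale(t(log_rpkm)))

# Each remaining gene now has mean 0 and sd 1 across samples
round(rowMeans(z), 10)
apply(z, 1, sd)

# With gplots installed, the same scaling can be requested directly:
# heatmap.2(log_rpkm, scale = "row", tracecol = NA)
```

Without row scaling, the color range is dominated by differences between genes (e.g. ENSG00000001461 vs. ENSG00000001460), which would explain seeing bands rather than squares.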
That's just not true... RPKM/FPKM do not adjust for total library size [edit] across all samples.
The formula is here:
[source: https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/]
They have the total aligned reads (to all transcripts) as denominator, multiplied by L (gene length). As the numerator they have 10 to the power of 9 multiplied by the reads aligned to the gene in question. This is not doing any sort of adequate adjustment for total library size.
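A minimal sketch of the formula as described above, in case it helps the discussion. The function name fpkm, the counts, and the gene lengths are all made up for illustration, and the sum of the three counts stands in for the total aligned reads of the sample.

```r
# Hypothetical helper following the formula described above:
# FPKM = reads_for_gene * 1e9 / (total_aligned_reads * gene_length_bp)
fpkm <- function(counts, lengths_bp) {
  counts * 1e9 / (sum(counts) * lengths_bp)
}

# Made-up counts and gene lengths for a single sample (illustration only)
counts     <- c(geneA = 100,  geneB = 500,  geneC = 400)
lengths_bp <- c(geneA = 1000, geneB = 2000, geneC = 4000)

shallow <- fpkm(counts, lengths_bp)
deep    <- fpkm(counts * 2, lengths_bp)  # same sample sequenced twice as deep

shallow
all.equal(shallow, deep)
```

The 1e9 is only a constant rescaling; the per-sample part is the division by that sample's own total. Doubling every count leaves the values unchanged, so a uniform change in depth is absorbed. What this sketch does not address is the case where the composition of the read pool differs between samples, which is the cross-sample concern raised in this thread.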
It's about time we end the use of FPKM, RPKM, FPKM-UQ... for good. Incorrect conclusions are being drawn from these types of data by people who are unaware of their failings.
I agree, it's not doing a good job. But it was meant to adjust for library sizes and it is part of the formula.
I think that [ again, like in the other thread, my friend! :) ] everything said here is correct, but we are again viewing it from different angles!
There is indeed a library size parameter in the formula, as one can see. However, the normalisation is performed per sample, i.e., some normalisation is made for library size within each sample.
What FPKM/RPKM do not do is adjust for library size differences across all samples in a study. What this means is that, for example, an FPKM value of 100 for geneX in Sample1 may actually reflect less expression than a value of 60 in Sample2 for the same gene. This happens if they are sequenced to different depths of coverage and therefore have different library sizes.
What worries me is that people from broad fields are now transitioning into bioinformatics and they do not understand these issues. Just the other day, I saw a presentation where a medical doctor / physician showed some sample data and had plotted FPKM values. In the worst scenario, s/he will draw a false conclusion about a particular gene if it was sequenced to different depths across her/his samples.
I agree, FPKM should never be used. But as long as Cufflinks exists and spits those out, you will see them.
@Friederike, you rock!
thanks, appreciated. but no worries, I generally don't need to be mollified during a professional discussion :)
This data was obtained by another bioinformatician who used to work for my colleague, and now she is asking me to do some more analyses on it.