Question

Heatmap based with FPKM values

5

Entering edit mode

7.7 years ago

Mehmet ▴ 820

Dear All,

I have a data set of 17 samples with FPKM values. In the data set I have ~800 differentially expressed genes some of which have zero FPKM value.

I would like to ask you about generating heatmap.

Can I use FPKM values directly for generating heatmap?
Do I have to perform additional analysis with FPKM values prior to generating heatmap?
According to your experience, which genes (based on FPKM values) should I use in heatmap analysis?

Thank you

RNA-Seq next-gen R • 23k views

ADD COMMENT • link updated 6.2 years ago by Biostar 20 • written 7.7 years ago by Mehmet ▴ 820

2

Entering edit mode

An update (6th October 2018):

You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:

Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis

The Total Count and RPKM [FPKM] normalization methods, both of which are still widely in use, are ineffective and should be definitively abandoned in the context of differential analysis.

Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units

The first thing one should remember is that without between sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

Kevin Blighe · Answer 1 · 2017-11-19

10

Entering edit mode

7.7 years ago

Kevin Blighe 89k

Yes, you can begin with FPKM values but you will have to transform these values and also filter the dataset with your 800 differentially expressed genes. On that note, 800 is a lot of genes. Try to increase your cut-offs for statistically significantly differentially expressed. Try things like:

FDR Q<0.05
FDR Q<0.01
FDR Q<0.001
FDR Q<0.0001

et cetera.

Use this code (below).

Your FPKM values will be stored in MyFPKMValues
DiffExpressedGenes will comprise a single vector of genes that are differentially expressed
zFPKM package will be used to convert your FPKM values to the Z-scale prior to clustering.

---------------------------

Set colour and heatmap scaling breaks

require("RColorBrewer")
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-3, 3, length.out=101)

Scale the FPKM values to the Z scale

library(zFPKM)
heat <- zFPKM(MyFPKMValues)

Filter your dataset to include your differentially expressed genes:

heat <- heat[which(rownames(heat) %in% DiffExpressedGenes), ]

Generate heatmaps with Euclidean distance (first) and '1 - Pearson correlation' distance (second) (both use Ward's linkage)

require("gplots")

#Euclidean distance
heatmap.2(heat,
  col=myCol,
  breaks=myBreaks,
  main="Title",
  key=T, keysize=1.0,
  scale="none",
  density.info="none",
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  trace="none",
  cexRow=0.2,
  cexCol=0.8,
  distfun=function(x) dist(x, method="euclidean"),
  hclustfun=function(x) hclust(x, method="ward.D2"))

#1-cor distance
heatmap.2(heat,
  col=myCol,
  breaks=myBreaks,
  main="Title",
  key=T, keysize=1.0,
  scale="none",
  density.info="none",
  reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
  trace="none",
  cexRow=0.2,
  cexCol=0.8,
  distfun=function(x) as.dist(1-cor(t(x))),
  hclustfun=function(x) hclust(x, method="ward.D2"))

To each heatmap command, you can add ColSideColors, which is a vector of colours for a condition of interest, such as case/controls. The order of this colour vector has to match the order of samples in your 'heat' object that you pass to heatmap.2

Note that converting to the Z scale is not exclusive: These guys median-centered their FPKM data and then log (base 2) transformed them prior to heatmap generation

ADD COMMENT • link 4.9 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

In this code:

heat <- heat[which(rownames(heat) %in% DiffExpressedGenes), ]

what is the DiffExpressedGenes) ?

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

0

Entering edit mode

In my FPKM data,

some genes have 0 (zero) FPKM values in some samples. Is it okay to remove those genes?

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

1

Entering edit mode

DiffExpressedGenes is a vector that contains the names of your differentially expressed genes. These names should all be part of the rownames of your FPKM matrix.

I believe you can include FPKM values that are 0. Why? - due to the fact that we scale / transform the data to the Z-scale, these values will return as -4 or -6 (or something else that is negative). You cannot have any NAs in your data, but zeros are okay.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, Thank you.

I created GeneIDs file, and I used heatmap.2(heat, col=myCol, breaks=myBreaks, main="Title", key=T, keysize=1.0, scale="none", density.info="none", reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean), trace="none", cexRow=0.2, cexCol=0.8, distfun=function(x) dist(x, method="euclidean"), hclustfun=function(x) hclust(x, method="ward.D2"))

but I received this error:

Error in heatmap.2(heat, col = myCol, breaks = myBreaks, main = "Title",  :

`x' must have at least 2 rows and 2 columns

this is my heat file:

Stage1 Stage2 Stage3  Stage17

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

1

Entering edit mode

It should contain genes as rows and samples as columns. The values should be numerical.

Most likely, your filtering of differentially expressed genes has been incorrect.

What are the contents of DiffExpressedGenes and how did you obtain these?

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

I printed the first column that has gene names using awk command.

the content of DiffExpressedGenes:

      V1
1 Gene_1303900.1
2 Gene_1038800.1
3 Gene_1579700.1
4 Gene_1031800.1
5 Gene_1031700.1
6 Gene_1579600.1

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

0

Entering edit mode

Hi Kevin,

Do you mean that the DiffExpressedGenes must have only gene names and the MyFPKMValues must have both gene names in each sample with FPKM values? If yes, my files are so.

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

0

Entering edit mode

Yes, that should be the situation. I am worried about the '.1' at the end of your gene names in DiffExpressedGenes, though. Are these names exactly as they appear in the rownames for MyFPKMValues?

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Yes, they are. I picked gene names from the MyFPKMValues.

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

1

Entering edit mode

Wait, DiffExpressedGenes looks like a data-frame.

Try this instead for the filtering step:

heat <- heat[which(rownames(heat) %in% DiffExpressedGenes[,1]), ]

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Now OK. but I received this error:

Error in hclust(x, method = "ward.D2") :

NA/NaN/Inf in foreign function call (arg 11)

Zero values have caused this error?How to deal with this error?

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

0

Entering edit mode

Hi Kevin,

OK. I omitted NaN values using: heat <- na.omit(heat)

Then I generated heatmap.

I would like to have your advice. In my FPKM data I have ~800 genes. These genes are effector genes (composed of three different gene families) and have FPKM values in each stage. To decrease number of genes to make heatmap more visible, what should I do? You mentioned to use FDR or log2 values, but I just have FPKM values. Is it okay to remove some genes that are under a threshold like 100 FPKM?

by the way, Thank you very much for your help at both PCA and heatmap post.

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

1

Entering edit mode

No problem, Mehmet.

From where did you obtain these FPKM values? Do you want to conduct differential expression analysis on them?

You could remove low FPKM genes - sure. You could also remove genes that have low variance (probably better).

Using something like this, you can identify which genes fall in the bottom half of the variance range:

geneVariances <- apply(MyFPKMValues, 1, var), na.rm=TRUE)

maxVariance <- max(geneVariances)

minVariance <- min(geneVariances)

cutoffVariance <- maxVariance - ((maxVariance - minVariance) / 2)

which(geneVariances<cutoffVariance)

The final line of code (which(geneVariances<cutoffVariance)) should return the indices of the genes that fall in the lower half of variance in your dataset.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

I generated from cuffnorm tool.

According to your advice I will remove genes that are produced from (which(geneVariances<cutoffVariance))

ADD REPLY • link 7.7 years ago by Mehmet ▴ 820

2

Entering edit mode

You can try CuffDiff to obtain a P value (if you even have samples that you want to compare via differential expression?).

The which() command will just return indices of genes, which you an use to filter.

Also be aware that Cufflinks is not supported anymore. You should aim to use HISAT2/StringTie the next time.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,,

I want to ask one more thing. How to show sample names in heatmap based on orders in FPKM data?

when I draw heatmap, sample order has changed. How can I keep the same order with the original file?

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

1

Entering edit mode

Hi Mehmet, wait, is this for ComplexHeatmap (from our other thread: how to cluster genes in heatmap ) or heatmap.2 (gplots)?

For ComplexHeatmap, just supply row_order or column_order as extra parameters to the Heatmap() function. These just have to be vectors of rownames or column names (that match your data matrix). ComplexHeatmap will then fix the order based on these.

For heatmap.2, you can maintain order as per your data-matrix by configuring RowV and ColV, e.g., RowV=FALSE, ColV=TRUE, will cluster your columns but the rows (genes) will be ordered as they appear in your supplied data-matrix.

ADD REPLY • link 7.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

For complexHeatmap:

I tried but I could not do.

What I did is: 1. I generate a table that has sample names

sample1
sample2
sample3

2. saved the table as "samplenames" 3. I added row_names = "samplenames" in heatmap function.

The order on the right side of the heatmap is the correct order based on my data.matrix.

What I want is to make the same order in the column order below the heatmap as it is in my data.matrix

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

1

Entering edit mode

Hi mate, no, samplenames should just be a vector. You can create it like this:

samplenames <- c("Sample1", "Sample2", ..., "Sample X")

or

samplenames <- colnames(ExpressionMatrix)

samplenames should not be a data-frame, even if it's a single column data-frame.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

I did but column names below the heatmap did not ordered as they are in the data.matrix. The annotation side is ordered as they are in the data.matrix.

What I did is:

samplenames <- c("Sample1", "Sample2", ..., "Sample X")

and

column_order = samplesnames

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

1

Entering edit mode

Oh, sometimes, because there are so any parameters in ComplexHeatmap, one parameter 'over-rides' another. Try also:

cluster_columns=FALSE and cluster_rows=FALSE

ADD REPLY • link 7.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

I did, but it did not worked.

I run this:

hmap <- Heatmap(heat, name="Z-score",col=colorRamp2(myBreaks, myCol), column_order = samplesnames, cluster_columns = FALSE, heatmap_legend_param=list(color_bar="continuous", legend_direction="horizontal", legend_width=unit(3,"cm"), title_position="topcenter", title_gp=gpar(fontsize=10, fontface="bold")), split=df$EffectorName, cluster_rows=TRUE, show_row_dend=TRUE, row_title="Transcript", row_title_side="left", row_title_gp=gpar(fontsize=5, fontface="bold"), show_row_names=TRUE, row_names_side="left", row_title_rot=0, cluster_columns=TRUE, show_column_dend=TRUE, column_title="Samples", column_title_side="top", column_title_gp=gpar(fontsize=10, fontface="bold"), column_title_rot=0, show_column_names=TRUE, clustering_distance_columns=function(x) as.dist(1-cor(t(x))), clustering_method_columns="ward.D2", column_dend_height=unit(5,"mm"), clustering_distance_rows="euclidean", clustering_method_rows="ward.D2", row_dend_width=unit(5,"mm"), top_annotation_height=unit(0.5,"cm"), top_annotation=ColAnn, bottom_annotation_height=unit(1, "cm"), bottom_annotation=boxAnnCol)

this is the error:

    Error in Heatmap(heat, name = "Z-score", col = colorRamp2(myBreaks, myCol),  : 
  formal argument "cluster_columns" matched by multiple actual arguments

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

1

Entering edit mode

Oh, you have cluster_columns mentioned twice. The first mention of it, it's TRUE, and the second it's FALSE.

Take a closer look at your code for the Heatmap function. You'll see it there twice.

ADD REPLY • link 7.6 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

I know I am asking too much help from you. I really appreciate for your help.

I wrote cluster_columns = TRUE but not changed order of samples. I also tried cluster_columns = FALSE but I received this error:

Error in if (sum(l1)) grid.points(x = rep(i, sum(l1)), y = x[, i][l1], : argument is not interpretable as logical Error when drawing annotation 'boxplot' Error in .local(object, ...) : Error in if (sum(l1)) grid.points(x = rep(i, sum(l1)), y = x[, i][l1], : argument is not interpretable as logical

It produced a graph with same order in matrix.data, except annotations, but a sample was NA.

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

0

Entering edit mode

Can you paste a sample of your data?

ADD REPLY • link 7.6 years ago by Kevin Blighe 89k

0

Entering edit mode

FamilyName GeneID Egg_Ir12 Egg_Ir13 L2_Ir14 L2_Ir15 L3_Ir200 L3_Ir201 L4_Ir203 L4_Ir202 FA_Ir205 FA_Ir204 MA_Ir206 MA_Ir207 D3_Ir1279 D3_Ir1278 D4_Im1215 D4_Im1210 D4_Im1214 
GH16 gene_1303900.1 25.7847 19.7103 22.6148 26.5067 25.0661 23.3288 25.0834 24.1146 111.904 116.042 29.1163 27.7907 22.3014 21.3545 66.5493 64.4022 53.0803
GH16 gene_1038800.1 2.34831 2.95942 6.84916 7.94771 29.7827 28.9562 53.2878 48.1816 84.8273 76.8617 68.5481 57.4876 41.1048 37.7387 4.80982 0 0
GH16 gene_1579700.1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4.72954 0 0
GH16 gene_1031800.1 0.584824 0 0 0 0.726056 1.4398 5.05124 2.66249 0.576734 1.89816 0.894391 2.67472 0 0 0

========================================================================

ADD REPLY • link updated 7.6 years ago by Kevin Blighe 89k • written 7.6 years ago by Mehmet ▴ 820

0

Entering edit mode

I am trying to make the data understandable, but I do not know how to do.

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

0

Entering edit mode

GH16 is FamiliyName. Egg, L2, L3 ... sample name

ADD REPLY • link 7.6 years ago by Mehmet ▴ 820

0

Entering edit mode

Ensure that there are no NAs in your data. Also, your gene 'gene_1579700.1' has virtually no data...

You may want to post a new question if you still have problems.

For clustering, a data-matrix with no NA values must be used. Also, for the distance metrics to function properly, too many zero values will cause problems.

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi, Kevin , I followed the same code, but for heatmap2(heat,...) , I got an error message saying x should be numeric. When tried heat with default heatmap2, I got the same error message. Please help me. my heat is a data frame gene names are row names and zFPKM are values. How can I fix it? Thx .

ADD REPLY • link 5.0 years ago by Aynur ▴ 60

0

Entering edit mode

Hi, you just need data.matrix(heat)

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks, so I got this heatmap, but row names are not visible. Also, how can I get rid of that zip zag lines? How can I make row names visible? I also want to add a dendrogram. Thanks
para kasa resmi

ADD REPLY • link 5.0 years ago by Aynur ▴ 60

0

Entering edit mode

This is a consistent problem with these old heatmap functions, unfortunately. If you have time, you could take a look at ComplexHeatmap, where this issue is tackled. I have a tutorial that you can follow: https://github.com/kevinblighe/E-MTAB-6141

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Actually, in the latest versions of gplots::heatmap.2(), if you just increase the value of cexRow, it should place them such that no names overlap (meaning that some names will also be excluded from the final figure).

ADD REPLY • link 5.0 years ago by Kevin Blighe 89k

0

Entering edit mode

the zFPKM link (https://bioconductor.org/packages/release/bioc/html/zFPKM.html) is down.