Question

To What Degree Could We Determine Whether A Gene Is Expressed?

7

Entering edit mode

10.7 years ago

liupfskygre ▴ 210

Hi all, A question came to my mind when analysis gene expression using RNA-seq. After using DEseq2, a large number of genes were listed as differential expressed (e.g. FDR<0.01 or 0.05, absolute of log2 fold change >1). When took look at the FPKM of those genes, lots of them were very small. I think took all of them for further analysis is time consuming and also included a bunch of noises.

I saw a paper use the FPKM of one housekeeping gene they used to do QPCR as the cutoff. (http://aem.asm.org/content/79/7/2397.abstract)

In my opinion, use the mean FPKM (total FPKM of genes/gene number) of each condition as a cutoff/threshold might be a choice.

Any other thoughts on this issue are greatly appreciated!

fpkm rna-seq • 9.3k views

ADD COMMENT • link updated 17 months ago by shelkmike ★ 1.4k • written 10.7 years ago by liupfskygre ▴ 210

0

Entering edit mode

Are you removing genes with very low overall read counts from your analysis?

ADD REPLY • link 10.6 years ago by Biomonika (Noolean) 3.2k

Ram · Answer 1 · 2014-03-06

10

Entering edit mode

10.7 years ago

Ryan Dale 5.0k

This 2013 BMC Genomics paper describes zFPKM, which they use to set a single threshold for yes/no expression across a bunch of experiments. Their methods are clear and straightforward to implement -- here's how you might calculate zFPKM in Python:

from scipy.stats import gaussian_kde
import numpy as np

# assume "fpkm" is a NumPy array of log2(fpkm) values.

kernel = gaussian_kde(fpkm)
xi = np.linspace(fpkm.min(), fpkm.max(), 100)
yi = kernel.evaluate(xi)
mu = xi[np.argmax(yi)]
U = fpkm[fpkm > mu].mean()
sigma = (U - mu) * np.sqrt(np.pi / 2)
zFPKM = (fpkm - mu) / sigma

Then you can choose a zFPKM value to use as your threshold. They chose zFPKM = -3 for human RNA-seq after comparing with active/repressive histone mod data.

ADD COMMENT • link 10.7 years ago by Ryan Dale 5.0k

0

Entering edit mode

thank you! this is what I am looking for!

ADD REPLY • link 10.7 years ago by liupfskygre ▴ 210

0

Entering edit mode

Hi Daler! I am new to programming and am having some trouble creating a NumPy array. When I use kernel=gaussian_kde(fpkm), I get this error:

>>> kernel=gaussian_kde(fpkm1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/kde.py", line 188, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

I was wondering if you could please explain how to create a NumPy array from an Excel, .txt or .csv file? I appreciate your help so much!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by kdixo100 • 0

0

Entering edit mode

It's impossible to tell what happened here without knowing how you created fpkm1. But you might want to take a look at the NumPy genfromtxt() docs or the Pandas IO tools documentation.

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.2 years ago by Ryan Dale 5.0k

0

Entering edit mode

Okay thank you! I'll take a look at those.

ADD REPLY • link 10.2 years ago by kdixo100 • 0

0

Entering edit mode

In the original paper about zFPKM (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3870982) the authors discarded all values of log2(FPKM) that were lower than -15 before performing regression. Your function doesn't do this, and this reduces the accuracy of zFPKM calculation.

ADD REPLY • link 17 months ago by shelkmike ★ 1.4k

Ram · Answer 2 · 2015-02-06

Hi,

Can I apply the zFPKM approach by RPKM data(zRPKM)?

Or which kind of normalization methods are compatible with this method?

By the way, here you can find the R script for zFPKM normalization.

I inspired from the python code which has given above.

install.packages("ks","pracma")

library(ks)
library(pracma)

/* fpkm is an example data */

fpkm <- c(1,2,3,4,5,6,7,8,4,5,
          6,5,6,5,6,5,5,5,5,6,
          6,78,8,89,8,8,8,2,2,
          2,1,1,4,4,4,4,4,4,4,
          4,4,4,4,3,2,2,3,23,2,
          3,23,4,2,2,4,23,2,2,
          24,4,4,2,2,4,4,4,2,2,
          4,4,2,2,4,2,45,5,5,5,
          3,2,2,4,4,4,4,4,4,4,4,
          4,3,2,2,3,23,2,3,23,4,
          2,2,4,23,2,2,24,4,4,2,
          2,4,4,4,2,2,4,4,2,2,4,
          2,45,5,5,5,3,2,2)

xi = linspace(min(fpkm),max(fpkm),100)
fhat = kde(x=fpkm,gridsize=100,eval.points=xi)

/* here I put digits=0. if I you do not round the numbers(yi) the results are a little bit changing.*/

yi = round(fhat$estimate,digits=0) 
mu = xi[which.max(yi)]
U = mean(fpkm[fpkm>mu])
sigma = (U-mu)* (sqrt(pi/2))
zFPKM = (fpkm - mu) / sigma

Cankut CUBUK
Computational Genomics Program - Systems Genomics Lab
Centro de Investigación Príncipe Felipe (CIPF)
C/ Eduardo Primo Yúfera nº3 46012 Valencia, Spain
http://bioinfo.cipf.es

score 1 · Answer 3 · 2014-05-02

I use log2(RPKM + 0.1) values for differential expression to minimize the number of low coverage genes with unreasonably high differential expression values.

You can see the relevant benchmarks for various rounding values here:

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

The portions of that paper comparing differential expression algorithms are also summarized in this blog post:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

score 0 · Answer 4 · 2014-03-06

From my (limited) experience: If you want more conservative results, try the newest approach from EdgeR package - function glmQLFTest(). It always gave me smaller number of significant genes than DEseq2 and classic EdgeR (in the form that is described in manual), as well as other tools like CLCBio. In addition, no or almost no genes reported by glmQLFTest() were tool-specific. It seems to report almost exactly genes reported by all the other tools I tried and no more, which makes me believe in these results a bit more. It may of course not work this way with your data and you will probably need to set some cutoff even if it will, but it is something to start with.

Ram · Answer 5 · 2014-05-02

Can you clarify, is your question how to tell if a gene is expressed, or differentially expressed?

In the latest release of DESeq2, we have implemented functionality to test for differential expression above a critical threshold, as sometimes users are not interested in all statistically significant genes, but actually wanted to find large fold changes. This can be accomplished in with:

results(dds, lfcThreshold=x, altHypothesis="greaterAbs")

Where x is the critical threshold specified by the user. The fold changes are in the log2 space, so x=1 finds fold changes > 2 and < 1/2, x=2 finds fold changes > 4 and < 1/4, etc. Note that this will provide p-values testing the null hypothesis that |β| < x, which is not the same as testing the null of |β| = 0 and filtering out |β| < x.