To What Degree Could We Determine Whether A Gene Is Expressed?
5
7
Entering edit mode
10.7 years ago
liupfskygre ▴ 210

Hi all, A question came to my mind when analysis gene expression using RNA-seq. After using DEseq2, a large number of genes were listed as differential expressed (e.g. FDR<0.01 or 0.05, absolute of log2 fold change >1). When took look at the FPKM of those genes, lots of them were very small. I think took all of them for further analysis is time consuming and also included a bunch of noises.

I saw a paper use the FPKM of one housekeeping gene they used to do QPCR as the cutoff. (http://aem.asm.org/content/79/7/2397.abstract)

In my opinion, use the mean FPKM (total FPKM of genes/gene number) of each condition as a cutoff/threshold might be a choice.

Any other thoughts on this issue are greatly appreciated!

fpkm rna-seq • 9.3k views
ADD COMMENT
0
Entering edit mode

Are you removing genes with very low overall read counts from your analysis?

ADD REPLY
10
Entering edit mode
10.7 years ago
Ryan Dale 5.0k

This 2013 BMC Genomics paper describes zFPKM, which they use to set a single threshold for yes/no expression across a bunch of experiments. Their methods are clear and straightforward to implement -- here's how you might calculate zFPKM in Python:

from scipy.stats import gaussian_kde
import numpy as np

# assume "fpkm" is a NumPy array of log2(fpkm) values.

kernel = gaussian_kde(fpkm)
xi = np.linspace(fpkm.min(), fpkm.max(), 100)
yi = kernel.evaluate(xi)
mu = xi[np.argmax(yi)]
U = fpkm[fpkm > mu].mean()
sigma = (U - mu) * np.sqrt(np.pi / 2)
zFPKM = (fpkm - mu) / sigma

Then you can choose a zFPKM value to use as your threshold. They chose zFPKM = -3 for human RNA-seq after comparing with active/repressive histone mod data.

ADD COMMENT
0
Entering edit mode

thank you! this is what I am looking for!

ADD REPLY
0
Entering edit mode

Hi Daler! I am new to programming and am having some trouble creating a NumPy array. When I use kernel=gaussian_kde(fpkm), I get this error:

>>> kernel=gaussian_kde(fpkm1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scipy/stats/kde.py", line 188, in __init__
    raise ValueError("`dataset` input should have multiple elements.")
ValueError: `dataset` input should have multiple elements.

I was wondering if you could please explain how to create a NumPy array from an Excel, .txt or .csv file? I appreciate your help so much!

ADD REPLY
0
Entering edit mode

It's impossible to tell what happened here without knowing how you created fpkm1. But you might want to take a look at the NumPy genfromtxt() docs or the Pandas IO tools documentation.

ADD REPLY
0
Entering edit mode

Okay thank you! I'll take a look at those.

ADD REPLY
0
Entering edit mode

In the original paper about zFPKM (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3870982) the authors discarded all values of log2(FPKM) that were lower than -15 before performing regression. Your function doesn't do this, and this reduces the accuracy of zFPKM calculation.

ADD REPLY
3
Entering edit mode
9.7 years ago
cankutcubuk ▴ 190

Hi,

Can I apply the zFPKM approach by RPKM data(zRPKM)?

Or which kind of normalization methods are compatible with this method?

By the way, here you can find the R script for zFPKM normalization.

I inspired from the python code which has given above.

install.packages("ks","pracma")

library(ks)
library(pracma)

/* fpkm is an example data */

fpkm <- c(1,2,3,4,5,6,7,8,4,5,
          6,5,6,5,6,5,5,5,5,6,
          6,78,8,89,8,8,8,2,2,
          2,1,1,4,4,4,4,4,4,4,
          4,4,4,4,3,2,2,3,23,2,
          3,23,4,2,2,4,23,2,2,
          24,4,4,2,2,4,4,4,2,2,
          4,4,2,2,4,2,45,5,5,5,
          3,2,2,4,4,4,4,4,4,4,4,
          4,3,2,2,3,23,2,3,23,4,
          2,2,4,23,2,2,24,4,4,2,
          2,4,4,4,2,2,4,4,2,2,4,
          2,45,5,5,5,3,2,2)

xi = linspace(min(fpkm),max(fpkm),100)
fhat = kde(x=fpkm,gridsize=100,eval.points=xi)

/* here I put digits=0. if I you do not round the numbers(yi) the results are a little bit changing.*/

yi = round(fhat$estimate,digits=0) 
mu = xi[which.max(yi)]
U = mean(fpkm[fpkm>mu])
sigma = (U-mu)* (sqrt(pi/2))
zFPKM = (fpkm - mu) / sigma

Cankut CUBUK
Computational Genomics Program - Systems Genomics Lab
Centro de Investigación Príncipe Felipe (CIPF)
C/ Eduardo Primo Yúfera nº3 46012 Valencia, Spain
http://bioinfo.cipf.es

ADD COMMENT
0
Entering edit mode

Hi,

I was trying to use your script in R, but maybe there are some problem with infinite values.

As far as I understood, the idea is use logFPKM as input, isn't it?

when I run the script I have this error:

Error en seq.default(x1, x2, length.out = n) :
  'from' cannot be NA, NaN or infinite

Maybe you can help me to resolve it? thanks!!

ADD REPLY
0
Entering edit mode

I already resolve the problem of infinite values, however the zFPKM that I obtained are all positive... so I consider that there is other problem...

ADD REPLY
1
Entering edit mode
10.5 years ago

I use log2(RPKM + 0.1) values for differential expression to minimize the number of low coverage genes with unreasonably high differential expression values.

You can see the relevant benchmarks for various rounding values here:

http://bioinfo.aizeonpublishers.net/content/2013/6/285-292.html

The portions of that paper comparing differential expression algorithms are also summarized in this blog post:

http://cdwscience.blogspot.com/2013/11/rna-seq-differential-expression.html

ADD COMMENT
0
Entering edit mode
10.7 years ago
jockbanan ▴ 440

From my (limited) experience: If you want more conservative results, try the newest approach from EdgeR package - function glmQLFTest(). It always gave me smaller number of significant genes than DEseq2 and classic EdgeR (in the form that is described in manual), as well as other tools like CLCBio. In addition, no or almost no genes reported by glmQLFTest() were tool-specific. It seems to report almost exactly genes reported by all the other tools I tried and no more, which makes me believe in these results a bit more. It may of course not work this way with your data and you will probably need to set some cutoff even if it will, but it is something to start with.

ADD COMMENT
0
Entering edit mode

thanks for this suggestion. I will try it later!

ADD REPLY
0
Entering edit mode
10.5 years ago
Michael Love ★ 2.6k

Can you clarify, is your question how to tell if a gene is expressed, or differentially expressed?

In the latest release of DESeq2, we have implemented functionality to test for differential expression above a critical threshold, as sometimes users are not interested in all statistically significant genes, but actually wanted to find large fold changes. This can be accomplished in with:

results(dds, lfcThreshold=x, altHypothesis="greaterAbs")

Where x is the critical threshold specified by the user. The fold changes are in the log2 space, so x=1 finds fold changes > 2 and < 1/2, x=2 finds fold changes > 4 and < 1/4, etc. Note that this will provide p-values testing the null hypothesis that |β| < x, which is not the same as testing the null of |β| = 0 and filtering out |β| < x.

ADD COMMENT
0
Entering edit mode

liupfskygre was hoping for a method to determine expression (presumably akin to that used by mas5 on affy arrays), mostly as a method for filtering DE results. Of course there's no great biological definition to go by there, so any answer would be a bit hand wavy. BTW, welcome to biostars.

ADD REPLY

Login before adding your answer.

Traffic: 1032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6