Calculating FPKM from htseq-count results
6.4 years ago

Hi everyone,

I am trying to calculate FPKM values from the htseq-count results. I think I already have the gene.size values for each transcript, but "dds" contains more rows than gene.size because the dds list includes NR##### entries (non-coding RNAs).

When I tried to use

 mcols(dds)$basepairs <- gene.size
I got this error: Error in `[[<-`(`*tmp*`, name, value = list(gene = 1:33398, length = c(6363L, : 33398 elements in value to replace 33420 elements

I am wondering if anybody can help with this. I am also not sure whether dds and gene.size are ordered in the same way. Many thanks!

Wei S

RNA-Seq R htseq fpkm • 2.7k views

Are you sure FPKM is what you want? It's not a good normalization method.


I am a beginner at RNA-Seq. I already got the list of differentially expressed genes from DESeq2. Now I am trying to get expression values for all the genes so that I can do some other analyses on just the controls. FPKM is the only measure I know of for this; I am not sure whether there are other values I could use. Thanks a lot!


Just do counts(dds, normalized=TRUE) to access the normalised counts.


As per Wouter, it is a bad idea to use FPKM for differential expression comparisons. If you are looking to do downstream analyses from the DESeq2 counts, then obtain the regularised log or variance-stabilised counts via rlog() and vst(), respectively.


Thanks a lot! I think I already ran counts(dds, normalized=TRUE) during the differential expression analysis, but does DESeq2 just normalize to the total number of reads in the library?


DESeq2 does indeed adjust for that ('library size') via the calculation of size factors. In making statistical inferences, it also models and adjusts for dispersion (see A: Clarification on how DSEeq2 Dispersion Curve is Generated) and for fold change differences at low count values.
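To make the size-factor idea concrete, here is a minimal base-R sketch of the median-of-ratios calculation that DESeq2's size factors are based on. It is an illustration on a made-up count matrix, not DESeq2's actual code, and the names `counts`, `size_factors` and `norm_counts` are placeholders. Note that the factor is a median ratio against a per-gene reference rather than a simple division by the total read count.

```r
# Toy count matrix: genes in rows, samples in columns (made-up numbers).
counts <- matrix(
  c(100, 200,
     50, 100,
     30,  60,
      0,  10),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("sampleA", "sampleB"))
)

# Per-gene geometric mean across samples, computed on the log scale.
log_geo_means <- rowMeans(log(counts))

# A gene with a zero count in any sample gets -Inf here; leave it out.
usable <- is.finite(log_geo_means)

# Size factor per sample: median ratio of its counts to the geometric means.
size_factors <- apply(counts[usable, , drop = FALSE], 2, function(k) {
  exp(median(log(k) - log_geo_means[usable]))
})

# Normalised counts: divide each sample (column) by its size factor.
norm_counts <- sweep(counts, 2, size_factors, "/")
```

In this toy example sampleB has exactly twice the counts of sampleA for every usable gene, so after normalisation those genes have equal values in both samples.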

6.4 years ago

While I often find it useful to use programs like edgeR / DESeq2 / limma-voom for p-value calculations, I would say it is also useful to have log2(FPKM + 0.1) values for visualization (QC plots, heatmaps, etc.). Although there are several ways to calculate gene length, the log-transformed expression should look closer to a normal distribution (at least per gene) whichever method is used.

Also, you may sometimes find a gene that is clearly differentially expressed (which you can see from the direct expression calculation) but is not identified by at least one of the methods above. In certain scenarios, calculating p-values with standard methods in R (such as aov() for ANOVA or lm() for linear regression) on log-transformed FPKM values (or some other normalized expression value) can be a useful option in addition to the count-based methods.

Additionally, the 'edgeR' package has the functions rpkm() and cpm(), whose output you could then log-transform to create your own figures (for a QC plot, this does not require running the edgeR or edgeR-robust differential expression step that calculates p-values).
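For reference, the FPKM arithmetic itself is simple enough to sketch in base R. The example below is a hedged illustration on made-up data: the gene IDs, the `gene_size` table and its `gene` / `length` columns are hypothetical stand-ins for the poster's objects. It also shows one way to handle the 33398-versus-33420 mismatch from the question: align the lengths to the count rows by gene ID with match(), so that genes without a length become NA instead of shifting the other values.

```r
# Toy count matrix (made-up numbers); rows carry gene IDs, like rownames(dds).
counts <- matrix(
  c(500, 1000,
    250,  500,
    250,  500),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("NM_0001", "NM_0002", "NR_0003"), c("ctrl1", "ctrl2"))
)

# Hypothetical length table: different row order, and no entry for NR_0003.
gene_size <- data.frame(
  gene   = c("NM_0002", "NM_0001"),
  length = c(1000L, 2000L)
)

# Align lengths to the count rows by gene ID; unmatched genes get NA.
len <- gene_size$length[match(rownames(counts), gene_size$gene)]
names(len) <- rownames(counts)

# FPKM = count / (gene length in kb) / (library size in millions).
lib_size <- colSums(counts)
fpkm <- sweep(counts / (len / 1e3), 2, lib_size / 1e6, "/")

# Log-transform with the 0.1 pseudocount mentioned above.
log_fpkm <- log2(fpkm + 0.1)
```

A vector built like `len` has exactly one entry per count row, in the same order as the rows, which is what an assignment such as mcols(dds)$basepairs expects. Here the library size is taken as the column sum of the count matrix, which is an assumption; other definitions (for example, total mapped reads) give different values.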
