Hi,
Interested in converting some raw HT-counts to FPKMs and wondering how everyone approaches this problem? I have a few ideas but was interested in the consensus view. Thank you!
Rob.
It's best not to, but if you really must, then a common approach is to take either the median transcript length or the length of the "union gene model" as the K in the FPKM. Aside from that, it's counts / (length in kb) / (total counts in millions). Note that if you want to compare between samples, you should use normalized counts, since FPKMs made from raw counts are inappropriate for comparison between samples (which is among the reasons it's best not to bother with FPKMs).
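As a minimal sketch of the arithmetic described above, here is the calculation in base R. The gene names, counts, and lengths are made-up toy values, and using the sum of gene-assigned counts as the "per million" denominator is an assumption; htseq-count's special rows (such as `__no_feature`) would need to be dropped first.

```r
# Toy raw counts per gene (hypothetical values, e.g. from htseq-count).
counts <- c(geneA = 500, geneB = 1500, geneC = 8000)

# Toy gene lengths in bp (hypothetical), deliberately in a different order.
lengths_bp <- c(geneC = 4000, geneA = 1000, geneB = 3000)

# Pair lengths with counts by gene ID rather than by position.
lengths_bp <- lengths_bp[names(counts)]

# FPKM = counts / (length in kb) / (total counts in millions)
fpkm <- counts / (lengths_bp / 1e3) / (sum(counts) / 1e6)
print(fpkm)
```

Matching by name is what pairs gene-level lengths with gene-level counts; any gene missing from the length table would show up as `NA` and should be checked before dividing.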
Great, thanks. Is there any particular place where it's easiest to download a list of transcript lengths? I guess the UCSC Table Browser?
I usually just calculate it from GTF files, but if you can get it from UCSC then all the easier :)
I usually do as well, just wondering if it was just somewhere in UCSC. I usually do the following in R:
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("test.gtf", format="gtf")
trans <- transcripts(txdb, columns=c("GENEID"))
df <- data.frame(gene=trans$GENEID, len=width(trans))
I am wondering, though: given that htseq-count outputs counts per gene irrespective of alternative transcripts, how best to pair the lengths from the R commands above with the htseq-count counts?
Thanks.
Hmm, won't that be the length of the transcript on the genome, rather than their transcribed lengths? I have an old script that will produce the "union gene model" length from a GTF file, which I suspect will be a bit more reasonable. That would also match better with what htseq-count and featureCounts are doing.
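For illustration, the "union gene model" length mentioned above can be sketched in base R: merge each gene's overlapping exons and sum what remains. The toy exon table and the helper `union_length` are hypothetical and are not the script referred to in this thread. With GenomicFeatures, a commonly used equivalent is `sum(width(reduce(exonsBy(txdb, by = "gene"))))`.

```r
# Length of the union of 1-based, inclusive intervals (GTF-style coordinates).
# Hypothetical helper for illustration only.
union_length <- function(start, end) {
  o <- order(start, end)
  start <- start[o]
  end <- end[o]
  total <- 0
  cur_s <- start[1]
  cur_e <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_e + 1) {
      # Overlapping or adjacent exon: extend the current merged block.
      cur_e <- max(cur_e, end[i])
    } else {
      # Gap: close the current block and start a new one.
      total <- total + (cur_e - cur_s + 1)
      cur_s <- start[i]
      cur_e <- end[i]
    }
  }
  total + (cur_e - cur_s + 1)
}

# Toy exons from all transcripts of two genes (made-up coordinates).
exons <- data.frame(
  gene  = c("A", "A", "A", "B"),
  start = c(100, 150, 500, 1000),
  end   = c(200, 300, 600, 1099)
)

# One union-model length per gene, named by gene ID.
gene_len <- sapply(split(exons, exons$gene),
                   function(g) union_length(g$start, g$end))
print(gene_len)
```

Because the result is named by gene ID, it can be matched directly against gene-level counts.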
I believe you're right. I'm giving the script you linked a shot as well. Just so I understand completely: what is the background fasta you're using here? Also, in this instance, is the GC length being used as a proxy for transcribed length? Thanks for the help!
You can remove the GC related stuff and the fasta file, you don't need that. If you look at the README file in that directory you'll note that this was really made for CQN, which needs GC content.
Hi Devon, could you please explain a little more why it is best not to convert raw RNA-seq counts to FPKM values? You also said "if you want to compare between samples, you should use normalized counts, since FPKMs made from raw counts are inappropriate for comparison between samples"; what does "normalized counts" stand for there? I was thinking FPKMs are a kind of normalized counts. Do you mean CPM, or, more generally, normalization for library size?
Regarding FPKMs, the reasons behind this have been repeated so often that I won't bother doing so again. Please simply search this site for them.
Regarding "normalized counts", please search for "RNAseq normalized counts" with Google. In essence, these are any counts resulting after correcting for library size in a robust manner (i.e., not FPKM or CPM).
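As one example of what "robust" library-size correction can look like, below is a base R sketch of median-of-ratios size factors, the idea popularized by DESeq. This is an illustration chosen here, not necessarily the method the answer has in mind, and not DESeq2's actual implementation; the function name `size_factors` and the toy matrix are hypothetical.

```r
# Median-of-ratios size factors for a genes x samples count matrix.
size_factors <- function(counts) {
  log_counts <- log(counts)
  # Log geometric mean per gene; genes with any zero count become -Inf.
  log_geo_mean <- rowMeans(log_counts)
  keep <- is.finite(log_geo_mean)
  # Per sample: median log-ratio to the geometric mean, back-transformed.
  apply(log_counts[keep, , drop = FALSE], 2,
        function(col) exp(median(col - log_geo_mean[keep])))
}

# Toy data (made up): sample s2 is sequenced about twice as deeply as s1.
counts <- matrix(c(10, 20,
                   100, 200,
                   50, 100,
                   0, 4),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "g3", "g4"), c("s1", "s2")))

sf <- size_factors(counts)

# Normalized counts: divide each sample's column by its size factor.
norm_counts <- sweep(counts, 2, sf, "/")
print(sf)
print(norm_counts)
```

Using a median rather than the column total means a handful of very highly expressed genes cannot dominate the scaling, which is the main difference from CPM-style normalization.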