Hello!
I am counting reads with htseq-count and wasted some hours trying to find existing software that would calculate FPKM and/or TPM from that output, so I wrote a script myself.
There is just one question mark - should the denominator (the sum of reads within the sample) be the sum of the reads that htseq-count successfully mapped to a feature, or the sum of reads in the input bam file?
And, if you happen to know, is it terrible to set the effective length to 1 when it would otherwise come out negative?
Big thanks in advance!
Quick ref: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
It is commonly the number of mapped reads, as only these are relevant. Imagine you had 50% contaminant reads in your library, so 50% of the reads do not reflect your gene expression. Using the sum of all reads as the denominator would underestimate the true expression by roughly 50%.
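To put numbers on the contaminant example, here is a minimal sketch in Python. The counts, gene length, and library sizes are made up for illustration, and `fpkm` is a hypothetical helper, not a function from any package.

```python
def fpkm(count, length_bp, total_fragments):
    """Fragments per kilobase of feature per million fragments in the denominator."""
    return count * 1e9 / (length_bp * total_fragments)


# Hypothetical gene: 500 fragments assigned to a 2,000 bp feature.
assigned_total = 10_000_000    # fragments assigned to features
all_reads_total = 20_000_000   # same library, counting the 50% contaminant reads too

fpkm_assigned = fpkm(500, 2000, assigned_total)
fpkm_all = fpkm(500, 2000, all_reads_total)

# Doubling the denominator halves the reported value for every gene.
print(fpkm_assigned, fpkm_all)
```

The gene's count and length are identical in both calls; only the denominator changes, and the value drops by half.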
Do you mean 'mapped' as in mapped to the genome, or mapped to an exon in my gtf file? The difference between these two numbers is quite large. Several million reads!
Typically mapped = assigned to the exome. Commonly one calculates it from a count matrix, where the denominator is the sum of the sample's column, which represents the reads that were mapped and successfully assigned to features. What do you need the FPKMs for? I hope not differential expression?
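As a rough sketch of the column-sum approach, assuming the htseq-count output for one sample has already been read into a dict and that gene lengths come from your own annotation: `fpkm_from_counts` and the example values are hypothetical. The only htseq-count detail relied on is that its special counters (such as `__no_feature` and `__ambiguous`) start with a double underscore, so they can be excluded from the sum.

```python
def fpkm_from_counts(counts, lengths_bp):
    """Compute FPKM per gene for one sample.

    counts:     dict of feature id -> raw count (one htseq-count column)
    lengths_bp: dict of feature id -> feature length in base pairs
    """
    # Drop htseq-count's special counters; they are not assigned to any feature.
    assigned = {g: c for g, c in counts.items() if not g.startswith("__")}

    # Denominator = column sum = fragments successfully assigned to features.
    total = sum(assigned.values())
    if total == 0:
        raise ValueError("no reads were assigned to any feature")

    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in assigned.items()}


# Made-up example values.
example_counts = {"geneA": 300, "geneB": 700, "__no_feature": 5000}
example_lengths = {"geneA": 1000, "geneB": 2000}
example_fpkm = fpkm_from_counts(example_counts, example_lengths)
print(example_fpkm)
```

Here the 5,000 `__no_feature` reads do not enter the denominator, which is 1,000 in this toy example.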
OK thanks!!
I've wondered about the FPKMs myself since TPMs seem better (but not even TPMs are wholly 'liked' by the community it seems), but the others in my research group said FPKM is the standard measure in our field (leukemia), so I just rolled with it. Somehow the libraries are supposed to be prepared in such a way that we can do inter-sample comparisons even without e.g. house-keeping genes. I'm new here though so can't tell you any details.
Neither of these units is a proper normalization technique for inter-sample comparison. Check the biostats literature comparing normalization techniques. Per-million methods regularly fail or perform poorly. You should use a proper framework like edgeR or DESeq2 for normalization and differential expression.
Many things one should do, yes. PI wants FPKM, I oblige.
Thanks anyway!
Hi Joel Wallenius, I'm having this same issue. I'm trying to convert htseq-count output to TPM specifically. Would you be okay with sharing your script for this?
Thank you!
Hi, do you still need help? The script would be in an old zip somewheres... might take a while to dig out!
There are dozens of answers on Biostars on how to convert raw counts to TPM, for example: Raw counts to TPM in R
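For anyone who prefers Python over R, a minimal sketch of the usual counts-to-TPM calculation follows. This is not the script mentioned earlier in the thread; `counts_to_tpm` and the example values are hypothetical, and plain feature lengths are used rather than effective lengths.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw counts for one sample to TPM.

    counts:     dict of feature id -> raw count
    lengths_bp: dict of feature id -> feature length in base pairs
    """
    # Ignore htseq-count's special counters such as __no_feature.
    genes = [g for g in counts if not g.startswith("__")]

    # Step 1: length-normalize each count (reads per base).
    rates = {g: counts[g] / lengths_bp[g] for g in genes}

    # Step 2: rescale so the values in the sample sum to one million.
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("no reads were assigned to any feature")
    return {g: r / total_rate * 1e6 for g, r in rates.items()}


# Made-up example values.
tpm = counts_to_tpm(
    {"geneA": 300, "geneB": 700, "__ambiguous": 40},
    {"geneA": 1000, "geneB": 2000},
)
print(tpm)
```

Because the length normalization happens before the per-million scaling, the TPM values of a sample always sum to one million, which is not true of FPKM.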