Hello all,
I understand the differences between RPKM, which normalizes for read depth and then gene length; FPKM, which is the same as RPKM but does not count reads twice if both came from the same fragment; and TPM, which normalizes for gene length and then for sequencing depth, allowing comparison of read-count proportions between samples.
However, I have not come across any literature on whether the same principle of not counting reads twice from the same fragment (as in FPKM) has been applied to the TPM method. I don't doubt that this is done; I would just like to know how.
Thanks
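(For reference, writing the definitions out in my own notation, which the thread itself doesn't fix: let r_i be the reads, and f_i the fragments, mapped to gene i of length L_i, and let N = sum_j r_j be the total mapped reads. Then

$$\mathrm{RPKM}_i = \frac{r_i}{(N/10^6)\,(L_i/10^3)} = \frac{10^9\, r_i}{N\, L_i}, \qquad \mathrm{TPM}_i = 10^6 \cdot \frac{r_i/L_i}{\sum_j r_j/L_j},$$

and FPKM is the RPKM formula with f_i and sum_j f_j in place of r_i and N.)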
Is it not the case that, because the number of reads in a paired-end library is twice the number of fragments, and because r_i appears in both the numerator and the denominator, the calculated TPM will be the same whether you use fragments or reads?
If every paired-end read pair either maps completely (i.e. concordantly) or not at all, then yes, this is the case, and it is the intended behavior. The notion of a read or read pair is really just secondary to the quantity we wish to count: cDNA fragments originating from the reverse-transcribed RNA. In a paired-end experiment, a read pair is simply the result of (effectively) sequencing the opposite ends of the cDNA fragment, and so a pair of reads is really just evidence of a single fragment originating from where that read pair maps. Likewise, in a single-end experiment, you only get a single read from this fragment, and so the read is evidence of the fragment. One of the benefits of paired-end sequencing is that you know the read pair should map "together", and so there are often fewer transcripts compatible with a read pair than with a single read. A main point of TPM (and related measures) is to abstract beyond the purely technical notion of reads (or read pairs) and to infer the primary quantity of interest: the abundance of the underlying transcripts or, equivalently, the rate at which we expect to sample from them.
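(In the same ad-hoc notation as above: if every fragment contributes exactly two mapped reads, then r_i = 2 f_i, and the factor of two cancels from the TPM,

$$\mathrm{TPM}_i = 10^6 \cdot \frac{2f_i/L_i}{\sum_j 2f_j/L_j} = 10^6 \cdot \frac{f_i/L_i}{\sum_j f_j/L_j},$$

so the value is identical whether computed from reads or fragments, exactly as described. The assumption only breaks down when some pairs map discordantly or one mate fails to map.)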
Hey Rob, thanks for the detailed response to this question. Still, I think I might be missing something about the average fragment length and its relation to effective transcript length.
So, if you have the time, I was wondering if you could answer a question about the importance of including the 'estimated fragment length' in the calculation. Under the assumption that you are only comparing samples within an experiment (and that your goal was to have an average fragment length of size X for all samples), is including that value really essential? (Or is it that fragment length varies rather significantly from sample to sample and is therefore included?)
Also, for the effective length of a transcript, should you really be subtracting the read length? (As seen in this script: https://www.biostars.org/p/171766/#171837.) I could understand it if you count 'reads overlapping a transcript' in an 'intersection_strict' manner (http://www-huber.embl.de/users/anders/HTSeq/doc/_images/count_modes.png), in which case it is very clear that the effective transcript length is more accurate. But if, for instance, you count overlaps in a 'union' manner, wouldn't this change the underlying assumption (since at the edge of the transcript annotation you could have a read with a single-nt overlap)?
I noticed you didn't mention it in your previous post concerning TPM: A: Calculating TPM from featureCounts output
Hi,
Let me try to answer your questions. First, consider the relation of the fragment length distribution (and thus the average fragment length) to the effective transcript length.
The motivation for effective length correction is that not all positions on a transcript are equally "sequenceable". Forget, for a moment, about the technicalities of union, intersection, etc. (these notions don't exist in the context of transcript quantification). Instead, think about drawing sequences from an actual transcript.

Imagine I have a transcript of length 1000. Starting from, say, the 100th position, I could draw a fragment of length 100, 200, 300, and so on, all the way up through 900, because if I start the fragment at position 100, there are 900 sequenceable bases downstream of the fragment start site. Now, 900 is likely larger than almost all of the actual fragments in my library; in other words, the length of the fragment I could draw starting at this position is effectively unconstrained. If we were to add up the probabilities, under our empirical fragment length distribution, of all sequenceable fragment lengths that we could draw starting at position 100, this sum would be very close to 1. Now, imagine instead that a sequenced fragment from the same transcript starts at position 950. What fragment lengths could be sequenced? Any fragment of length greater than 50 is not possible, since, starting at position 950, there are only 50 sequenceable bases remaining in the transcript. Thus, if we add up the probabilities, under our empirical fragment length distribution, of all sequenceable fragment lengths starting at position 950, the sum is very small (the empirical probability of seeing a fragment of length 1 or 2 or 3 or ... or 50).

So, given the fragment length distribution, we know that not all positions on a transcript are equally "sequenceable", and this is exactly what the effective length accounts for. The effective length of the transcript is simply the sum of these "sequenceable" probabilities over every possible starting position in the transcript; the per-position probability is small near the ends of the transcript and essentially 1 away from them. This value (the effective length) is usually very well approximated by taking the transcript's original length minus the average fragment length (as determined by the empirical fragment length distribution).
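To make that sum concrete, here is a minimal Python sketch of the computation (my own illustration, not code from any particular tool; `fld[f]` stands for the empirical probability of drawing a fragment of length f):

```python
import numpy as np

def effective_length(tx_len, fld):
    """Effective length of a transcript of length tx_len, given an
    empirical fragment length distribution fld, where fld[f] is the
    probability of drawing a fragment of length f.  A fragment starting
    at position s (1-based) has tx_len - s + 1 sequenceable bases
    downstream, so we sum, over all start positions, the probability
    that a fragment still fits."""
    fld = np.asarray(fld, dtype=float)
    fld = fld / fld.sum()                          # normalize to a proper distribution
    cdf = np.cumsum(fld)                           # cdf[f] = P(fragment length <= f)
    room = tx_len - np.arange(1, tx_len + 1) + 1   # bases downstream of each start
    room = np.minimum(room, len(fld) - 1)          # beyond the FLD support, P(fit) ~ 1
    return cdf[room].sum()

# Example: fragment lengths roughly Normal(200, 20), transcript of length 1000.
lengths = np.arange(401)
probs = np.exp(-0.5 * ((lengths - 200) / 20) ** 2)
print(effective_length(1000, probs))   # ~801, close to 1000 - 200 (length minus mean)
```

The per-position probabilities are ~1 away from the end and drop off over roughly the last mean-fragment-length bases, which is why the "length minus average fragment length" approximation works so well.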
So, regarding your second question: the script in the post you reference seems to subtract the average fragment length from the feature length, which agrees with the approximation I mention above. However, one potential issue with that script, and certainly something that may add to confusion, is that this manner of effective length correction doesn't really make sense in the "gene counting" model. This is because the effective length is a property of the transcript (i.e. the target from which the fragments are actually being drawn). When you do gene-level analysis by simply counting the reads that hit a gene model, you don't actually know the length of the transcript that is generating each fragment. Crucially, for the purposes of computing a relative abundance metric like TPM, subtracting the mean fragment length from the "gene model" is not equivalent to subtracting the mean fragment length from each of the constituent transcripts of that gene, which may generate fragments at different rates. This is why tools like RSEM and tximport derive an average gene length by taking an abundance-weighted average of transcript lengths. In order for the effective length correction to make sense, you have to account for the feature from which you're actually sampling the fragments, and the union, intersection, or union-intersection gene models do not constitute such features. Let me know if this clarifies what I've written above.
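To illustrate that last point, here is a sketch of an abundance-weighted average length, in the spirit of what RSEM and tximport compute (my own simplification, not their actual code):

```python
def weighted_gene_length(tx_tpm, tx_efflen):
    """Abundance-weighted average of the effective lengths of one gene's
    transcripts: isoforms that generate more fragments contribute more
    to the gene's length."""
    total = sum(tx_tpm)
    if total == 0:
        # gene not expressed: fall back to a simple unweighted average
        return sum(tx_efflen) / len(tx_efflen)
    return sum(t * l for t, l in zip(tx_tpm, tx_efflen)) / total

# Two isoforms of the same gene: the short, highly expressed one dominates,
# so the gene's length is far from the unweighted mean of 1250.
print(weighted_gene_length([90.0, 10.0], [500.0, 2000.0]))   # 650.0
```

Subtracting the mean fragment length from a fixed union gene model could never produce this behavior, because the result would not change as the isoform mix changes between samples.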