Hi, everyone! I have a question about why gene length affects normalisation. I understand that sequencing depth influences the read counts, so we need to remove the influence of sequencing depth. Many tutorials mention that 'longer genes are more likely to have more mapped reads'. But when we use CPM to compare the expression levels of a specific gene, we don't consider the influence of gene length.
My question is (1) how does gene length influence the number of mapped reads, and (2) is removing the influence of gene length necessary? Thank you in advance for your answers :).
Thanks for the detailed answer, I really appreciate it! I still have one question about why the number of reads can be influenced by the length of genes. I was wondering whether that influence is not very strong when comparing expression levels between two genes, since some reads from the same fragment will be merged and counted as coming from the same gene. Thanks!
The number of reads for a gene is almost exactly linearly proportional to the length of the gene. In paired-end sequencing, we generally count fragments, not reads, but each fragment has exactly 2 reads. Almost all counting methods discard reads/fragments where there aren't exactly two reads per fragment.
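As a rough illustration of the proportionality claim, here is a minimal sketch in Python. It assumes an idealised model in which every transcript copy is cut into equal-sized fragments and every fragment is sequenced; the gene names, lengths, copy numbers and the `fragment_counts` and `cpm` helpers are all hypothetical, and real library preparation is messier than this.

```python
def fragment_counts(copies, lengths, frag_len=200):
    """Fragments per gene if every transcript copy is cut into frag_len pieces.

    Idealised assumption: uniform fragmentation, every fragment sequenced.
    """
    return {g: copies[g] * (lengths[g] // frag_len) for g in copies}


def cpm(counts):
    """Counts per million: scales by library size only, not by gene length."""
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}


# Hypothetical gene lengths in bp.
lengths = {"short": 1000, "long": 3000}

# Sample A: both genes have the same number of transcript copies.
sample_a = fragment_counts({"short": 100, "long": 100}, lengths)

# Sample B: same biology, but sequenced twice as deeply.
sample_b = fragment_counts({"short": 200, "long": 200}, lengths)

# Within one sample the longer gene gets 3x the fragments at equal copy number.
ratio_long_to_short = sample_a["long"] / sample_a["short"]

cpm_a = cpm(sample_a)
cpm_b = cpm(sample_b)
```

Under these assumptions, `ratio_long_to_short` is 3 even though both genes have the same copy number, which is why comparing two different genes needs a length correction. `cpm_a` and `cpm_b` are identical, because the length factor is the same for a given gene in both samples and cancels out when the same gene is compared across samples.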
I wouldn't say it's true that the number of reads is linearly proportional to the length of the gene. That's just an assumption that people make (and an assumption inherent in normalization methods such as TPM).
I did say almost - obviously things are going to get tricky at gene ends - this is why we have the concept of effective length. It's interesting to think about how one might test this, but I have to say, I find it difficult to conceive of a generative process for reads where this wouldn't be the case.
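For reference, a minimal sketch of how effective length is commonly used in a TPM-style calculation. It assumes the widely used approximation that effective length is the transcript length minus the mean fragment length plus one; the `effective_length` and `tpm` helpers and the example numbers are hypothetical, not taken from any particular tool.

```python
def effective_length(length, mean_frag_len):
    """Number of positions where a fragment of mean_frag_len can start.

    Assumed approximation: length - mean_frag_len + 1, floored at 1.
    """
    return max(length - mean_frag_len + 1, 1)


def tpm(counts, lengths, mean_frag_len=200):
    """Transcripts per million: length-normalise first, then rescale to 1e6."""
    rates = {
        g: counts[g] / effective_length(lengths[g], mean_frag_len)
        for g in counts
    }
    total = sum(rates.values())
    return {g: r * 1e6 / total for g, r in rates.items()}


# Hypothetical example: counts chosen to be exactly proportional to
# effective length, as the linear assumption would predict for two
# equally expressed genes.
lengths = {"short": 1000, "long": 3000}
counts = {"short": 801, "long": 2801}

tpm_values = tpm(counts, lengths)
```

If counts really do scale with effective length, both genes end up with the same TPM here. Where the linear assumption fails, the same division would over- or under-correct.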
Alicia Oshlack has done some cool work on this but even the plots in her papers show that TPMs sometimes overcorrect and the relationship isn't really that straightforward (and varies by dataset). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5428526/
I worked with a student on this last summer but never got around to completing the project (might revisit it again this summer).
The current generative models impose a strong set of assumptions regarding fragmentation and priming that don't actually reflect reality.
thank you :)