Question

How does gene length affect the number of reads mapped

0

14 days ago

Chen • 0

Hi, everyone! I have a question about why gene length affects normalisation. I understand that sequencing depth influences read counts, so we need to remove its influence. Many tutorials mention that 'longer genes are more likely to have more mapped reads'. But when we use CPM to compare the expression level of a specific gene, we don't consider the influence of gene length.
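To make the CPM point concrete, here is a minimal sketch of the calculation. The gene names and counts are made up for illustration. Gene length never enters the formula, so CPM only removes the effect of library size.

```python
def cpm(counts):
    """Counts per million: scale each gene's count by the library size."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}

# Hypothetical counts; sample_b is the same library sequenced twice as deeply.
sample_a = {"geneA": 100, "geneB": 400, "geneC": 500}
sample_b = {"geneA": 200, "geneB": 800, "geneC": 1000}

cpm_a = cpm(sample_a)
cpm_b = cpm(sample_b)

for gene in sample_a:
    print(gene, cpm_a[gene], cpm_b[gene])
```

When the same gene is compared between samples, its length is the same in both, so any length effect is a constant factor on each side of the comparison. That is the usual argument for leaving length out when comparing one gene across samples.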

My questions are (1) how does gene length influence the number of mapped reads, and (2) is it necessary to remove the influence of gene length? Thank you in advance for your answers :).

RNA-seq CPM sequencing • 406 views

ADD COMMENT • link 3 days ago by Chen • 0

score 0 · Answer 1 · 2024-05-02

0

14 days ago

i.sudbery 19k

See my answer about the whats, hows and whens of RNA-seq normalisation here: Why employ normalization methods, and how can they be utilized in DEG analysis?

ADD COMMENT • link 14 days ago by i.sudbery 19k

0

Thanks for the detailed answer; I really appreciate it! I still have one question about why the number of reads is influenced by gene length. I was wondering whether that influence is actually weak when comparing expression levels between two genes, since reads from the same fragment will be merged and counted once for the same gene. Thanks!

ADD REPLY • link 14 days ago by Chen • 0

1

For a given expression level, the number of reads for a gene is almost exactly linearly proportional to the length of the gene. In paired-end sequencing, we generally count fragments, not reads, but each fragment has exactly 2 reads. Almost all counting methods discard reads/fragments where there aren't exactly two reads per fragment.
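A toy simulation of the idealised model behind this claim, as a sketch only. It assumes every base of every transcript molecule is equally likely to give rise to a fragment, and the gene names, lengths and copy numbers are invented. It illustrates the assumption rather than showing that real libraries behave this way.

```python
import bisect
import random

# Hypothetical genes: (length in bases, number of transcript copies).
# Both genes are equally expressed; only their lengths differ.
genes = {"short": (1_000, 50), "long": (4_000, 50)}

def simulate_counts(genes, n_fragments, seed=0):
    """Assign each fragment to a gene by picking a base uniformly at random
    from all transcript molecules in the pool (an idealised assumption)."""
    rng = random.Random(seed)
    names = list(genes)
    cumulative = []  # running total of bases contributed by each gene
    total = 0
    for name in names:
        length, copies = genes[name]
        total += length * copies
        cumulative.append(total)
    counts = dict.fromkeys(names, 0)
    for _ in range(n_fragments):
        pos = rng.randrange(total)
        counts[names[bisect.bisect_right(cumulative, pos)]] += 1
    return counts

sim_counts = simulate_counts(genes, 100_000)
ratio = sim_counts["long"] / sim_counts["short"]
print(sim_counts, round(ratio, 2))
```

Under this model the gene that is four times longer collects about four times as many fragments at equal copy number, and counting fragments instead of reads only changes the unit, not the ratio.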

ADD REPLY • link 14 days ago by i.sudbery 19k

1

I wouldn't say it's true that the number of reads is linearly proportional to the length of the gene. That's just an assumption that people make (and an assumption inherent in normalization methods such as TPM).
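For reference, a minimal sketch of where that assumption enters TPM. The counts and lengths are hypothetical, and plain annotated length is used here. Dividing each count by the gene's length is only a valid correction if counts really do scale linearly with length.

```python
def tpm(counts, lengths):
    """Transcripts per million: length-normalise first, then rescale to 1e6."""
    rate = {g: counts[g] / lengths[g] for g in counts}  # reads per base
    scale = sum(rate.values())
    return {g: r * 1e6 / scale for g, r in rate.items()}

# Hypothetical data: the long gene has 4x the reads and 4x the length.
raw_counts = {"short": 200, "long": 800}
gene_lengths = {"short": 1_000, "long": 4_000}

tpm_values = tpm(raw_counts, gene_lengths)
print(tpm_values)
```

Here both genes end up with the same TPM because the division by length exactly cancels the fourfold difference in counts. If the true relationship is not linear, the same division would over- or under-correct.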

ADD REPLY • link 13 days ago by dsull ★ 6.0k

1

I did say almost - obviously things are going to get tricky at gene ends - this is why we have the concept of effective length. It's interesting to think about how one might test this, but I have to say, I find it difficult to conceive of a generative process for reads where this wouldn't be the case.
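A small sketch of the effective length idea, assuming a single fixed fragment length. Quantification tools typically average over a fragment length distribution instead, and the numbers here are invented. A fragment of length f can only start at length - f + 1 positions in a transcript, so the usable length is shorter than the annotated one.

```python
def effective_length(length, fragment_length):
    """Number of positions where a fragment of fixed length can start."""
    return max(length - fragment_length + 1, 0)

def count_start_positions(length, fragment_length):
    """Brute-force check: enumerate every start that fits in the transcript."""
    return sum(1 for start in range(length)
               if start + fragment_length <= length)

fragment = 200  # hypothetical fixed fragment length
for gene_length in (1_000, 4_000):
    eff = effective_length(gene_length, fragment)
    assert eff == count_start_positions(gene_length, fragment)
    print(gene_length, eff)
```

With these numbers the effective lengths are 801 and 3801, a ratio of roughly 4.7 rather than 4, so the end effect matters most for transcripts that are short relative to the fragment length.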

ADD REPLY • link 13 days ago by i.sudbery 19k

1

Alicia Oshlack has done some cool work on this, but even the plots in her papers show that TPMs sometimes overcorrect and that the relationship isn't really that straightforward (and varies by dataset). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5428526/

I worked with a student on this last summer but never got around to completing the project (might revisit it again this summer).

The current generative models impose a strong set of assumptions regarding fragmentation and priming that don't actually reflect reality.