Question

How much of the genome is covered with RNA-sequencing?

1

4.2 years ago

chenl ▴ 10

When performing high-throughput RNA sequencing (with human samples), how many of the 3 billion base pairs of DNA will be covered after alignment and quality control?

In other words - how much of the genome can theoretically be reproduced from the RNA-seq?

I'm looking for even just a rough estimate, but if it helps, the samples that I'm interested in are from human brains, expressing ~16,000 genes, of which ~13,000 are protein coding.

I couldn't find an answer by googling, and would appreciate any help.

RNA-Seq alignment • 1.4k views

ADD COMMENT • link updated 4.2 years ago by Istvan Albert 102k • written 4.2 years ago by chenl ▴ 10

1

It is going to depend on the quality of your libraries and the method used for making them. Since you are going to enrich/capture non-rRNA transcripts, what gets captured/sampled in your library is fixed. In theory, all such transcripts present in your sample have a chance of being captured in the library.

ADD REPLY • link 4.2 years ago by GenoMax 147k

0

Thank you for your answer! This is actually not my data, I'm just using it to do some calculations. According to the article from which it is taken, they used Illumina Stranded Total RNA Prep with Ribo-Zero Plus for the library, on cortical samples. Does this help? Or, is there a way for this to be calculated maybe?

ADD REPLY • link 4.2 years ago by chenl ▴ 10

2

4.2 years ago

lieven.sterck 15k

This actually comes down to the fraction of the genome that is transcribed. For human, the ENCODE project is a good place to start to get a number on this.

The total size of all regions that get transcribed from a genome is the upper limit. For biological reasons, not all potential transcripts are present in the cell at any given moment, so what you will get from an RNAseq alignment is usually lower than that (roughly 60-70% of it).

There are even more factors in play: for instance, rRNA is typically depleted from an RNAseq library, so that is a fraction of the genome that is transcribed and present in your sample but barely visible, if at all, in the RNAseq data.

Why do you want to know actually?

ADD COMMENT • link 4.2 years ago by lieven.sterck 15k

0

Thanks for your reply!

Just to make sure I understand you correctly - assuming about 90% of the genome gets transcribed, and the transcribed part of the genome from a single tissue (i.e. cortex in this case) will be ~60% of that - so ~54% of the genome, or about 1.62 billion bp?

And after that, how much of it would you expect to be captured?

The reason I want to know this is for a sort of enrichment analysis: some of the positions in this "population" of transcribed bp have been found to have a role in splicing, and I'm trying to figure out the population size.
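The back-of-the-envelope arithmetic above can be written out explicitly. This is only a sketch: the 3 billion bp genome size comes from the question, the 90% and 60% fractions are the assumptions being discussed in this thread rather than measured values, and the function name `estimate_bp` is made up for illustration.

```shell
# Rough size of the "transcribed in one tissue" population, in bp.
# All three inputs are assumptions taken from the discussion, not measurements.
estimate_bp() {
  # $1 = genome size in bp
  # $2 = fraction of the genome that is ever transcribed
  # $3 = fraction of that which is transcribed in the tissue of interest
  awk -v g="$1" -v t="$2" -v s="$3" 'BEGIN { printf "%.0f\n", g * t * s }'
}

estimate_bp 3000000000 0.9 0.6   # prints 1620000000
```

Changing the second and third arguments shows how strongly the result depends on the assumed fractions.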

ADD REPLY • link 4.2 years ago by chenl ▴ 10

0

In theory yes, but the 90% is likely quite an over-estimate.

Perhaps my answer was a bit short, but then again, this is quite a debated topic and no real fixed numbers are available. My own guesstimate would be that 10-20% of the genome might get transcribed into functional things (i.e. things you might pick up in an RNAseq analysis).

I'm far from a statistician, but for some sort of enrichment analysis I would compare a 'positive' to a 'negative/neutral' sample and thus not work with data from a single sample (but I could be wrong here).

ADD REPLY • link 4.2 years ago by lieven.sterck 15k

0

You are, of course, right. I am actually using many samples for the analysis - and was mostly looking to get a feeling for the magnitude of the population that should be expected.

ADD REPLY • link 4.2 years ago by chenl ▴ 10

0

Keep in mind that life has no rules... in the cell, interactions between transcription factors, enhancers, promoters, TSS, and DNA are not judged by the letters ATGC - they are judged by electrochemical and electromagnetic interactions in the context of the 3-dimensional chromatin structure. Molecules that can promote transcription are undoubtedly binding virtually 'everywhere' they possibly can where there is an attraction, but binding only becomes sufficiently strong at certain loci such that a sustained transcription of an entire gene can occur.

Brain tissue has very specific transcriptional profiles, so the figure of 7.5% is likely to be different in other tissues.

ADD REPLY • link 4.2 years ago by Kevin Blighe 88k

0

assuming about 90% of the genome gets transcribed, and the transcribed part of the genome from a single tissue (i.e. cortex in this case) will be ~60% of that - so ~54% of the genome, or about 1.62 billion bp?

Not sure you can extrapolate that way. As I said before, there are several limiting steps. Your sample is a time slice (whatever happened to be expressed at that time point), which is the first limit. The efficiency of library preparation and what got captured in the library (it is not feasible to convert 100% of the RNA you have into a library) is the second limit. The depth of sequencing used to sample the library (cost and diminishing-returns considerations) would be the third limit. So you have losses happening at each step.

You could do an approximate calculation taking a standard "known" transcriptome and working backwards from your data to see what % you were able to recover.
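One way to sketch the "working backwards" calculation suggested above: sum the length of a reference annotation, count how many of those bases have at least one aligned read, and divide. The helper names and the file names (`exons.merged.bed`, `data.bam`) are hypothetical, and the BED intervals are assumed to be merged and non-overlapping so that no base is counted twice.

```shell
# Total annotated length from a BED stream (0-based start, end-exclusive).
# Assumes intervals are already merged, so no base is counted twice.
bed_total_bp() {
  awk '{ total += $3 - $2 } END { printf "%.0f\n", total }'
}

# Number of positions with depth > 0 in "samtools depth" style output
# (columns: chromosome, position, depth).
covered_bp() {
  awk '$3 > 0 { n += 1 } END { printf "%.0f\n", n }'
}

# Fraction of the annotation that was recovered; prints NA if the total is zero.
recovered_fraction() {
  awk -v c="$1" -v t="$2" \
    'BEGIN { if (t > 0) printf "%.4f\n", c / t; else print "NA" }'
}

# Intended use (hypothetical file names; not run here):
#   total=$(bed_total_bp < exons.merged.bed)
#   covered=$(samtools depth -b exons.merged.bed data.bam | covered_bp)
#   recovered_fraction "$covered" "$total"
```

The result is the fraction of the chosen annotation that the data recovered, so it depends on which "known" transcriptome is used as the denominator.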

ADD REPLY • link 4.2 years ago by GenoMax 147k

score 3 · Accepted Answer · 2020-09-03

3

4.2 years ago

Istvan Albert 102k

I have looked at a high-quality brain sample that I had worked on.

samtools depth -a data.bam | awk ' $3>0 { count += 1  } END { print (count/NR) }'

It appears that the genome is covered at about 7.5% rate (command took 20 minutes to compute that)

It is true that a larger fraction of the genome may be observed to be transcribed into RNA, but those transcripts would not be picked up by a typical RNA-Seq experiment.
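A slightly more defensive variant of the one-liner above, as a sketch: it avoids dividing by zero on empty input and lets you require a minimum depth before a position counts as covered. The function name `coverage_summary` and its minimum-depth argument are illustrative and not part of samtools.

```shell
# Reads "samtools depth -a" style lines (chromosome, position, depth) on stdin
# and prints: covered positions, total positions, covered fraction.
coverage_summary() {
  min_depth="${1:-1}"   # a position needs at least this many reads to count
  awk -v min="$min_depth" '
    $3 >= min { covered += 1 }
    END {
      if (NR == 0) { print "0 0 NA"; exit }
      printf "%.0f %.0f %.4f\n", covered, NR, covered / NR
    }'
}

# Intended use (data.bam as in the command above; not run here):
#   samtools depth -a data.bam | coverage_summary 1
```

Raising the threshold (for example `coverage_summary 5`) gives a feel for how much of the covered fraction rests on only a handful of reads.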

ADD COMMENT • link 4.2 years ago by Istvan Albert 102k

0

Can you provide some additional information about how many reads are aligned here? What is the length of the reads? How are the multi-mappers treated (allowed to multi-map or placed at one random location)?

I have seen coverage figures of between 10% and 20% mentioned for single-cell RNAseq, but this appears to be even less than that. Interesting.

ADD REPLY • link 4.2 years ago by GenoMax 147k

0

21 million alignments, 150 bp reads aligned with TopHat2 (the results are quite a few years old). I think it places multi-mapped reads at random.

This data is my go-to data for checking various RNA-Seq expectations, since it is amazingly consistent across all replicates. Even the fragmentation in the UTRs etc. is identical. (It is the first in the tracks below.)

[Image of the tracks not shown]
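As a rough sanity check on these numbers (a sketch, assuming every alignment spans the full 150 bp and using the 3 billion bp genome size from the question): 21 million alignments amount to about 3.15 billion aligned bases, roughly one genome's worth, yet they land on only about 7.5% of positions, so the covered positions are on average about 14 reads deep. The function name is made up for illustration.

```shell
# Mean depth over the covered part of the genome, from summary numbers only.
# Assumes each alignment contributes its full read length (ignores clipping
# and reads split across introns).
mean_depth_over_covered() {
  # $1 = number of alignments, $2 = read length (bp),
  # $3 = genome size (bp),     $4 = fraction of the genome covered
  awk -v n="$1" -v len="$2" -v g="$3" -v f="$4" 'BEGIN {
    aligned = n * len    # total aligned bases
    covered = g * f      # positions with at least one read
    printf "%.0f %.0f %.1f\n", aligned, covered, aligned / covered
  }'
}

mean_depth_over_covered 21000000 150 3000000000 0.075
# prints: 3150000000 225000000 14.0
```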

ADD REPLY • link 4.2 years ago by Istvan Albert 102k

0

It appears that the genome is covered at about 7.5% rate (command took 20 minutes to compute that)

In general, only 5-15% of the genome is actually transcribed in a (mature) tissue. I don't remember the reference, but I recall that range from working on the Illumina Body Map 2 data.

ADD REPLY • link 4.2 years ago by JC 13k

0

If you remember, can you please expand on the meaning of "mature tissue"? Is the distinction between pre- and post-natal tissue, or before and after reaching final size (aka adulthood), or something else? This might be extra important when considering the brain.

ADD REPLY • link 4.2 years ago by chenl ▴ 10

0

The transcriptional 'programme' will differ depending on the cell cycle, tissue, and stage of development. In the brain, for example, we would make a distinction between mature and other astrocytes.