Question

What Metrics Are Best To Describe The "Coverage" Of Rna-Seq Data?

21

Entering edit mode

14.0 years ago

Bio_X2Y ★ 4.4k

As far as I know, DNA sequencing coverage is simply calculated as (read count x read length / genome size). This means, for example, that an experiment might be described as "40x coverage".

For RNA-seq, my initial instinct would have been to calculate a "coverage" figure in much the same way, but dividing by the number of annotated bases rather than full genome length. When I think about it, however, this figure seems fairly meaningless for RNA-seq. From what I understand, DNA sequencing coverage is fairly uniform across the genome (telomeres and repeat regions excepted), so the simple coverage figure is informative - you can be reasonably sure that a typical base in a "40x coverage" genome will be represented by about 40 reads.

For RNA-seq, however, the variation in transcript levels surely makes this figure meaningless? A coverage of "10x" would tell you next to nothing about the number of reads supporting a typical base.

So, my question is - is there any other metric that people would use to convey the strength/coverage of their data set? Would "read count" still be the best available? Has anybody used anything like the read count associated with a housekeeping gene(s) to compare coverage?

Thanks.

next-gen sequencing rna rpkm data • 31k views

ADD COMMENT • link updated 8.7 years ago by Biostar 20 • written 14.0 years ago by Bio_X2Y ★ 4.4k

score 14 · Answer 1 · 2010-11-12

Thanks for this question.

You are correct that the intuitive interpretation of "coverage" for RNA-seq data is problematic. The reason for this is imo: We do not know the size/length of the transcriptome. This is essential when comparing e.g. two different samples by RNA-seq. In theory this means that two samples are directly comparable with respect to coverage, if and only if the combined length of all transcripts (molecules concatenated) in both samples is equal. Otherwise, this needs to be normalized. Thus, in this context I prefer to talk about "pseudo-coverage" rather. This is especially true when comparing on the gene level.

I recommend the study of Bullard et al. [1] where they compared different approaches to normalization (e.g. RPKM, upper quartile normalization, quantile normalization, house-keeping gene normalization). Short conclusion, they prefer upper quartile normalization over RPKM.

One should maybe also re-phrase the question bit to something like: "What is the the best way to accurately model RNA-seq data and its variability?" The consensus (really?) seems to be that variability in technical replicates can be modelled by a Poisson distribution, while biological samples show overdispersion and can be modelled by a negative binomial distribution, both Rpackages on RNA-seq (edgeR, DEseq) take this approach with some differences in parameter estimation and testing.

I have collected some papers for a seminar in a Mendeley group: http://www.mendeley.com/groups/617481/rna-seq/

You can find my slides here: http://www.ii.uib.no/~inge/inf389-2010/MD_RNASeqIntro.pdf Feedback is welcome...

score 4 · Answer 2 · 2010-11-12

4

Entering edit mode

14.0 years ago

Gww ★ 2.7k

You can't really use a metric like that to get a general idea of your coverage, because the level of expression of each of the genes can vary by sample.

Perhaps you could calculate the RPKM (Reads per KB per million reads) for each gene and plot their distribution? or for the distribution for a subset of all the genes.

ADD COMMENT • link 14.0 years ago by Gww ★ 2.7k

2

Entering edit mode

In addition, an organism with differentiated cell types would not be expected to be expressing all genes in all tissues, so your expected sequence-space would shift by tissue. Also, there is a likely to be a sampling bias that varies with the laboratory protocol e.g. the 3' bias observed with polyA-selected sequences

ADD REPLY • link 14.0 years ago by biobot 0.0.77.a.1099 6.2k

1

Entering edit mode

I think it would be really difficult to come up with a single number. There are all kinds of sources of experimental bias that could occur between samples, even derived from the same genome. For example, maybe one sample had significantly less rRNA contamination. You'd expect that this sample would have higher coverage everything else than the same sample with more rRNA contamination.

ADD REPLY • link 14.0 years ago by Gww ★ 2.7k

1

Entering edit mode

Nope, if you check the text of the paper I linked they describe it as "Reads per kilobase of exon model per million mapped reads (RPKM)"

ADD REPLY • link 14.0 years ago by Gww ★ 2.7k

0

Entering edit mode

Thanks GWW. At the moment, that's basically what we're going internally. I was wondering there is a single "number" that could summerize the strength of an experiment in some way. E.g. if I heard that another group had sequenced the same genome as us, I'd be immediately interested in knowing which experiment has better coverage. At the moment, I'd probably just go with "read count" (assuming same sequencing platform, etc.) to get a feel for how they stack up to each other...

ADD REPLY • link 14.0 years ago by Bio_X2Y ★ 4.4k

0

Entering edit mode

I though RPKM means "reads per kilobase mapped" not "reads per kb per million".

ADD REPLY • link 14.0 years ago by Daniel Standage 4.1k

0

Entering edit mode

If you have a normalized library (where the coverage of each transcript in the sample is expected to be uniform) I think the coverage calculated as in genome experiments could work and could be useful to compare the depth of the sequencing process between different experiments.

ADD REPLY • link 13.9 years ago by Marina Manrique ★ 1.3k

score 4 · Answer 3 · 2010-11-12

Although there is some theory for the minimum coverage necessary for DNA assembly, in practice DNA sampling is far from uniform, and in practice, coverage doesn't really say much about the quality of the resulting genome assembly. The closest thing to genome coverage for RNA would be size of reads divided by size of transcriptome, but as you say, the distribution is more skewed for RNA than for genome, making it even more useless.

What exactly is it that you wish to measure? Do you wish to measure the RNA diversity in your sample, or do you wish to measure the effectiveness of sample normalization? From your description, it sounds like the latter. The distribution of transcripts tend to follow an approximate power law (Pareto), so you could perhaps fit the distribution to that, and measure the slope (which will be steeper in the non-normalized case)

score 2 · Answer 4 · 2011-01-13

Hi Jeremy Leipzig,

As almost everyone says RPKM is the most used metric, also to describe gene expression level in RNA-seq experiments.

However I would like to point out that if you have a normalized library where the coverage is expected to be uniform in your transcripts, the calculation of the coverage as done in genome experiments (as you said dividing by the number of annotated bases) could be correct.

I think the key point is if the coverage is expected to be uniform in your data (as it's supposed to be in normalized libraries) or not,

score 1 · Answer 5 · 2010-11-12

1

Entering edit mode

14.0 years ago

Larry_Parnell 16k

AS GWW writes, RPKM is the metric I most often see. One does need to consider heterozygous alleles such that one may see 100 reads across a given exon of a gene, 60 of which are A at a certain polymorphic base and another 40 which may be C at that position.

The coverage parameter must be calculated on a per exon basis - because of alternate splicing.

ADD COMMENT • link 14.0 years ago by Larry_Parnell 16k

1

Entering edit mode

Two different isoforms of an mRNA can still use the same exon(s) in one part of the message and alternate exons in a different part. For example, alternate exon 1, but exons 2 through 5 are all identical. Thus, your reads for exons 2 through 5 tell you nothing about which isoform was actually expressed and detected by sequencing.

ADD REPLY • link 13.9 years ago by Larry_Parnell 16k

0

Entering edit mode

About calculating coverage parameter per exon. Why wouldn't you calculate based on transcripts? Transcripts should represent isoforms so alternate splicing should be considered. Anyway I'd like to have both coverage values, I think exon based coverage could be useful to check the transcript assembly

ADD REPLY • link 13.9 years ago by Marina Manrique ★ 1.3k

score 0 · Answer 6 · 2010-11-12

For a novel genome coverage described something along the lines: how many bases/putative exons does your RNA-Seq data cover at least 2x in total makes sense (subtract repetitive and ribosomal sequences). "Good" data set will let you predict more transcripts.

For expression studies you can try to plot two numbers: total length of genomic bases mapped at some threshold coverage (2x) -> see above total length of RNA-Seq bases successfully mapped

Things to watch are unspliced mRNAs, which will badly mess up these numbers.