Question

Count vs Read

4

8.4 years ago

mhyunjunkang ▴ 110

Hi everyone,

I have a very basic question that I still don't understand. What is the difference between a "count" and a "read" in RNA-seq data? Is it right to say that count = read? Do I understand correctly? I'm completely new to this. I look forward to your kind explanation. Thanks in advance.

Mind,

RNA-Seq • 24k views

ADD COMMENT • link updated 8.1 years ago by Biostar 20 • written 8.4 years ago by mhyunjunkang ▴ 110

score 10 · Answer 1 · 2017-01-30

10

8.4 years ago

ddiez ★ 2.0k

They are not the same. A read is the oligonucleotide that has been sequenced. Counts are the number of reads that overlap a particular genomic position or region (e.g. a gene). A read can map to multiple genomic positions, contributing to the counts in different ways. While the reads are immutable (i.e. just what you obtained from sequencing), counts depend on the counting strategy (see, for example, the introduction to this topic in this Bioconductor course).
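To illustrate how the same reads can give different counts, here is a minimal Python sketch. The gene coordinates, the alignments, and the function name `count_reads` are all made up for illustration, and the only "strategy" choice shown is whether reads that map to several positions are kept or discarded. Real counting tools apply more detailed rules.

```python
from collections import Counter

# Hypothetical gene intervals: name -> (start, end), half-open coordinates.
GENES = {"geneA": (100, 200), "geneB": (300, 400)}

# Hypothetical alignments: read id -> list of (start, end) positions it maps to.
# read3 is a multi-mapping read: it aligns to two places.
ALIGNMENTS = {
    "read1": [(110, 160)],
    "read2": [(150, 200)],
    "read3": [(120, 170), (310, 360)],
    "read4": [(500, 550)],  # overlaps no gene
}


def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def count_reads(alignments, genes, keep_multimappers):
    """Tally reads per gene; optionally discard reads with several alignments."""
    counts = Counter({gene: 0 for gene in genes})
    for positions in alignments.values():
        if len(positions) > 1 and not keep_multimappers:
            continue  # this strategy ignores multi-mapping reads entirely
        for pos in positions:
            for gene, interval in genes.items():
                if overlaps(pos, interval):
                    counts[gene] += 1
    return dict(counts)


print(count_reads(ALIGNMENTS, GENES, keep_multimappers=True))   # {'geneA': 3, 'geneB': 1}
print(count_reads(ALIGNMENTS, GENES, keep_multimappers=False))  # {'geneA': 2, 'geneB': 0}
```

The four reads never change, but geneA gets a count of 3 or 2 depending on the rule applied to the multi-mapping read.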

ADD COMMENT • link 8.4 years ago by ddiez ★ 2.0k

0

Thank you, ddiez, for a kind explanation.

Then, can I say that "count", not "read", is used to calculate RPKM (reads per kilobase per million)? Did I understand your explanation correctly?

Thanks, HJ

ADD REPLY • link 8.4 years ago by mhyunjunkang ▴ 110

0

Mmm, I'm not sure exactly what you are trying to say. Maybe this excerpt from the document I linked in the answer can help, especially the second paragraph.

The summary process tallies the number of reads aligning in each region (e.g., gene) of interest. The simplest method is to simply count reads overlapping each region, dividing by the length of the region of interest to accommodate differences in gene length. This is the ‘RPKM’ (reads per kilobase per million reads) of Mortazavi et al. [11]. One problem with this approach is that reads are not sampled uniformly across genes (Figure 1; [12]), so gene length (the ‘PK’ part of RPKM) is not a good proxy for expression level.

More fundamentally, each read represents an observation, and contributes to the certainty with which a gene is measured as ‘expressed’. A summary measure like RPKM fails to incorporate uncertainty – a particular value of RPKM might result from alignment of one or 100 reads. This contrasts with a simple count of the number of reads in the region of interest. Furthermore, count data has known statistical properties that can be exploited in downstream statistical analysis. Thus the result of summarization most useful for assessing differential expression is read count.
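The point in the excerpt that one RPKM value can come from one or 100 reads can be checked with a few lines of Python. This is only a sketch: the function name `rpkm` and all the numbers are assumptions chosen for illustration, using the usual definition of read count divided by gene length in kilobases and by total mapped reads in millions.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / kilobases / millions


# Hypothetical numbers, chosen so that both libraries give the same RPKM.
shallow = rpkm(read_count=1, gene_length_bp=1_000, total_mapped_reads=1_000_000)
deep = rpkm(read_count=100, gene_length_bp=1_000, total_mapped_reads=100_000_000)

print(shallow, deep)  # 1.0 1.0
```

Both values are 1.0, yet the second rests on 100 observations and the first on a single read. That information is visible in the read count and lost in the RPKM.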

ADD REPLY • link 8.4 years ago by ddiez ★ 2.0k

0

I appreciate your help....

But it confuses me even more. "Read count"... -.-?

So... according to your first comment, "A read is the oligonucleotide that has been sequenced", a read must be something like "AGTCGATTA....". So a "read" is not a number and cannot be used to calculate RPKM... Am I right?

And... "Counts are the number of reads"... I understand this part as follows: if two reads (e.g. "AGCTGGA" and "AGGAAGT") are mapped to Gene A, then the count for Gene A is 2. So this number, 2, is used to calculate RPKM...

Did I understand correctly?

Look forward to your advice.

Thanks, HJ

ADD REPLY • link 8.4 years ago by mhyunjunkang ▴ 110

0

So a "read" is not a number and cannot be used to calculate RPKM... Am I right?

Yes, that is correct. It is the number of reads (i.e. the read count) that is a number.

ADD REPLY • link 8.4 years ago by ddiez ★ 2.0k

1

Thank you, ddiez, for the kind explanation and for helping me clear that up. HJ

ADD REPLY • link 8.4 years ago by mhyunjunkang ▴ 110

0

To answer your question more directly, read counts are used to compute RPKM.

ADD REPLY • link 8.4 years ago by ddiez ★ 2.0k

score 2 · Answer 2 · 2017-01-31

2

8.4 years ago

Charles Plessy ★ 2.9k

In addition to ddiez's answer, some RNA-seq methods use unique molecular identifiers, and in this context a "count" is sometimes a shorthand for "molecule count".
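As a rough illustration of how a molecule count can differ from a read count, here is a small Python sketch. The data and the function name `read_and_molecule_counts` are hypothetical, and it assumes the simplest possible rule: reads from the same gene that carry the same unique molecular identifier (UMI) come from one original molecule. Real UMI pipelines also have to deal with sequencing errors in the UMI and with mapping positions, which this ignores.

```python
from collections import defaultdict

# Hypothetical reads already assigned to a gene, each tagged with its UMI.
READS = [
    ("geneA", "AACG"),
    ("geneA", "AACG"),  # same gene and UMI: treated as a copy of one molecule
    ("geneA", "TTGC"),
    ("geneB", "AACG"),
    ("geneB", "AACG"),
    ("geneB", "AACG"),
]


def read_and_molecule_counts(reads):
    """Return (reads per gene, distinct UMIs per gene) for (gene, umi) pairs."""
    read_counts = defaultdict(int)
    umis_seen = defaultdict(set)
    for gene, umi in reads:
        read_counts[gene] += 1
        umis_seen[gene].add(umi)
    molecule_counts = {gene: len(umis) for gene, umis in umis_seen.items()}
    return dict(read_counts), molecule_counts


read_counts, molecule_counts = read_and_molecule_counts(READS)
print(read_counts)      # {'geneA': 3, 'geneB': 3}
print(molecule_counts)  # {'geneA': 2, 'geneB': 1}
```

Both genes have three reads, but under this rule geneA corresponds to two molecules and geneB to one, so it matters which kind of "count" a table reports.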

ADD COMMENT • link 8.4 years ago by Charles Plessy ★ 2.9k

0

Thank you for the good information... Terminology always confuses me... Probably it's only me... HJ

ADD REPLY • link 8.4 years ago by mhyunjunkang ▴ 110

0

Don't worry about it, and keep reading and learning. I still remember being very confused about many terms related to NGS when I first started working on it (and I am still confused about many things...).

ADD REPLY • link 8.4 years ago by ddiez ★ 2.0k