I have a very basic question, but I still don't understand.
What is the difference between "count" and "read" in RNA-seq data?
If I say count = read, is that right? Do I understand correctly?
I'm completely new at this.
I look forward to your kind explanation.
Thanks in advance.
They are not the same. A read is the oligonucleotide that has been sequenced. Counts are the number of reads that overlap a particular genomic position or feature (e.g. a gene). A read can map to multiple genomic positions, contributing to the counts in different ways. While the reads are immutable (i.e. just what you obtained from sequencing), counts depend on the counting strategy (see, for example, the introduction to this topic in this Bioconductor course).
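To make the point about counting strategy concrete, here is a minimal sketch with made-up data. The gene coordinates, the read coordinates, and the `count_reads` helper are all hypothetical and only for illustration; real tools offer more refined rules. The ambiguity shown here is a read that overlaps two annotated genes, which is a simple stand-in for the multi-mapping case mentioned above.

```python
# Toy example: the same reads give different counts under different strategies.
# All coordinates are invented; intervals are (chromosome, start, end), half-open.
genes = {
    "GeneA": ("chr1", 100, 500),
    "GeneB": ("chr1", 450, 900),
}
reads = [
    ("chr1", 120, 170),  # overlaps GeneA only
    ("chr1", 300, 350),  # overlaps GeneA only
    ("chr1", 460, 510),  # overlaps both GeneA and GeneB (ambiguous)
    ("chr1", 700, 750),  # overlaps GeneB only
]


def overlaps(a, b):
    """Return True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_reads(genes, reads, skip_ambiguous=False):
    """Count reads per gene; optionally discard reads that hit more than one gene."""
    counts = {name: 0 for name in genes}
    for read in reads:
        hits = [name for name, region in genes.items() if overlaps(read, region)]
        if skip_ambiguous and len(hits) > 1:
            continue  # strict strategy: ambiguous reads are not counted at all
        for name in hits:
            counts[name] += 1
    return counts


permissive = count_reads(genes, reads)
strict = count_reads(genes, reads, skip_ambiguous=True)
print(permissive)  # {'GeneA': 3, 'GeneB': 2}
print(strict)      # {'GeneA': 2, 'GeneB': 1}
```

The four reads never change, but the count for each gene does, depending on how the ambiguous read is handled.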
The summary process tallies the number of reads aligning in each region (e.g., gene) of interest. The simplest method is to simply count reads overlapping each region, dividing by the length of the region of interest to accommodate differences in gene length. This is the ‘RPKM’ (reads per kilobase per million reads) of Mortazavi et al. [11]. One problem with this approach is that reads are not sampled uniformly across genes (Figure 1; [12]), so gene length (the ‘PK’ part of RPKM) is not a good proxy for expression level. More fundamentally, each read represents an observation, and contributes to the certainty with which a gene is measured as ‘expressed’. A summary measure like RPKM fails to incorporate uncertainty – a particular value of RPKM might result from alignment of one or 100 reads. This contrasts with a simple count of the number of reads in the region of interest. Furthermore, count data has known statistical properties that can be exploited in downstream statistical analysis. Thus the result of summarization most useful for assessing differential expression is read count.
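As a rough illustration of the passage above, the sketch below computes RPKM from a read count using the usual definition (reads per kilobase of gene per million mapped reads). The `rpkm` function name and all the numbers are hypothetical; they are chosen only to show that one read and 100 reads can give the same RPKM.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Invented library size, for illustration only.
library_size = 1_000_000

# A short gene supported by a single read...
short_gene = rpkm(read_count=1, gene_length_bp=100, total_mapped_reads=library_size)

# ...and a long gene supported by 100 reads.
long_gene = rpkm(read_count=100, gene_length_bp=10_000, total_mapped_reads=library_size)

print(short_gene, long_gene)  # both are 10.0
```

Both genes end up with the same RPKM, even though the second value rests on 100 observations and the first on only one. That lost information is why the raw read count is preferred as input for differential expression analysis.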
In addition to ddiez's answer, some RNA-seq methods use unique molecular identifiers, and in this context a "count" is sometimes a shorthand for "molecule count".
Don't worry about it, and keep reading and learning. I still remember being very confused about many terms related to NGS when I had just started to work on it (and I am still confused about many things...).
Thank you, ddiez, for a kind explanation.
Then, can I say that "count", not "read", is used to calculate RPKM (reads per kilobase per million)? Did I understand your explanation correctly?
Thanks, HJ
Mmm, I'm not sure exactly what you are trying to say. Maybe this highlight from the document I linked in the answer can help, especially the second paragraph.
I appreciate your help....
But it confuses me even more. "read count"..-.-?
So... according to your first comment, "A read is the oligonucleotide that has been sequenced", a read must be something like "AGTCGATTA....". So a "read" is not a number and cannot be used to calculate RPKM... Am I right?
And... "Counts are the number of reads"... I understand this part like this: if two reads (e.g. "AGCTGGA" and "AGGAAGT") are mapped to Gene A, then the count of Gene A is 2. So this number "2" is used to calculate RPKM...
Did I understand correctly?
Look forward to your advice.
Thanks, HJ
Yes, that is correct. A read is a sequence; the number of reads (i.e. the read count) is the number.
Thank you, ddiez, for the kind explanation and helping me clear that up.. HJ
To answer your question more directly, read counts are used to compute RPKM.