Hello! I am trying to understand a lexical detail about the count matrix used in a single-cell RNAseq experiment. I know each entry represents the number of reads mapped to a particular gene in a particular cell. In fact, I have a doubt about the exact meaning of a "read".
If I understood correctly, at the beginning of an scRNA-seq experiment, you have to break the transcripts into small pieces (because the sequencer cannot sequence fragments that are too long). How do we call those small pieces that we have before PCR amplification? Read? Fragment? Both? Then, we have to convert those RNA pieces into DNA and amplify them with PCR: we obtain a lot of copies that, I believe, are called "amplicons". Then we sequence all those amplicons. At that point, I also have a doubt: are all those pieces (including all the duplicates) then written to the FASTQ file? How could we know which amplicons come from the same original piece of RNA?
Once we have the FASTQ files, we can align them to our genome to obtain a BAM file, and at this point we create the count matrix by counting how many lines in the BAM file correspond to an exon of each gene.
So, I would like to know if an entry in a usual count matrix represents:
The number of original "pieces" of all the transcripts matching the region of the gene before amplification? (if this is the case, how can we retrieve this number after amplification?)
The number of amplicons matching the region of the gene (therefore including all the duplicates)? (if this is the case, we assume that all the pieces were equally amplified so that those counts remain comparable?)
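To make the counting step described in the question concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `exons` annotation, the `alignments` records (standing in for lines of a BAM file, with a cell barcode attached) and the `overlapping_gene` helper are hypothetical and not taken from any real tool. It counts every alignment that overlaps an exon, so identical copies are counted separately, which corresponds to the second option above.

```python
from collections import Counter

# Hypothetical exon annotation: gene -> list of (chrom, start, end),
# using half-open intervals [start, end).
exons = {
    "geneA": [("chr1", 100, 200), ("chr1", 300, 400)],
    "geneB": [("chr1", 1000, 1200)],
}

# Toy alignment records standing in for BAM lines:
# (cell_barcode, chrom, start, end)
alignments = [
    ("cell1", "chr1", 120, 170),
    ("cell1", "chr1", 120, 170),    # identical copy (a possible PCR duplicate)
    ("cell1", "chr1", 1050, 1100),
    ("cell2", "chr1", 310, 360),
    ("cell2", "chr1", 500, 550),    # overlaps no exon, so it is not counted
]


def overlapping_gene(chrom, start, end):
    """Return the first gene with an exon overlapping the interval, or None."""
    for gene, intervals in exons.items():
        for ex_chrom, ex_start, ex_end in intervals:
            if chrom == ex_chrom and start < ex_end and end > ex_start:
                return gene
    return None


# One entry per (cell, gene) pair: this is the count matrix in sparse form.
counts = Counter()
for cell, chrom, start, end in alignments:
    gene = overlapping_gene(chrom, start, end)
    if gene is not None:
        counts[(cell, gene)] += 1

for (cell, gene), n in sorted(counts.items()):
    print(cell, gene, n)
```

Note that nothing in these toy records says whether the two identical alignments in `cell1` came from the same original piece of RNA, which is exactly the ambiguity the question is about.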
How do we call those small pieces that we have before PCR amplification? Read? Fragment? Both?
It's called a fragment.
Then, we have to convert those RNA pieces into DNA and amplify them with PCR
It depends on the sequencing technology. After the cDNA is generated, amplification may occur only to amplify the detection signal, as in cluster generation in the modern Illumina process. Older technologies required PCR amplification before sequencing, and the methods required deduplication (for WGS/WXS) or some form of normalization in RNA-seq.
The number of original "pieces" of all the transcripts matching the region of the gene before amplification?
It should represent a proportional signal: low counts mean low expression, and higher counts mean higher expression levels for the gene/transcript.
The number of amplicons matching the region of the gene (therefore including all the duplicates)? (if this is the case, we assume that all the pieces were equally amplified so that those counts remain comparable?)
There is some bias due to GC content or sequence complexity, but in general those biases are not caused by amplicon duplication with recent technologies.
To clarify: the original post was asking about scRNA-seq, which requires many cycles of PCR amplification due to extremely low levels of input material. This results in differential amplification, and therefore bias.
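The effect of differential amplification can be illustrated with a toy simulation. This is only a rough sketch under made-up assumptions: the per-gene efficiencies, the cycle count and the `fragment_id` labels are hypothetical values chosen for illustration, not taken from any protocol. Both genes start with the same number of original fragments, but one is assumed to amplify more efficiently than the other.

```python
import random
from collections import Counter


def amplify(fragments, cycles, efficiency, rng):
    """Toy PCR: in each cycle, every existing copy is duplicated with a
    probability given by the (assumed) efficiency of its gene."""
    copies = list(fragments)
    for _ in range(cycles):
        new_copies = [f for f in copies if rng.random() < efficiency[f[0]]]
        copies.extend(new_copies)
    return copies


rng = random.Random(0)  # fixed seed so the run is reproducible

# Each original fragment is (gene, fragment_id); both genes start with 5.
fragments = [(gene, i) for gene in ("geneA", "geneB") for i in range(5)]

# Assumed per-cycle duplication probabilities (hypothetical values).
efficiency = {"geneA": 0.9, "geneB": 0.6}

amplicons = amplify(fragments, cycles=10, efficiency=efficiency, rng=rng)

# Counting every copy (duplicates included) versus distinct original fragments.
amplicon_counts = Counter(gene for gene, _ in amplicons)
fragment_counts = Counter(gene for gene, _ in set(amplicons))

print("copies per gene:            ", dict(amplicon_counts))
print("original fragments per gene:", dict(fragment_counts))
```

In this sketch the number of copies differs between the two genes even though the number of original fragments is identical. The original fragments can be recovered here only because the simulated data carries a `fragment_id` label for each copy.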