Question

Understanding "drop-out" for scRNA-seq data

21 months ago

Alexander ▴ 220

"Dropout" in single-cell RNA sequencing data is the phenomenon whereby some genes that are biologically expressed are nevertheless NOT observed by the scRNA-seq procedure; i.e., if you get a zero, that does not mean the true (biological) value is zero.

How should we think about the statistical properties of dropout in scRNA-seq data?
Should we think of it as roughly uniform over cells x genes, or is it more likely that genes with lower expression have a higher probability of dropout? What are some biological reasons for dropout? What are some good sources to read about it?

Here is an example that puzzles me: for the HIGHER values of the protein we do NOT see any non-zero RNA at all! How can that be explained? It is counterintuitive, since higher protein levels should typically require higher RNA levels.

[Figure: scatter plot of CD197 protein vs. CD197 RNA] (From Antonina Dolgorukova's notebook: https://www.kaggle.com/code/antoninadolgorukova/citeseq23-exploratory-analysis?scriptVersionId=120760907&cellId=43 ) This is CITE-seq data, so we have BOTH the CD197 protein (X-axis) and the CD197 RNA (Y-axis) for each cell. (Color corresponds to yet another protein, CD19.)

scRNA-seq • 917 views

updated 21 months ago by zdebruine ▴ 120 • written 21 months ago by Alexander ▴ 220

score 2 · Answer 1 · 2023-03-01

One likely biological scenario is a difference between protein and RNA kinetics. Proteins can remain present long after their cognate transcripts have been degraded. Often, RNAs are targeted for degradation after translation; otherwise, uncontrolled translation could occur.
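As a rough illustration of that kinetic argument, here is a minimal sketch of a transcription pulse followed by mRNA and protein decay. Every number in it (rates, half-lives, pulse length) is an assumption chosen for illustration, not a measurement for CD197 or any real gene, and `simulate_pulse` is a hypothetical helper name.

```python
import numpy as np

def simulate_pulse(hours=48.0, dt=0.01, pulse_end=4.0,
                   k_tx=10.0, k_tl=5.0,
                   mrna_half_life=1.0, protein_half_life=24.0):
    """Euler integration of a toy two-species model (all parameters assumed).

    dm/dt = k_tx * [t < pulse_end] - d_m * m
    dp/dt = k_tl * m               - d_p * p
    """
    d_m = np.log(2) / mrna_half_life      # mRNA decay rate (per hour)
    d_p = np.log(2) / protein_half_life   # protein decay rate (per hour)
    n = int(round(hours / dt))
    t = np.arange(n + 1) * dt
    m = np.zeros(n + 1)
    p = np.zeros(n + 1)
    for i in range(n):
        tx = k_tx if t[i] < pulse_end else 0.0   # transcription only during the pulse
        m[i + 1] = m[i] + dt * (tx - d_m * m[i])
        p[i + 1] = p[i] + dt * (k_tl * m[i] - d_p * p[i])
    return t, m, p

t, m, p = simulate_pulse()
i24 = int(round(24.0 / 0.01))   # index of the 24-hour time point (dt = 0.01)
print(f"t = {t[i24]:.0f} h: mRNA = {m[i24]:.2e}, protein = {p[i24]:.1f}")
```

With these assumed half-lives, the transcript is essentially gone a day after the pulse while a substantial amount of protein remains, which is the kind of "high protein, no RNA" cell the plot in the question shows.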

The premise of your question seems to be that there should be a linear (or nearly linear) association between protein and transcript presence. However, your data are normalized, and both the method of normalization and the counts/presence of other transcripts or proteins can confound this assumption. Even on raw counts, nothing prevents a single RNA transcript from being translated hundreds of times, which causes significant asymmetry. Furthermore, the transcriptional and protein contexts in the cell at the time of transcription can affect the kinetics of translation.
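To make the asymmetry concrete, here is a small simulation sketch. It assumes (these are my assumptions, not properties of your dataset) that each cell holds a few mRNA copies, that each copy yields on the order of hundreds of proteins, and that only a fixed fraction of mRNA molecules is captured by the assay. The specific values of `capture_rate` and `proteins_per_transcript` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 10_000
capture_rate = 0.1              # assumed fraction of mRNA molecules detected
proteins_per_transcript = 200   # assumed "hundreds" of proteins per mRNA copy

# True (unobserved) mRNA copies per cell and the resulting protein level.
true_mrna = rng.poisson(lam=3.0, size=n_cells)
protein = rng.poisson(lam=proteins_per_transcript * true_mrna)

# Observed RNA counts: each true molecule is captured independently.
observed_rna = rng.binomial(true_mrna, capture_rate)

expressing = true_mrna > 0
zero_observed = observed_rna == 0

# Fraction of truly expressing cells whose RNA count is nevertheless zero.
frac_missed = np.mean(zero_observed[expressing])
print(f"expressing cells with zero observed RNA: {frac_missed:.2f}")

# The chance of observing a zero shrinks as the true copy number grows.
for copies in range(1, 6):
    sel = true_mrna == copies
    print(copies, round(float(np.mean(zero_observed[sel])), 2),
          "expected", round((1 - capture_rate) ** copies, 2))
```

Under these assumptions, many cells carry plenty of protein yet show zero RNA counts, and the probability of a zero is not uniform: it falls off as (1 - capture_rate) ** copies, so lowly expressed genes are missed more often.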

This all comes down to a simple fact: predicting the abundance of a protein from the abundance of its cognate transcript alone greatly underestimates the complexity of context. What you really need is a rich model that considers context (and possibly prior information) when predicting protein abundance.