What is the actual cause of excessive zeroes in single cell RNA-seq data? Is it PCR?
3
5
7.2 years ago

First, sorry if I am missing something basic: I am a programmer recently turned bioinformatician, so there is still a lot I don't know. This is cross-posted from a question on Bioinformatics SE; I hope this is not bad form (it's my first post on both platforms).


While it is obvious that scRNA-seq data contain lots of zeroes, I couldn't find any detailed explanation of why they occur more frequently than would be expected from a negative binomial distribution, except for brief remarks along the lines of "substantial technical and biological noise". For the rest of this post, let's assume we are looking at a single gene that is expressed at approximately the same level across all cells.

If zeroes were caused solely by low capture efficiency and sequencing depth, all observed zeroes should be explained by low mean expression across cells. This, however, does not seem to be the case, as the distribution of gene counts across cells often has more zeroes than would be expected from a negative binomial model. For example, the ZIFA paper explicitly uses a zero-inflated negative binomial distribution to model scRNA-seq data. Modelling scRNA-seq counts as zero-inflated negative binomial seems widespread throughout the literature.

However, assuming a negative binomial distribution for the original counts (as measured in bulk RNA-seq), and assuming that every RNA fragment of the same gene from every cell has approximately the same (low) chance of being captured and sequenced, the distribution across single cells should still be negative binomial (see this Math SE question for the related math).
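The thinning argument above can be checked numerically. This is a minimal sketch, not anyone's published method: the parameter values (`mean`, `size`, `capture`) are made up for illustration, and the only claim is that binomially thinning negative binomial counts should give a zero fraction matching a negative binomial with the same size parameter and a proportionally reduced mean.

```python
import numpy as np

def nb_zero_prob(mean, size):
    """P(X = 0) for a negative binomial with the given mean and size (inverse dispersion)."""
    return (size / (size + mean)) ** size

def simulate_thinning(mean=50.0, size=2.0, capture=0.05, n_cells=200_000, seed=0):
    """Draw NB 'true' counts, keep each molecule with probability `capture`,
    and compare the observed zero fraction with the NB prediction."""
    rng = np.random.default_rng(seed)
    # numpy parameterises the NB by (n, p) with mean n * (1 - p) / p
    p = size / (size + mean)
    true_counts = rng.negative_binomial(size, p, size=n_cells)
    # every molecule is captured independently with the same probability
    captured = rng.binomial(true_counts, capture)
    observed_zero = float(np.mean(captured == 0))
    # thinning keeps the size parameter and scales the mean by `capture`
    expected_zero = nb_zero_prob(mean * capture, size)
    return observed_zero, expected_zero

observed, expected = simulate_thinning()
print(f"observed zero fraction: {observed:.4f}")
print(f"NB prediction:          {expected:.4f}")
```

If the two numbers agree, uniform low capture on its own does not produce excess zeroes relative to a negative binomial, which is the point made above.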

So the only remaining possible cause seems to be PCR. Only non-zero counts (after capture) are amplified and then sequenced, shifting the mean of the observed gene counts away from zero while the pre-PCR zero counts stay zero. Indeed, some quick simulations show that such a procedure could occasionally generate zero-inflated negative binomial distributions. This would suggest that excessive zeroes should not be present when UMIs are used. I checked one scRNA-seq dataset with UMIs and it seems to be fit well by a plain negative binomial.
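A rough sketch of the kind of simulation described above, under loudly stated assumptions: amplification is modelled as Poisson read sampling around `gain` times the captured molecule count (a stand-in, not a realistic PCR model), the parameter values are arbitrary, and the negative binomial is matched by moments rather than fitted by maximum likelihood, so the comparison is only indicative.

```python
import numpy as np

def moment_matched_nb_zero_prob(counts):
    """Zero probability of the NB whose mean and variance match `counts`."""
    mean = counts.mean()
    var = counts.var()
    if var <= mean:
        raise ValueError("counts are not overdispersed; NB moment matching fails")
    size = mean ** 2 / (var - mean)
    return (size / (size + mean)) ** size

def simulate_amplification(captured_mean=2.0, size=2.0, gain=50.0,
                           n_cells=200_000, seed=1):
    """Captured molecules are NB; reads are Poisson around gain * captured.
    Cells with zero captured molecules necessarily stay at zero reads."""
    rng = np.random.default_rng(seed)
    p = size / (size + captured_mean)
    captured = rng.negative_binomial(size, p, size=n_cells)
    # hypothetical amplification + sequencing step (assumption, see lead-in)
    reads = rng.poisson(captured * gain)
    observed_zero = float(np.mean(reads == 0))
    predicted_zero = float(moment_matched_nb_zero_prob(reads))
    return observed_zero, predicted_zero

observed, predicted = simulate_amplification()
print(f"observed zero fraction of reads: {observed:.4f}")
print(f"moment-matched NB prediction:    {predicted:.4f}")
```

In this toy setting the read counts have many more zeroes than the moment-matched negative binomial predicts. Collapsing reads to UMIs would recover the captured molecule counts, which are negative binomial by construction here.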

Is my reasoning correct? Thanks for any pointers.

RNA-Seq single cell pcr modelling • 5.3k views
ADD COMMENT
5
7.2 years ago
george.ry ★ 1.2k

There are a variety of things at play here, but a few main ones stand out. Others can probably add some extra considerations.

  • Phased gene expression. RNA is not produced constantly and in an even manner. If I remember correctly, the idea of scRNAseq comes from studies of transcriptional regulation, rather than from DE analysis.
  • Cell death and degradation or spilling of the cytoplasmic RNAs.
  • Incomplete or inefficient RT ...
  • ... linking to very binary PCR amplification. Genes come through RT+PCR and can be found expressed reasonably highly, or they're just absent.

In my experience with Fluidigm scRNAseq (primary T cells, so not necessarily as nice as cell lines etc.), you recover around the top 20-25% of your most highly expressed transcripts as assessed by bulk RNAseq. The drop-out can be very substantial.

ADD COMMENT
0

Thanks for the info. However, I believe it does not answer my question. AFAIK phased gene expression should be more or less accounted for by the negative binomial distribution. Or am I wrong on this point? And note that I am not interested in why the large number of zeroes occurs, but in why there are more zeroes than could be explained by a negative binomial distribution (which can allow for a lot of zeroes if the mean is low or the dispersion is high).
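To make the last point concrete, here is a small sketch of how many zeroes a plain negative binomial allows. It assumes the common parameterisation var = mean + dispersion * mean^2; the grid of values is arbitrary.

```python
def nb_zero_prob(mean, dispersion):
    """P(X = 0) for an NB with var = mean + dispersion * mean**2."""
    size = 1.0 / dispersion  # size is the inverse of the dispersion
    return (size / (size + mean)) ** size

# The zero probability grows as the mean falls or the dispersion rises.
for mean in (0.5, 5.0):
    for dispersion in (0.1, 1.0, 10.0):
        p0 = nb_zero_prob(mean, dispersion)
        print(f"mean={mean:>4}  dispersion={dispersion:>5}  P(0)={p0:.3f}")
```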

ADD REPLY
1

Burst transcription won't fit a negative binomial distribution; rather, it'll either be zero-inflated or show something like multi-modal negative binomial variance.
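One crude way to picture this is a snapshot in which some cells are caught in an "off" state. The sketch below is only that: it assumes a fixed fraction of cells with no transcripts and negative binomial counts in the rest, with invented parameter values, and it does not model burst kinetics, so whether real bursting behaves this way is left open.

```python
import numpy as np

def simulate_on_off(on_fraction=0.7, on_mean=20.0, size=5.0,
                    n_cells=200_000, seed=2):
    """Mixture of 'off' cells (count 0) and 'on' cells (NB counts).
    Returns the observed zero fraction and the zero probability of a
    moment-matched plain NB."""
    rng = np.random.default_rng(seed)
    is_on = rng.random(n_cells) < on_fraction
    p = size / (size + on_mean)
    on_counts = rng.negative_binomial(size, p, size=n_cells)
    counts = np.where(is_on, on_counts, 0)
    # match a single NB to the mixture by mean and variance
    mean, var = counts.mean(), counts.var()
    fitted_size = mean ** 2 / (var - mean)
    predicted_zero = (fitted_size / (fitted_size + mean)) ** fitted_size
    return float(np.mean(counts == 0)), float(predicted_zero)

observed, predicted = simulate_on_off()
print(f"observed zero fraction:       {observed:.4f}")
print(f"moment-matched NB prediction: {predicted:.4f}")
```

Under these assumptions the mixture is bimodal (a spike at zero plus a hump around `on_mean`), and a single negative binomial matched by moments underestimates the zeroes.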

ADD REPLY
2
7.2 years ago

I think the simple answer might just be that handling small numbers of molecules is difficult, and a larger percentage of molecules is lost in single-cell prep than in bulk prep.

Maybe there is a big difference in how our prep reagents work on the micro-environments of small numbers of molecules vs large amounts. Maybe a clump of a large number of molecules confers some kind of protection against loss that a small number of molecules lacks? I can't really say, but it is an interesting question.

ADD COMMENT
0

Thanks for the note.

"Maybe a clump of large amount of molecules confers some kind of protection against loss better than small number of molecules?"

Do you have any link/hint/speculation on why this could be the case? From my (fairly limited) understanding of biochemistry this seems counter-intuitive. Or is there reason to believe that transcripts of the same gene tend to be localized at the same location in the cell?

ADD REPLY
1

Transcription can often occur in a burst, so at least for a short period of time a bunch of copies will be in close spatial proximity. Regardless, it's not like RNA is uniformly distributed throughout the cell. Specific transcripts will get targeted for translation to specific areas (e.g., in neurons some of them get trafficked to synapses for local translation).

ADD REPLY
0

I was referring to the library prep, actually. Maybe a dense clump of extracted RNA from many cells confers some kind of protection against loss compared with the fewer molecules from a single cell. I have no evidence for this at all, and I am sure people more experienced with lab work can correct me. I am just throwing out possible explanations.

Precipitating molecules out of solution works differently depending on the concentration of reagents/molecules. Perhaps the reagents used in single-cell preps were originally optimized for bulk concentrations and don't work as efficiently with single cell?

I guess I am just trying to say that we can use the same reagents in both bulk and single-cell library prep and still get different results, purely because of the physical concentration of the starting RNA.

ADD REPLY
1
7.2 years ago

UMI-based scRNA-seq can have very high drop-out rates, too (Drop-seq consistently returns only about 10-20% of the transcripts). Lack of capture is probably due to a mix of (relatively!) low expression and sequence bias.

I've also observed that even very strongly expressed genes are sometimes simply not present in a given cell despite all other hints (i.e., all other transcripts that were captured) supporting the notion that this cell should have been expressing a given housekeeping or marker gene.

ADD COMMENT
0

Thanks for the info. Do you have a link to the UMI-based dataset you are talking about (where there are strongly expressed genes that still have zeroes in some cells)? I'd like to check if it looks the same to me as the other datasets I have looked at.

ADD REPLY
0

The particular data sets I have in mind aren't published yet. Where did you get your example data from? You could check the single cell portal for other data sets. For example, if you go to the Retina Dataset --> "Explore" and then check the expression of Isl1, you will see that it tends to be very strongly expressed in some cell subsets. But even for those cells expected to express this gene very highly, there are some that do not.

ADD REPLY
