This is a more theoretical question. Any references to published scientific articles on this topic would be very helpful.
The main question is whether UMI-based scRNAseq data is compositional or not. I acknowledge that most standard scRNAseq analysis pipelines apply library size normalization, which effectively turns the UMI counts into compositional data. But given that we are sampling from a single cell, and the barcodes/adaptors should be in excess at the beginning of the reaction, is it still rational to normalize to the library size of that single cell if the later sequencing depth is high enough?
I also acknowledge that various steps during library preparation can introduce differential amplification. My impression is that library size normalization would not correct these technical errors. Is that true?
I'll say one thing about single-cell: there are a ton of technical biases that aren't accounted for in standard pipelines, and their effects on downstream analysis aren't well characterized (it's an active area of research). Whatever you do, you're making a ton of assumptions. This comment may be unhelpful but it's the truth.
As for your sequencing depth question: yes, you still need to do some sort of sequencing depth normalization. If one single cell has 8000 UMIs while another cell of the same cell type has 100 UMIs (which does happen due to technical artifacts), you need to normalize in order to compare the two.
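To make the 8000 vs 100 example concrete, here is a minimal Python sketch of plain library size normalization. The function name, the scale factor of 10,000 and the three-gene toy counts are all assumptions for illustration, not taken from any particular pipeline.

```python
def normalize_library_size(counts, scale=10_000):
    """Scale one cell's UMI counts so that they sum to `scale`.

    `counts` is a list of per-gene UMI counts for a single cell.
    The default scale of 10,000 is an illustrative choice.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("cell has no UMIs")
    return [scale * c / total for c in counts]


# Two hypothetical cells of the same type: identical gene proportions,
# but an 80-fold difference in the number of UMIs recovered.
deep_cell = [4000, 2400, 1600]   # 8000 UMIs in total
shallow_cell = [50, 30, 20]      # 100 UMIs in total

deep_norm = normalize_library_size(deep_cell)
shallow_norm = normalize_library_size(shallow_cell)

print(deep_norm)
print(shallow_norm)
```

After scaling, both cells give the same per-gene values, so they can be compared directly. Every cell now sums to the same total, which means the values only carry relative information. That is the sense in which this step makes the data compositional.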
As for your points on "compositional" and "differential amplification": No, "library size" normalization does not correct for that. See the following paper on what can happen when you have differential amplification https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02386-z
tl;dr if we had all these answers, there wouldn't be a new single-cell computational methods paper published every week.
Will this normalization also rule out "potential" real cells that only have 100 mRNA molecules (I don't know, but maybe a dying cell with a lot of RNase activity)?
Indeed, I asked a related question on bulk RNAseq. However, I agree with you that this is still an outstanding question.
If a cell has only a few UMIs, you should discard it. Whether the count is that low because of sequencing depth for that cell or because it's a dead cell, it'll probably be bad for your analysis to keep such a cell.
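As a rough sketch of that filtering step in Python: the function name, the cell barcodes and the cutoff of 500 UMIs are hypothetical, and a sensible threshold depends on the dataset and protocol.

```python
def filter_low_umi_cells(cells, min_umis=500):
    """Keep only cells whose total UMI count is at least `min_umis`.

    `cells` maps a cell barcode to its list of per-gene UMI counts.
    The default cutoff of 500 is an arbitrary illustrative value.
    """
    return {
        barcode: counts
        for barcode, counts in cells.items()
        if sum(counts) >= min_umis
    }


# Hypothetical toy data: three cells, three genes each.
cells = {
    "cell_A": [4000, 2400, 1600],  # 8000 UMIs
    "cell_B": [50, 30, 20],        # 100 UMIs, dropped by the filter
    "cell_C": [900, 400, 200],     # 1500 UMIs
}

kept = filter_low_umi_cells(cells)
print(sorted(kept))  # ['cell_A', 'cell_C']
```

Note that this filter cannot tell a shallowly sequenced cell from a dying one. It only removes both.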
I doubt there is a 100-fold difference in the number of mRNA molecules between two cells. If it was 8000 UMIs and 4000 UMIs, I think you'd have a point, but at 100 vs 8000, I think you have to assume there are technical differences going on.
That said, there is probably some interesting statistical work to be done in inferring what fraction of transcripts within a cell have been sampled by looking at resampling rates.
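One possible starting point, offered only as a sketch, is the Good-Turing estimate of sample coverage applied to the number of reads observed per UMI: coverage is roughly 1 minus the fraction of reads that come from UMIs seen exactly once. The function name and the toy read counts below are assumptions. Note also that this estimates how thoroughly the sequenced library has been sampled, which is not the same as the fraction of the cell's original mRNA that was captured.

```python
def good_turing_coverage(reads_per_umi):
    """Good-Turing sample coverage: 1 - (singletons / total reads).

    `reads_per_umi` lists, for one cell, how many reads supported each
    distinct UMI. A value near 1 suggests further sequencing would mostly
    re-observe molecules that were already seen.
    """
    total_reads = sum(reads_per_umi)
    if total_reads == 0:
        raise ValueError("no reads to estimate coverage from")
    singletons = sum(1 for r in reads_per_umi if r == 1)
    return 1.0 - singletons / total_reads


# Hypothetical cells: one sequenced deeply, one sequenced shallowly.
well_sampled = [1, 1, 2, 3, 5, 8]   # 20 reads, 2 singleton UMIs
poorly_sampled = [1, 1, 1, 1, 2]    # 6 reads, 4 singleton UMIs

print(round(good_turing_coverage(well_sampled), 3))    # 0.9
print(round(good_turing_coverage(poorly_sampled), 3))  # 0.333
```

A cell with low coverage by this measure is likely undersampled by sequencing, whereas a cell with high coverage and few UMIs points to losses upstream of sequencing or to a cell that genuinely contains little RNA.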
Yes, because this is one of my research directions. I really want to get a sense of what the "consensus" or convention in the community is. I do find this article about amplification biases during PCR interesting, if anyone is interested.