I know that Malat-1 expression is an indicator of dying cells. Would it be reasonable to filter cells with high Malat-1 expression? Or would it be better to regress out the Malat-1 gene during scaling?
My comment is general since I've never looked at this gene specifically, but in my experience metrics of poor cell quality never come alone. Dying cells will also have a large fraction of mitochondrial reads and therefore fewer other genes detected, and such trash cells typically aggregate together in a UMAP plot. If you see that the suspicious cells are also high in this gene, then maybe yes, filter. If this gene is the only thing indicating "dying cells", then maybe some other biology is involved.
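As a rough illustration of checking whether the quality metrics really do come together, here is a minimal sketch in plain Python. The toy counts, the function names `qc_metrics` and `flag_cells`, and the cutoffs (20% mitochondrial, fewer than 5 genes) are all made up for the example and are not recommendations; it also assumes mouse-style mitochondrial gene names starting with "mt-".

```python
# Toy data: one dict of gene -> UMI count per cell (hypothetical values).
cells = {
    "cell_A": {"Malat1": 40, "mt-Co1": 5, "mt-Nd1": 3, "Actb": 30, "Gapdh": 22, "Cd3e": 10},
    "cell_B": {"Malat1": 35, "mt-Co1": 30, "mt-Nd1": 25, "Actb": 2},
    "cell_C": {"Malat1": 90, "mt-Co1": 2, "Actb": 25, "Gapdh": 18, "Cd3e": 7, "Ptprc": 9},
}


def qc_metrics(counts):
    """Per-cell QC summary: total UMIs, genes detected, % mito, % Malat1."""
    total = sum(counts.values())
    # Assumption: mitochondrial genes are named with an "mt-" prefix.
    mito = sum(n for gene, n in counts.items() if gene.lower().startswith("mt-"))
    return {
        "n_umis": total,
        "n_genes": sum(1 for n in counts.values() if n > 0),
        "pct_mito": 100.0 * mito / total if total else 0.0,
        "pct_malat1": 100.0 * counts.get("Malat1", 0) / total if total else 0.0,
    }


def flag_cells(cells, max_pct_mito=20.0, min_genes=5):
    """Flag a cell only when high mito content AND few detected genes co-occur.

    The default cutoffs are placeholders for this toy example.
    """
    flags = {}
    for name, counts in cells.items():
        m = qc_metrics(counts)
        flags[name] = m["pct_mito"] > max_pct_mito and m["n_genes"] < min_genes
    return flags


flags = flag_cells(cells)
for name, counts in cells.items():
    print(name, qc_metrics(counts), "flagged" if flags[name] else "kept")
```

In this toy data, cell_C has the highest Malat1 share but is not flagged, because neither its mitochondrial fraction nor its gene count looks suspicious. That is the situation where Malat1 alone points at "dying cells" and the other metrics do not.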
I'm assuming you read https://kb.10xgenomics.com/hc/en-us/articles/360004729092-Why-do-I-see-high-levels-of-Malat1-in-my-gene-expression-data
In my experience, Malat1 is just some weird artifact: it is highly captured in a huge number of scRNA-seq datasets regardless of protocol, and I get good results without doing anything about it.
And the standard for the field is: Don't do anything about it. :)
Malat1 is a lncRNA that is abundant in the nucleus. If Malat1 is abundant and stable, it makes sense that it would be detected in high amounts in scRNA-seq. In many cases scRNA-seq doesn't dissociate the nucleus (at least not completely). There's a reason why >25% of transcripts in scRNA-seq datasets are unspliced.
The "nucleus-ness" of single cells would be an interesting technical effect to look more closely at, as it does drive clustering results -- but one shouldn't assume that "nucleus-ness" = suboptimal/dead cells.
You can consider the Malat1-high cluster an undefined cluster if you don't find it interesting or have trouble annotating it, but I wouldn't threshold on Malat1 expression since it's such an abundantly expressed gene in many cells.
Malat1 correlates with the intronic content and can be used as a nuclear indicator. In this preprint we discuss this artifact and the use of Malat1 or intronic content as quality metrics: https://www.biorxiv.org/content/10.1101/2024.04.18.590104v2
Thank you for sharing.
What are your thoughts on the genes Gm42418, AY036118, and Gm26917? In some articles, clusters with an abundance of these genes were simply removed from the data. What exactly identifies these genes as contamination? And what would they mean in a dataset where intronic reads are included?
I assume there is no way to calculate the intronic fraction of cells if only count data is available, as you mentioned in your article. So if I must stick to the Malat1 gene, how should I quantify it? Should I score the data using the "AddModuleScore()" function, or should I only calculate the percentage of counts that come from Malat1? In either case, what would be the threshold?
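For the "percentage of Malat1" option, the arithmetic is just Malat1 counts divided by each cell's total counts. A minimal sketch in plain Python, assuming a raw genes x cells count matrix; the matrix values, gene list, and the helper name `percent_of_gene` are hypothetical. It deliberately prints a summary of the distribution rather than applying a cutoff, since no threshold has been established in this thread.

```python
from statistics import median

# Toy raw count matrix: rows are genes, columns are cells (hypothetical values).
genes = ["Malat1", "Actb", "Gapdh", "mt-Co1"]
cell_ids = ["c1", "c2", "c3", "c4"]
counts = [
    [50, 10, 0, 120],  # Malat1
    [30, 40, 20, 60],  # Actb
    [15, 45, 25, 15],  # Gapdh
    [5, 5, 5, 5],      # mt-Co1
]


def percent_of_gene(counts, genes, gene):
    """Percentage of each cell's total counts that come from one gene."""
    row = counts[genes.index(gene)]
    totals = [sum(col) for col in zip(*counts)]  # per-cell column sums
    return [100.0 * g / t if t else 0.0 for g, t in zip(row, totals)]


pct = percent_of_gene(counts, genes, "Malat1")
for cell, p in zip(cell_ids, pct):
    print(f"{cell}: {p:.1f}% Malat1")

# Look at the distribution before deciding whether any cutoff makes sense.
print("min / median / max:", min(pct), median(pct), max(pct))
```

In Seurat, PercentageFeatureSet() computes the same kind of per-cell percentage for a chosen set of features. Whichever tool is used, it should be run on raw counts, not on normalized or scaled values.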
I briefly looked at those genes on a genome browser -- there are a lot of repeat elements in those genes, meaning a lot of counts assigned to them are probably spurious (for example, rRNA might map to those genes). Perhaps they might correlate with intron content because many introns have low complexity sequences? Or maybe because rRNAs are abundant in the nucleus and those are nuclear genes? You'd have to check.
In your paper, you mentioned that a normalized Malat1 value of 0 is also indicative of a low-quality cell. But what if dropout is at play? Let's say a cell's Malat1 expression is 0, but its number of detected genes and its RNA count appear quite normal. In that case, should this cell still be removed from the data?
I have noticed in articles that people define low-quality clusters and remove them from the data, but they don't exactly explain what makes these clusters low quality. Malat1 is a nuclear gene, and it shouldn't be detected in high amounts in single-cell analysis, I assume? So if a cell has a high amount of nuclear genes, does it mean that they should be discarded?
Could we say that if a cell has a lower amount of unique genes and detected RNA molecules and also has a high amount of nuclear genes, this cell's RNA molecules in the cytoplasm disappeared somehow and only nuclear genes are detected?