Hi
I have a question. I am trying to align my Plasmodium scRNA-seq data against a combined reference of the human and PF3D7 genomes. Since these cells are from the ring stage, the number of genes expressed is very low (25-50 genes per cell on average), and so are the UMI counts. This matches the neutrophil case study described by 10X here. However, I am perplexed: when I force the number of cells to 10k, I get 17k cells as output. After removing human genes and filtering cells with the following criteria (CreateSeuratObject(counts = counts(seurat_data), project = name, min.cells = 3, min.features = 10)), I end up with a matrix of around 9k cells by 3k genes. Could something be going wrong in the cell estimation when aligning to multiple genomes?
Based on my understanding, cellular barcode detection happens independently of the genome(s) you use. It basically counts how often each barcode is detected, and then a knee method is used to decide whether a barcode is likely to be real (because it is detected frequently) or rather due to noise. If you use the force option, you override all of this. Given the discrepancy between the number of cells detected by the knee and the forced number, I think you are counting a great deal of noisy (= artificial) cells/barcodes. I cannot comment on the specifics of Plasmodium or this type of organism in general, but with 50 genes per cell I wonder how usable the data are.
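To make the knee idea concrete, here is a minimal sketch in plain Python. It is a crude heuristic, not the actual cell-calling algorithm of any pipeline: it ranks barcodes by UMI count on a log-log scale and takes the point that lies farthest above the straight line joining the first and last points of the curve. The function name barcode_rank_knee and the input format (a dict of barcode to total UMI count) are assumptions made for illustration.

```python
import math


def barcode_rank_knee(umi_counts):
    """Estimate the knee of a barcode rank plot (hypothetical helper).

    umi_counts: dict mapping barcode -> total UMI count.
    Returns (rank, count) of the knee point, with rank starting at 1.
    """
    # Rank barcodes from most to fewest UMIs; drop empty ones (log of 0).
    counts = sorted((c for c in umi_counts.values() if c > 0), reverse=True)
    if len(counts) < 3:
        raise ValueError("need at least 3 non-empty barcodes")

    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(c) for c in counts]

    # Straight line from the first to the last point of the log-log curve.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]

    def height_above_line(i):
        # Signed cross product: positive when the point lies above the line.
        return dx * (ys[i] - ys[0]) - dy * (xs[i] - xs[0])

    knee = max(range(len(counts)), key=height_above_line)
    return knee + 1, counts[knee]


def call_cells(umi_counts):
    """Keep barcodes with at least as many UMIs as the knee point."""
    _, threshold = barcode_rank_knee(umi_counts)
    return {bc for bc, c in umi_counts.items() if c >= threshold}
```

Forcing a cell count amounts to skipping barcode_rank_knee and taking the top N barcodes regardless of where the curve bends, which is why a forced number well above the knee pulls in low-count barcodes.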
I would check whether this matrix of 9k cells by 3k genes yields anything substantial, or whether it is just random counts spread across a lot of artificially counted cells. Thinking aloud here.
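One rough way to run that check, sketched in plain Python under loud assumptions: pool the very low-count barcodes into an "ambient" expression profile and ask how similar each retained barcode is to it. A barcode whose profile looks just like the ambient pool is a candidate for noise. The function names, the nested-dict input format and the ambient_max_umis cutoff are all hypothetical, and cosine similarity is only a stand-in for a proper statistical test such as the one in DropletUtils' emptyDrops. With 25-50 genes per cell the similarity will be noisy, so treat it as a screening aid only.

```python
import math


def pooled_profile(matrix, barcodes):
    """Sum per-gene counts over the given barcodes."""
    profile = {}
    for bc in barcodes:
        for gene, n in matrix[bc].items():
            profile[gene] = profile.get(gene, 0) + n
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse gene -> count dicts."""
    dot = sum(v * b.get(gene, 0) for gene, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ambient_similarity(matrix, ambient_max_umis=5):
    """Score each candidate barcode against the pooled ambient profile.

    matrix: dict barcode -> dict gene -> UMI count.
    ambient_max_umis: assumed cutoff; barcodes at or below it are
    treated as empty droplets and pooled into the ambient profile.
    Returns dict barcode -> similarity in [0, 1] for barcodes above the cutoff.
    """
    totals = {bc: sum(genes.values()) for bc, genes in matrix.items()}
    ambient_bcs = [bc for bc, t in totals.items() if t <= ambient_max_umis]
    ambient = pooled_profile(matrix, ambient_bcs)
    return {
        bc: cosine(matrix[bc], ambient)
        for bc, t in totals.items()
        if t > ambient_max_umis
    }
```

If most of the retained barcodes score close to 1, that would support the suspicion that they are background rather than cells.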
Hi ATpoint, thanks for your input. Yes, I was skeptical about the data too in the beginning, but it is what it is, and I have been told that at the time point at which the data were harvested, the parasite expresses a really low number of genes. So this behavior is expected.