I am working with single-cell RNA-Seq data, and I am trying to build a classifier capable of predicting whether samples are controls or patients.
I have a training dataset with around 450,000 cells coming from ~50 samples from different projects, each project containing both control and patient data. The idea is to train a classifier on the 50-sample dataset and predict the status of new patients as they come in.
My question is: How do I pre-process the data of the new patients?
The reason I am confused is that for the training data, everything is normalised and scaled together. During integration, I account for the different origin of the samples by regressing out factors like `sampleID`, `projects`, `experiment chemistry`, ...
This is not happening for the new samples, as they are analysed independently from any other sample.
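To make the concern concrete, here is a minimal sketch with toy numbers in plain Python (not Seurat; the values and function names are made up for illustration). It z-scores one gene either with the mean and standard deviation of the training cells, or with the new sample's own statistics. In the second case the shift between the new patient and the training data disappears, which is the mismatch I am worried about.

```python
from statistics import mean, pstdev

def zscore(values, mu, sigma):
    """Centre and scale values with the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

# Toy log-normalised expression of one gene (hypothetical numbers).
train_controls = [1.0, 1.2, 0.8, 1.0]
train_patients = [2.0, 2.2, 1.8, 2.0]
train = train_controls + train_patients

# A new sample made only of patient cells.
new_patient = [2.1, 1.9, 2.0, 2.0]

# Statistics learned on the training cells.
mu_train, sd_train = mean(train), pstdev(train)

# (a) Scale the new sample with the training statistics.
scaled_with_train = zscore(new_patient, mu_train, sd_train)

# (b) Scale the new sample on its own, as happens when it is analysed independently.
scaled_alone = zscore(new_patient, mean(new_patient), pstdev(new_patient))

print(mean(scaled_with_train))  # clearly positive: the patient shift is kept
print(mean(scaled_alone))       # approximately zero: the shift is gone
```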
This is what I did so far:
- Integrate the ~50 samples with Seurat (`SCT` + `rPCA`)
- Run `NormalizeData` and `ScaleData(..., vars.to.regress = c("percent.mt", "SampleID", ...))` on the RNA assay
- Extract the `scale.data` slot from the RNA assay
- Select the list of genes to use to train the model. This is done by combining the variable genes (`FindVariableFeatures`) and the results of a standard feature selection algorithm (e.g. `Boruta`)
- Train/test
  - Split the data (80/20)
  - Train an ensemble classifier (`svm + rf`)
  - Predict on the test data
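The gene selection and train/test steps can be sketched roughly as below. This is plain Python with hypothetical gene names, cell IDs and probabilities; the real models are an SVM and a random forest, and averaging their probabilities is only one assumed way of combining them.

```python
import random

def union_preserving_order(*gene_lists):
    """Combine gene lists, dropping duplicates but keeping first-seen order."""
    seen, combined = set(), []
    for genes in gene_lists:
        for g in genes:
            if g not in seen:
                seen.add(g)
                combined.append(g)
    return combined

def split_80_20(items, seed=0):
    """Shuffle a copy of items and return (train, test) with an 80/20 split."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.8 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def soft_vote(prob_svm, prob_rf, threshold=0.5):
    """Average two per-cell patient probabilities and threshold the result."""
    return ["patient" if (a + b) / 2 >= threshold else "control"
            for a, b in zip(prob_svm, prob_rf)]

# Hypothetical outputs of FindVariableFeatures and Boruta.
variable_genes = ["GENE_A", "GENE_B", "GENE_C"]
boruta_genes = ["GENE_C", "GENE_D"]
features = union_preserving_order(variable_genes, boruta_genes)

# Hypothetical cell IDs, split 80/20.
cells = [f"cell_{i}" for i in range(10)]
train_cells, test_cells = split_80_20(cells)

# Hypothetical per-cell patient probabilities from the two models.
labels = soft_vote([0.9, 0.2, 0.6], [0.7, 0.1, 0.3])
```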
I still need to tweak and improve the model, but so far I can reach good prediction accuracy on the test dataset.
When a new patient arrives, I am planning to:
- Analyse the data with Seurat
- Extract the `scale.data` slot from the RNA assay
- Predict
Is it correct to feed the Seurat normalised + scaled data to the classifier for the prediction?
Would it be better to ditch Seurat entirely, start from the raw counts, and normalise + scale the data in the same way for both the training samples and the new patients?
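What I have in mind for this second option is roughly the following. It is a sketch in plain Python under my own assumptions: per-cell log-normalisation as `log1p(count / total * 10000)` (the scale factor Seurat's `NormalizeData` uses by default), then per-gene scaling with parameters learned on the training cells only. The helper names and counts are hypothetical, and it does not reproduce `vars.to.regress`.

```python
import math
from statistics import mean, pstdev

SCALE_FACTOR = 10_000  # assumed: same value as Seurat's default scale.factor

def log_normalise(cell_counts):
    """Per-cell normalisation: log1p(count / total * scale factor)."""
    total = sum(cell_counts.values())
    return {g: math.log1p(c / total * SCALE_FACTOR)
            for g, c in cell_counts.items()}

def fit_scaler(cells, genes):
    """Learn per-gene mean and sd from the training cells only."""
    params = {}
    for g in genes:
        values = [cell.get(g, 0.0) for cell in cells]
        sd = pstdev(values)
        params[g] = (mean(values), sd if sd > 0 else 1.0)
    return params

def transform(cell, params):
    """Scale one normalised cell with the stored training parameters."""
    return [(cell.get(g, 0.0) - mu) / sd for g, (mu, sd) in params.items()]

# Hypothetical raw counts for two training cells and one new cell.
genes = ["GENE_A", "GENE_B"]
train_raw = [{"GENE_A": 5, "GENE_B": 95}, {"GENE_A": 20, "GENE_B": 80}]
train_norm = [log_normalise(c) for c in train_raw]
params = fit_scaler(train_norm, genes)

new_raw = {"GENE_A": 10, "GENE_B": 90}
new_scaled = transform(log_normalise(new_raw), params)
```

Here the new cell never contributes to the mean or sd, so training and new data go through exactly the same transformation.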
Would it be even better to integrate the new samples with the original dataset, and predict on the globally normalised dataset?