Data Preparation for single-cell Machine Learning classification (svm + RF)
2.8 years ago
fracarb8 ★ 1.7k

I am working with single-cell RNA-Seq data, and I am trying to build a classifier capable of predicting whether samples are controls or patients.

I have a training dataset with around 450000 cells coming from ~50 samples from different projects, each project containing both control and patient data. The idea is to train a classifier on the 50-sample dataset and predict the status of new patients as they come in.

My question is: How do I pre-process the data of the new patients?

The reason I am confused is that for the training data, everything is normalised and scaled together. During integration, I account for the different origins of the samples by regressing out factors like sampleID, project, experiment chemistry, etc. This does not happen for the new samples, as they are analysed independently of any other sample.

This is what I did so far:

  • Integrate the ~50 samples with Seurat (SCT+rPCA)
  • Run NormalizeData and ScaleData(..., vars.to.regress = c("percent.mt", "SampleID", ...)) on the RNA assay.
  • Extract the scale.data slot from the RNA assay
  • Select the list of genes to use to train the model. This is done by combining the variable genes (FindVariableFeatures) with the results of a standard feature selection algorithm (e.g. Boruta)
  • Train/test
  • Split the data (80/20)
  • Train an ensemble classifier (svm + rf)
  • Predict on the test data
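
A minimal sketch of the 80/20 split step, written in plain Python rather than R so it stays self-contained. The post does not say whether the split is made per cell or per sample; this sketch assumes a per-sample split, so that every cell from a held-out sample stays out of training, which mirrors the situation of a new patient arriving. The function name `split_by_sample` and the toy sample IDs are hypothetical.

```python
import random

def split_by_sample(sample_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) cell indices so that all cells from
    one sample land on the same side of the split."""
    # Unique sample labels, sorted so the shuffle is reproducible.
    samples = sorted(set(sample_ids))
    rng = random.Random(seed)
    rng.shuffle(samples)
    # Hold out roughly test_fraction of the samples (at least one).
    n_test = max(1, round(len(samples) * test_fraction))
    test_samples = set(samples[:n_test])
    train_idx = [i for i, s in enumerate(sample_ids) if s not in test_samples]
    test_idx = [i for i, s in enumerate(sample_ids) if s in test_samples]
    return train_idx, test_idx

# Toy data: 100 cells spread evenly over 10 hypothetical samples.
sample_ids = [f"S{i % 10}" for i in range(100)]
train_idx, test_idx = split_by_sample(sample_ids)
```

A per-cell split would instead shuffle the cell indices directly; in that case cells from the same sample appear in both training and test sets, so the test accuracy may not reflect performance on a sample the model has never seen.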

I still need to tweak and improve the model, but so far it reaches good accuracy on the test dataset.

When a new patient arrives, I am planning on:

  • Analyse the data with seurat
  • Extract the scale.data slot from the RNA assay
  • Predict

Is it correct to feed the Seurat normalised and scaled data to the classifier for the prediction?

Would it be better to ditch Seurat entirely, start from the raw counts, and normalise and scale the data in the same way for both the training samples and the new patients?
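
To make the "same way for both" option concrete, here is a minimal sketch in plain Python (not Seurat) under stated assumptions: per-cell library-size normalisation to 10,000 followed by log1p, then per-gene centring and scaling with means and standard deviations learned from the training cells only and reused unchanged for new patients. The helper names (`log_normalise`, `fit_gene_scaler`, `apply_gene_scaler`) and the toy count matrices are hypothetical, and no covariates are regressed out here.

```python
import math

def log_normalise(counts, scale_factor=10_000):
    """Per-cell library-size normalisation followed by log1p.
    counts: list of cells, each a list of raw gene counts."""
    out = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # An empty cell stays all-zero instead of dividing by zero.
            out.append([0.0] * len(cell))
        else:
            out.append([math.log1p(c / total * scale_factor) for c in cell])
    return out

def fit_gene_scaler(train):
    """Learn per-gene mean and sample standard deviation from training cells."""
    n_cells, n_genes = len(train), len(train[0])
    means, sds = [], []
    for g in range(n_genes):
        col = [cell[g] for cell in train]
        mean = sum(col) / n_cells
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (n_cells - 1))
        means.append(mean)
        # Constant genes get sd 1.0 so they scale to 0 rather than NaN.
        sds.append(sd if sd > 0 else 1.0)
    return means, sds

def apply_gene_scaler(data, means, sds):
    """Scale any cells (training or new) with already-fitted parameters."""
    return [[(x - m) / s for x, m, s in zip(cell, means, sds)] for cell in data]

# Toy raw counts: 4 training cells and 2 new-patient cells, 3 genes each.
train_counts = [[10, 0, 5], [3, 2, 8], [0, 7, 1], [6, 1, 4]]
new_counts = [[4, 1, 9], [0, 3, 2]]

train_norm = log_normalise(train_counts)
means, sds = fit_gene_scaler(train_norm)           # fitted once, on training data
train_scaled = apply_gene_scaler(train_norm, means, sds)
new_scaled = apply_gene_scaler(log_normalise(new_counts), means, sds)
```

The point of the design is that `means` and `sds` are stored alongside the trained model, so a new patient's cells are placed on the same scale the classifier saw during training instead of being re-centred on their own distribution.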

Would it be even better to integrate the new samples with the original dataset, and predict on the globally normalised dataset?

seurat machine-learning scRNA-Seq • 464 views