Hi! I am doing label transfer from a reference dataset to classify two query sets that should contain exactly the same cell types. I noticed that when I repeat training and annotation over several iterations, the classifications differ from one iteration to the next.
import celltypist
import pandas as pd
import scanpy as sc

reference = sc.read_h5ad("data/combined_ref.h5ad")
query1 = sc.read_h5ad("querys/unnorm_sc_C32-24h.h5ad")
query2 = sc.read_h5ad("querys/unnorm_sc_C32-72h.h5ad")

# Normalize every dataset to 10,000 counts per cell, then log1p-transform.
for adata in (query1, query2, reference):
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

# One row per cell; one column of majority-voting labels is added per run.
predictions24h = pd.DataFrame()
predictions72h = pd.DataFrame()
predictions24h['id'] = list(query1.obs_names)
predictions72h['id'] = list(query2.obs_names)

features = []
for i in range(25):
    print(f"iteration{i}")
    # Retrain on the same reference with identical settings in every iteration.
    model2 = celltypist.train(reference, labels='CellClass', n_jobs=10, feature_selection=True)
    if i == 0:
        features = model2.features
    # Keep only the features that were selected in every run so far.
    extracted = model2.features
    features = list(set(extracted) & set(features))
    prediction_query1 = celltypist.annotate(query1, model=model2, majority_voting=True)
    prediction_query2 = celltypist.annotate(query2, model=model2, majority_voting=True)
    adata2_query1 = prediction_query1.to_adata()
    adata2_query2 = prediction_query2.to_adata()
    predictions24h[f'run{i}'] = list(prediction_query1.predicted_labels.majority_voting)
    predictions72h[f'run{i}'] = list(prediction_query2.predicted_labels.majority_voting)
As the next plot shows, for each sample (rows) I plotted the percentage of iterations in which each cell type was predicted. For example, across the 25 iterations the first sample in the graph was classified as radial glia 40% of the time and as glioblast 60% of the time.
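One way to compute such per-row percentages from a table shaped like `predictions24h` is sketched below. This is only a minimal illustration on a toy table: the cell ids, the labels, and the helper name `label_percentages` are hypothetical and not taken from my data.

```python
import pandas as pd

# Toy stand-in for predictions24h: an 'id' column plus one column per run.
# All ids and labels here are made up for illustration.
predictions = pd.DataFrame({
    "id": ["cell_a", "cell_b", "cell_c"],
    "run0": ["Radial glia", "Neuron", "Glioblast"],
    "run1": ["Glioblast", "Neuron", "Glioblast"],
    "run2": ["Glioblast", "Neuron", "Glioblast"],
    "run3": ["Radial glia", "Neuron", "Radial glia"],
    "run4": ["Glioblast", "Neuron", "Glioblast"],
})


def label_percentages(df):
    """Return, per id, the percentage of runs in which each label was predicted."""
    # Reshape to long format: one row per (id, run) pair with its label.
    long = df.melt(id_vars="id", var_name="run", value_name="label")
    # Count labels per id and normalize each row so it sums to 1.
    fractions = pd.crosstab(long["id"], long["label"], normalize="index")
    return fractions * 100


percentages = label_percentages(predictions)
print(percentages.round(1))
```

Each row of `percentages` sums to 100, so a row with a single 100 entry means that cell received the same label in every run.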
Is this behaviour expected/documented for CellTypist? What do you recommend doing in this case?
Best Regards,
Manuel