Hi everyone, I'm clustering my single-cell data using Scanpy package, and I use rank_gene_groups to rank genes for characterizing groups. This returns names, scores, logfoldchanges, and pvals_adj for my Anndata format. I'm trying to interpret scores for this data, but I couldn't find any information except their source code. The values for scores are usually between about 1.2 to 39. This range doesn't make sense based on their definition, as below:
scores: structured `np.ndarray` (`.uns['rank_genes_groups']`)
Structured array to be indexed by group id storing the z-score
underlying the computation of a p-value for each gene for each
group. Ordered according to scores.
A couple of lines from my result is shown as follows. 0-n means the 'name' of cluster 0, followed by 0_s as 'scores', 0-p as 'pvals_adj', and 0_l as 'logfoldchanges' for cluster 0. Afterward, we have the same data for other clusters. (I have 5 clusters, but not all are shown in the image)
Could you please tell me what "scores" really mean in this result? Your help would be much appreciated! Thanks a lot in advance!
What makes you think 1.2 to 39 doesn't make sense for z-score?
Hi @shoujun.gu
The reason is that they are all positive values so I was wondering how could it be possible?! unless these are absolute values for z score, but I am not sure.
The default setting of "tl.rank_genes_groups" is to return top 100 marker gene. I think all positive z-score is expected.
Yes, it's correct that the default is n_genes=100, but I'm using all available genes as n_genes=adata.to_df().shape[1] in my code.
In the source code, 'rank_genes_groups' uses stats.ttest_ind_from_stats() for default method, which returns t-statistics. You may double check the value of your adata.shape[1], or try using something like adata.raw.shape[1] or adata.X.toarray().shape[1].
Sorry for the confusion, I updated my response above as I forgot to write the to_df() function above but I had it in my code. I have used n_genes=adata.to_df().shape[1] so that part is fine and I could see the size of the output is as I expect. Although the other way also works for me - without using to_df()