Question

Applying the metacell2 algorithm using python

13 days ago

JACKY ▴ 140

I've been trying to apply the divide-and-conquer Metacell algorithm in Python to classify cells. I've installed the necessary library and tried to follow the pipeline outlined here.

However, I'm struggling to understand the purpose and significance of each step in the process. For instance, the first step, 'Exclude', assesses whether any genes in my AnnData object correlate with lateral genes, which, from my understanding, is undesirable.

What should I do with this information? The guidelines aren't clear, and I'm also unsure what qualifies as a lateral gene.

Moreover, I'm confused about the differences between the 'Direct' step and the 'Divide and Conquer' step.

If anyone has experience with this method and can explain how to use it in a straightforward manner, I would greatly appreciate your guidance.

python single-cell scanpy metacell2 • 652 views

ADD COMMENT • link updated 4 days ago by Wayne ★ 2.0k • written 13 days ago by JACKY ▴ 140

"I'm also unsure what qualifies as a lateral gene."

The README you reference has an entire section on this under the heading 'lateral_gene mask':

"Lateral genes are forbidden from being selected for computing cells similarity (e.g., cell cycle genes). In version 0.8 these were called “forbidden” genes. Lateral genes are still counted towards the total UMIs count when computing gene expression levels for cells similarity. In addition, lateral genes are still used to compute deviant (outlier) cells. That is, each computed metacell should still have a consistent gene expression level even for lateral genes.
The motivation is that we don’t want the algorithm to even try to create metacells based on these genes. Since these genes may be very strong (again, cell cycle), they would overcome the cell-type genes we are interested in, resulting in for example an “M-state” metacell which combines cells from several (similar) cell types.
Deciding on the “right” list of lateral genes is crucial for creating high-quality metacells. We rely on the analyst to provide this list based on prior biological knowledge. To support this supervised task, we provide the relate_genes pipeline for identifying genes closely related to known lateral genes, so they can be added to the list."
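
To make the idea in that README passage concrete, here is a minimal pure-Python sketch, not the metacells API. It builds a boolean "lateral" mask from known gene names and patterns, then flags other genes whose expression correlates strongly with a known lateral gene as candidates to add to the list. The function names, the toy expression values, and the 0.8 correlation threshold are all assumptions made for illustration; the real relate_genes pipeline may work quite differently.

```python
import re


def lateral_gene_mask(gene_names, lateral_names=(), lateral_patterns=()):
    """Return one boolean per gene: True if the gene is treated as lateral."""
    names = set(lateral_names)
    regexes = [re.compile(pattern) for pattern in lateral_patterns]
    return [
        gene in names or any(regex.fullmatch(gene) for regex in regexes)
        for gene in gene_names
    ]


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def related_to_lateral(expression, is_lateral, min_correlation=0.8):
    """List non-lateral genes that correlate with at least one lateral gene."""
    lateral = [gene for gene, flag in is_lateral.items() if flag]
    candidates = []
    for gene, values in expression.items():
        if is_lateral[gene]:
            continue
        if any(pearson(values, expression[lat]) >= min_correlation for lat in lateral):
            candidates.append(gene)
    return candidates


# Toy per-cell expression values (hypothetical numbers, five cells).
expression = {
    "MKI67": [0, 5, 9, 1, 8],
    "MCM2": [1, 4, 8, 0, 7],
    "TOP2A": [1, 6, 8, 0, 9],
    "CD3E": [7, 1, 0, 8, 1],
    "GAPDH": [5, 5, 6, 5, 5],
}
genes = list(expression)
mask = lateral_gene_mask(genes, lateral_names=["MKI67"], lateral_patterns=["MCM[0-9]+"])
is_lateral = dict(zip(genes, mask))
candidates = related_to_lateral(expression, is_lateral)
print(mask)
print(candidates)
```

In this toy case TOP2A tracks the cell-cycle genes, so it is the kind of gene an analyst would consider adding to the lateral list before computing metacells.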

I note that mc.pl.relate_genes() is used in two notebooks in a repository by the same group. That repository contains all the code for reproducing the analysis from the manuscript "Time-Aligned Hourglass Gastrulation Models in Rabbit and Mouse", which was done with the metacells package:

2-metacells/mm_metacells.ipynb
2-metacells/oc_metacells.ipynb

Plus, there's a Vignettes repo with notebooks that "give examples for using the metacells."

Exploring those example notebooks & Vignettes would probably help with a lot of the specific things you mention and your broader question about using this package.

ADD REPLY • link 13 days ago by Wayne ★ 2.0k

I've followed this vignette, which is the only pipeline that does not rely on prior knowledge (as in my case). But I don't see that they performed the metagroup stage, which is a crucial step if you're familiar with the algorithm. They only repeated the metacell stage several times and stopped there.

ADD REPLY • link 12 days ago by JACKY ▴ 140

Yes, they are very clear about this. The page about the Vignettes says:

"They are not meant as a comprehensive documentation of all the features, and the data contained in them should not be used for any serious analysis."

Hopefully, the repo with "all the code for reproducing the analysis from the manuscript 'Time-Aligned Hourglass Gastrulation Models in Rabbit and Mouse'" is more informative for your needs.
Otherwise, at the bottom it clearly says:

"For help, please contact ofir.raz@weizmann.ac.il"

ADD REPLY • link 12 days ago by Wayne ★ 2.0k

Unfortunately, the repo does not provide any valuable information. The Metacell pipeline in the vignette is missing the metagroup stage (and they skipped all kinds of things to keep it simple), and the repo only directs me to another GitHub page with more Metacell functions, which I do not understand how to use. The explanation of the algorithm in the Metacell 2 paper is clear, but the implementation pipeline is not clear at all. Anyway, thank you for helping!

ADD REPLY • link 12 days ago by JACKY ▴ 140

score 0 · Answer 1 · 2024-05-07

12 days ago

Wayne ★ 2.0k

I had hoped that things like this Jupyter notebook for the rabbit data analysis, from the repo with "all the code for reproducing the analysis from the manuscript 'Time-Aligned Hourglass Gastrulation Models in Rabbit and Mouse'", which seems quite thorough, might help you get started when combined with the description of the algorithm.

ADD COMMENT • link 12 days ago by Wayne ★ 2.0k

The vignette I followed, although lacking, seems better since it focuses on the iterative process. I'll keep looking; perhaps I'll find some other tutorials for this package.

ADD REPLY • link 11 days ago by JACKY ▴ 140

Sorry, this was supposed to be a comment appended to the chain above. I didn't mean to post as an answer.

ADD REPLY • link 11 days ago by Wayne ★ 2.0k

I have one more question, please. When reading the h5ad file at the very beginning, do I need to perform the usual scanpy normalization steps before doing any metacell analysis? I mean these three steps:

sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

ADD REPLY • link 4 days ago by JACKY ▴ 140

I'm unsure. I will say that, searching for "normalize" in the metacells code, I see what looks like pertinent handling in metacells/pipeline/exclude.py and metacells/tools/bursty_lonely.py, particularly:

from metacells.tools.high import find_high_normalized_variance_genes
from metacells.tools.high import find_high_total_genes

and

htv_mask_series = find_high_normalized_variance_genes(
    ht_data, "downsampled", min_gene_normalized_variance=min_gene_normalized_variance, inplace=False
)

So if that ends up being part of the analysis applied to your data, it looks like the package handles normalization itself, and you would not want to do it upstream. That's just my guess based on a quick perusal. You can always try with and without and compare, or contact the authors. At the very least, post a separate question here, as this is probably a separate topic in itself.
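
To illustrate the guess above, here is a small pure-Python sketch, not the metacells implementation. It assumes that "normalized variance" means variance divided by mean, computed per gene on raw (or downsampled) counts, and it adds a hypothetical helper for checking whether a matrix still looks like raw counts. The helper names, the toy counts, and the 2.5 threshold are all assumptions made for illustration.

```python
import math


def looks_like_raw_counts(counts_by_gene):
    """True if every value is a non-negative whole number (i.e. not yet log-transformed)."""
    return all(
        value >= 0 and float(value).is_integer()
        for values in counts_by_gene.values()
        for value in values
    )


def normalized_variance(counts):
    """Variance divided by mean; returns 0.0 for a gene with no counts."""
    n = len(counts)
    mean = sum(counts) / n
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / n
    return variance / mean


def high_normalized_variance_genes(counts_by_gene, min_gene_normalized_variance):
    """Names of genes whose normalized variance reaches the threshold."""
    return [
        gene
        for gene, counts in counts_by_gene.items()
        if normalized_variance(counts) >= min_gene_normalized_variance
    ]


# Toy UMI counts for three genes across six cells (hypothetical numbers).
counts_by_gene = {
    "GENE_A": [0, 0, 12, 0, 0, 12],  # bursty: same mean as the others, far more variance
    "GENE_B": [4, 4, 4, 4, 4, 4],    # constant
    "GENE_C": [3, 5, 4, 4, 3, 5],    # mild noise
}

# What the same data would look like after a log1p step upstream.
log_transformed = {
    gene: [math.log1p(c) for c in counts] for gene, counts in counts_by_gene.items()
}

print(looks_like_raw_counts(counts_by_gene))
print(looks_like_raw_counts(log_transformed))
print(high_normalized_variance_genes(counts_by_gene, min_gene_normalized_variance=2.5))
```

If a statistic like this is computed internally on counts, then log-transforming upstream would change what it measures, which is one reason to compare runs with and without the scanpy steps before settling on either.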

ADD REPLY • link 4 days ago by Wayne ★ 2.0k