Hi everyone,
I'm working with 10x single-cell multiome tumor data (scRNA-seq + scATAC-seq), and my goal is to develop a method to identify CNV patterns using SNP profiles, with a focus on somatic mutations and potential haplotype assignment.
What I have done so far:
I used cellSNP-lite with a variant list filtered for MAF > 0.05 (~7M SNPs).
It outputs AD and DP matrices, which I load into an AnnData object. Each feature is a SNP in CHR_POS_ALT_REF format.
I filter outliers, bin SNPs, and then plan to explore CNV patterns.
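For concreteness, the loading step currently looks roughly like the sketch below. It assumes the usual cellSNP-lite output file names (cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx, cellSNP.samples.tsv, cellSNP.base.vcf.gz); exact names/compression depend on the options you ran with, and the CHR_POS_ALT_REF IDs are just my own naming convention:

```python
import gzip

import anndata as ad
import pandas as pd
import scipy.io as sio


def load_cellsnp(outdir):
    """Load cellSNP-lite AD/DP matrices into a single AnnData (cells x SNPs)."""
    # AD/DP are MatrixMarket files with SNPs as rows and cells as columns,
    # so transpose to the usual cells-as-obs orientation.
    AD = sio.mmread(f"{outdir}/cellSNP.tag.AD.mtx").tocsr().T
    DP = sio.mmread(f"{outdir}/cellSNP.tag.DP.mtx").tocsr().T

    barcodes = pd.read_csv(f"{outdir}/cellSNP.samples.tsv", header=None)[0].tolist()

    # Build CHR_POS_ALT_REF feature names from the base VCF.
    snp_ids = []
    with gzip.open(f"{outdir}/cellSNP.base.vcf.gz", "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _, ref, alt = line.rstrip("\n").split("\t")[:5]
            snp_ids.append(f"{chrom}_{pos}_{alt}_{ref}")

    adata = ad.AnnData(X=DP, layers={"AD": AD, "DP": DP})
    adata.obs_names = barcodes
    adata.var_names = snp_ids
    return adata


adata_gex = load_cellsnp("cellsnp_gex_out")  # hypothetical output directory
```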
With this setup:
- I detect ~1M SNPs for GEX (RNA)
- ~2.5M SNPs for ATAC -> both manageable and interpretable.
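The binning I mentioned above is essentially aggregating AD/DP into fixed genomic windows per cell, roughly like this (only a sketch; the 10 Mb bin size and the per-bin BAF-like ratio are my current choices, not anything standard, and it assumes chromosome names without underscores):

```python
import anndata as ad
import numpy as np
import pandas as pd
import scipy.sparse as sp


def bin_snps(adata, bin_size=10_000_000):
    """Aggregate per-SNP AD/DP counts into fixed-size genomic bins per cell."""
    # var_names are CHR_POS_ALT_REF; assumes chromosome names contain no underscores.
    var = pd.DataFrame(
        [v.split("_") for v in adata.var_names],
        columns=["chrom", "pos", "alt", "ref"],
    )
    bin_id = var["chrom"] + ":" + (var["pos"].astype(int) // bin_size).astype(str)
    bins = pd.Categorical(bin_id)

    # SNP -> bin assignment matrix (n_snps x n_bins), so summing per bin is a matmul.
    assign = sp.csr_matrix(
        (np.ones(adata.n_vars), (np.arange(adata.n_vars), bins.codes)),
        shape=(adata.n_vars, len(bins.categories)),
    )

    AD_bin = adata.layers["AD"] @ assign  # cells x bins, summed ALT counts
    DP_bin = adata.layers["DP"] @ assign  # cells x bins, summed total depth

    binned = ad.AnnData(X=DP_bin, layers={"AD": AD_bin, "DP": DP_bin})
    binned.obs_names = adata.obs_names
    binned.var_names = list(bins.categories)

    # Per-cell, per-bin "BAF-like" fraction; bins with zero depth stay NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        binned.layers["BAF"] = np.asarray(AD_bin.todense()) / np.asarray(DP_bin.todense())
    return binned
```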
Problem with larger variant database:
When I use a larger variant database (MAF > 0.0005, ~36.6M SNPs):
- ~6M SNPs for GEX
- ~13M SNPs for ATAC
That’s a lot of data. My concern: since this is tumor data with high heterogeneity and a rich mutation landscape, relying only on known variants might cause me to miss relevant somatic variants.
Ideas I’m considering:
- Running cellSNP-lite in de novo mode (without a variant list), but the runtime increases dramatically, and I don’t know how large the resulting data would get.
An alternative is:
- Split BAMs by cell barcode
- Run bcftools mpileup in parallel
- Build AD/DP matrices from the VCFs
- Create an AnnData object from these counts.
But again, this would probably also result in an immense amount of data.
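To make the matrix-building step concrete, here is roughly what I have in mind (only a sketch, not benchmarked; it assumes one VCF per barcode, named by barcode, produced with FORMAT/AD and FORMAT/DP annotations, e.g. via bcftools mpileup -a FORMAT/AD,FORMAT/DP piped into bcftools call):

```python
from pathlib import Path

import anndata as ad
import pysam
import scipy.sparse as sp


def vcfs_to_anndata(vcf_dir):
    """Aggregate per-cell VCFs (one per barcode) into a cells x SNPs AnnData.

    Assumes each VCF carries FORMAT/AD and FORMAT/DP for a single sample.
    """
    vcf_paths = sorted(Path(vcf_dir).glob("*.vcf.gz"))
    barcodes = [p.name.replace(".vcf.gz", "") for p in vcf_paths]

    snp_index = {}  # CHR_POS_ALT_REF -> column index
    rows, cols, ad_vals, dp_vals = [], [], [], []

    for i, path in enumerate(vcf_paths):
        with pysam.VariantFile(str(path)) as vf:
            for rec in vf:
                if rec.alts is None or "AD" not in rec.format or "DP" not in rec.format:
                    continue
                key = f"{rec.chrom}_{rec.pos}_{rec.alts[0]}_{rec.ref}"
                j = snp_index.setdefault(key, len(snp_index))
                sample = next(iter(rec.samples.values()))  # single-sample VCF per cell
                allele_depths, depth = sample["AD"], sample["DP"]
                if depth is None or allele_depths is None or len(allele_depths) < 2:
                    continue
                rows.append(i)
                cols.append(j)
                ad_vals.append(allele_depths[1])  # ALT allele depth
                dp_vals.append(depth)

    shape = (len(vcf_paths), len(snp_index))
    AD = sp.coo_matrix((ad_vals, (rows, cols)), shape=shape).tocsr()
    DP = sp.coo_matrix((dp_vals, (rows, cols)), shape=shape).tocsr()

    adata = ad.AnnData(X=DP, layers={"AD": AD, "DP": DP})
    adata.obs_names = barcodes
    adata.var_names = list(snp_index)
    return adata
```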
My questions:
- Is the de novo approach to SNP calling worth the huge extra computational cost for CNV pattern detection in tumor datasets?
- Would filtering BAMs (based on flags) and splitting them by cell barcode be a reasonable and scalable pipeline?
- And importantly, should I just stick to a known variant list and accept the tradeoff in sensitivity?
Any insights or experiences would be greatly appreciated!