Hello,
My results from cellranger and starsolo are very different.
With default parameters, cellranger count estimated about 9400 cells, while starsolo estimated about 3600. I am confused by this large difference. In the cellranger results there are two big cell clusters with very high UMI counts (>25k), while the remaining clusters have UMI counts <5k. The starsolo results have a more even UMI distribution across clusters.
I wonder whether starsolo applies a doublet filter during cell calling. What is the reason for the difference?
Thank you!
Details:
starsolo:
STAR --genomeDir starsolo --soloType CB_UMI_Simple --soloCBwhitelist 10x_V3_whitelist.txt --soloUMIlen 12 --readFilesIn ${wd}/2270183_P7_2_S2_L001_R2_001.fastq.gz,${wd}/2270183_P7_2_S2_L002_R2_001.fastq.gz ${wd}/2270183_P7_2_S2_L001_R1_001.fastq.gz,${wd}/2270183_P7_2_S2_L002_R1_001.fastq.gz --runThreadN 20 --outFileNamePrefix s1 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped elp1_s2_Unmapped --twopassMode Basic --chimSegmentMin 20 --readFilesCommand zcat --clipAdapterType CellRanger4 --outFilterScoreMin 30 --soloCBmatchWLtype 1MM_multi_Nbase_pseudocounts --soloUMIfiltering MultiGeneUMI_CR --soloUMIdedup 1MM_CR
Barcodes.stats:
nNoAdapter 0
nNoUMI 0
nNoCB 0
nNinCB 0
nNinUMI 6289
nUMIhomopolymer 286950
nTooMany 0
nNoMatch 21844303
nMismatchesInMultCB 0
nExactMatch 887950279
nMismatchOneWL 5268661
nMismatchToMultWL 19035523
Features.stats:
nUnmapped 89644891
nNoFeature 373085155
nAmbigFeature 13249278
nAmbigFeatureMultimap 11984809
nTooMany 1363653
nNoExactMatch 0
nExactMatch 427453001
nMatch 434911486
nCellBarcodes 2149182
nUMIs 131380805
Summary.csv:
Number of Reads,934392005
Reads With Valid Barcodes,0.974849
Sequencing Saturation,0.697914
Q30 Bases in CB+UMI,0.948347
Q30 Bases in RNA read,0.923572
Reads Mapped to Genome: Unique+Multiple,0.898433
Reads Mapped to Genome: Unique,0.756247
Reads Mapped to Transcriptome: Unique+Multipe Genes,0.479628
Reads Mapped to Transcriptome: Unique Genes,0.465449
Estimated Number of Cells,3618
Reads in Cells Mapped to Unique Genes,239834914
Fraction of Reads in Cells,0.551457
Mean Reads per Cell,66289
Median Reads per Cell,59441
UMIs in Cells,67707559
Mean UMI per Cell,18714
Median UMI per Cell,16493
Mean Genes per Cell,4247
Median Genes per Cell,4175
Total Genes Detected,23201
cellranger:
$cellranger count --id=s1 --transcriptome=refdata-gex-mm10-2020-A --fastqs=2-1649641 --localcores=20 --localmem=300
Estimated Number of Cells | 9449
Mean Reads per Cell | 98887
Median Genes per Cell | 1583
Number of Reads | 934392005
Valid Barcodes | 97.10%
Sequencing Saturation | 69.90%
Q30 Bases in Barcode | 94.90%
Q30 Bases in RNA Read | 92.40%
Q30 Bases in UMI | 94.70%
Reads Mapped to Genome | 90.00%
Reads Mapped Confidently to Genome | 85.30%
Reads Mapped Confidently to Intergenic Regions | 7.00%
Reads Mapped Confidently to Intronic Regions | 26.40%
Reads Mapped Confidently to Exonic Regions | 51.90%
Reads Mapped Confidently to Transcriptome | 48.20%
Reads Mapped Antisense to Gene | 2.70%
Fraction Reads in Cells | 64.80%
Total Genes Detected | 23824
Median UMI Counts per Cell | 3577
You can check the number of UMIs per cell. They might be using a different cutoff for the minimum number of UMIs to call a cell versus an empty droplet.
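To illustrate how a UMI cutoff can change the number of called cells, here is a minimal Python sketch of a generic "order of magnitude" rule: take a robust maximum among the top expected cells and keep every barcode within a fixed ratio of it. The function names (knee_cutoff, call_cells), the parameter defaults and the toy counts are all assumptions made for illustration. This is not the exact procedure of either tool, and I have not checked which rule each tool applied in the runs above.

```python
def knee_cutoff(umi_counts, expected_cells=3000, quantile=0.99, max_ratio=10):
    """Return a UMI cutoff from a simple order-of-magnitude rule.

    Hypothetical helper: the defaults are illustrative assumptions.
    """
    ranked = sorted(umi_counts, reverse=True)
    top = ranked[:expected_cells]
    if not top:
        return 0.0
    # Robust maximum: skip the top (1 - quantile) fraction of the expected cells.
    idx = min(int(round((1 - quantile) * len(top))), len(top) - 1)
    robust_max = top[idx]
    return robust_max / max_ratio


def call_cells(umi_counts, **kwargs):
    """Keep every barcode whose UMI count reaches the cutoff."""
    cutoff = knee_cutoff(umi_counts, **kwargs)
    return [count for count in umi_counts if count >= cutoff]


if __name__ == "__main__":
    # Toy data (made up): a small very-high-UMI population, regular cells,
    # low-UMI cells and empty droplets.
    with_high = [50000] * 10 + [20000] * 90 + [3000] * 200 + [100] * 1000
    without_high = [20000] * 90 + [3000] * 200 + [100] * 1000
    for name, counts in [("with_high", with_high), ("without_high", without_high)]:
        cutoff = knee_cutoff(counts, expected_cells=100)
        n_cells = len(call_cells(counts, expected_cells=100))
        print(name, "cutoff:", cutoff, "cells called:", n_cells)
```

In this toy example a handful of very-high-UMI barcodes raises the cutoff enough to drop the whole low-UMI population, which is the kind of effect worth checking for when a sample contains clusters with very different UMI counts.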
The median UMI count per cell is 3577 in cellranger and 16493 in starsolo, which seems very high to me. So I wonder whether the issue is doublets rather than empty droplets.
Neither tool detects doublets. Don't worry about means/medians. Look at the range. If one tool goes from 50 to 50k UMIs and the other goes from 500 to 50k UMIs, there is your answer. You can also overlap the cells detected by both methods and see what the UMI range is for the cells that are called by only one method.
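The range and overlap check described above could be sketched in Python roughly as follows. It assumes you have already loaded, for each tool, a dict mapping each called cell barcode to its total UMI count; the function names and the toy barcodes are hypothetical. It also assumes the cellranger barcodes carry a "-1" style suffix that the starsolo barcodes lack, so the suffix is stripped before comparing.

```python
def normalize(barcode):
    """Drop a trailing '-1' style suffix so barcodes from both tools match."""
    return barcode.split("-")[0]


def umi_range(counts, barcodes):
    """Return (min, max) UMI count over the given barcodes, or None if empty."""
    values = [counts[bc] for bc in barcodes]
    return (min(values), max(values)) if values else None


def compare_calls(counts_a, counts_b):
    """Compare two barcode -> UMI count dicts of called cells.

    Returns, for shared / only-A / only-B barcodes, the number of cells and
    their UMI range (the shared range is reported using A's counts).
    """
    a = {normalize(bc): n for bc, n in counts_a.items()}
    b = {normalize(bc): n for bc, n in counts_b.items()}
    shared = a.keys() & b.keys()
    only_a = a.keys() - b.keys()
    only_b = b.keys() - a.keys()
    return {
        "shared": (len(shared), umi_range(a, shared)),
        "only_a": (len(only_a), umi_range(a, only_a)),
        "only_b": (len(only_b), umi_range(b, only_b)),
    }


if __name__ == "__main__":
    # Toy data (made up) standing in for the filtered output of each tool.
    cellranger_cells = {"AAAC-1": 30000, "AAAG-1": 4000, "AAAT-1": 600}
    starsolo_cells = {"AAAC": 29500, "AAAG": 3900}
    for group, (n_cells, rng) in compare_calls(cellranger_cells, starsolo_cells).items():
        print(group, n_cells, rng)
```

If the cells called by only one tool all sit at the low end of the UMI range, the difference is a cutoff effect rather than anything to do with doublets.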