Cannot use more than 2 sample IDs when demultiplexing using Demuxlet
1
4
Entering edit mode
3 months ago
Chilly ★ 1.3k

Hi Everyone, I currently have a scRNA-seq BAM file of 6 coral individuals mixed together with the genotype (SNP) of each coral. Coral individuals are labelled (i.e. sample IDs) 01, 02, 03, 04, 05, and 06, respectively. I merged the 6 genotype vcf files into one, and some of the results are as follows (I omitted the header starting with '##'):

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  01      02      03      04      05      06 
scahrs1_100     474321  .   A       G       35.64   PASS    BaseQRankSum=1.981;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=5.09;ReadPosRankSum=-0.712;SOR=0.223;DP=7;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  ./.:.:.:.:.    0/1:5,2:7:43:43,0,122    ./.:.:.:.:.    ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:. 
scahrs1_100     474331  .  C       G       35.64   PASS    BaseQRankSum=1.981;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=5.09;ReadPosRankSum=-0.566;SOR=0.223;DP=7;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  ./.:.:.:.:.    0/1:5,2:7:43:43,0,122    ./.:.:.:.:.    ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:. 
scahrs1_100     1349979 .  T       C       60.64   PASS    BaseQRankSum=-4.543;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=2.76;ReadPosRankSum=0.033;SOR=1.085;DP=22;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  ./.:.:.:.:.    ./.:.:.:.:.      ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     0/1:9,13:22:30:68,0,30 
scahrs1_100     1399140 .       A       C       63.64   PASS    BaseQRankSum=0;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=7.95;ReadPosRankSum=-1.611;SOR=1.179;DP=8;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  ./.:.:.:.:.    ./.:.:.:.:.      ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     0/1:3,5:8:33:71,0,33 
scahrs1_100     1742935 .       G       T       30.64   PASS    BaseQRankSum=4.912;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=1.18;ReadPosRankSum=0.363;SOR=0.446;DP=26;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  ./.:.:.:.:.    ./.:.:.:.:.      ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     0/1:15,11:26:38:38,0,76 
scahrs1_100    1945135 .       C       T       98.64   PASS    BaseQRankSum=4.216;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=4.48;ReadPosRankSum=-0.1;SOR=1.214;DP=24;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  0/1:13,9:22:99:106,0,182        ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:. 
scahrs1_100     1945147 .       C       T       98.64   PASS    BaseQRankSum=4.216;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=4.48;ReadPosRankSum=0.735;SOR=1.214;DP=22;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  0/1:13,9:22:99:106,0,182        ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:. 
scahrs1_100     7089066 .       C       A       67.64   PASS    BaseQRankSum=0.674;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=13.53;ReadPosRankSum=1.036;SOR=0.446;DP=6;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  0/1:3,2:5:75:75,0,102   ./.:.:.:.:.     ./.:.:.:.:.    ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:. 
scahrs1_100     7089067 .  G       A       67.64   PASS    BaseQRankSum=1.383;ExcessHet=0;FS=0;MQ=60;MQRankSum=0;QD=13.53;ReadPosRankSum=0.674;SOR=0.446;DP=6;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1 GT:AD:DP:GQ:PL  0/1:3,2:5:75:75,0,102   ./.:.:.:.:.     ./.:.:.:.:.    ./.:.:.:.:.     ./.:.:.:.:.     ./.:.:.:.:.

When I try to demultiplex using demuxlet, any TWO sample IDs (e.g. 01 & 02, or 02 & 03) can run the pipeline completely, and the results are as expected. For example:

demuxlet --sam ~/tmp.storage/Sca_GEX_1_2023-9.bam --vcf ~/tmp.storage/all.coral.ID.vcf --sm 01 --sm 02 --out ~/demuxlet.result.di/Sca_GEX_1.demuxlet --group-list ~/tmp.storage/Group1_0min.tsv --field GT

will output 3 files: output 3 files

And the result summary of samples 01 & 02 shows that many single cells can be identified (below, 318 cells and 976 cells). This indicates that the source file and pipeline run well.

(Demultiplex) u67@hoff:~$ singularity exec Demuxafy.sif bash Demuxlet_summary.sh ~/demuxlet.result.di/Sca_GEX_1.demuxlet.best       
Classification  Assignment N 
AMB-01-02-01/02 511 
AMB-01-02-02/01 47 
AMB-02-01-01/02 558 
AMB-02-01-02/01 85 
DBL-01-02-0.500 2 
DBL-02-01-0.500 1 
SNG-01  318 
SNG-02  976

But whenever I expand the sample ID to Three or more, the following error occurs:

(Demultiplex) u67@hoff:~$ demuxlet --sam ~/tmp.storage/Sca_GEX_1_2023-9.bam --vcf ~/tmp.storage/all.coral.ID.vcf --sm 01 --sm 02 --sm 03 --out ~/demuxlet.result.di/Sca_GEX_1.demuxlet --group-list ~/tmp.storage/Group1_0min.tsv --field GT

Available Options

The following parameters are available. Ones with "[]" are in effect:
   Options for input SAM/BAM/CRAM : --sam [/mnt/data/dayhoff/home/u6798856/tmp.storage/Sca_GEX_1_2023-9.bam],
                                    --tag-group [CB], --tag-UMI [UB]
        Options for input VCF/BCF : --vcf [/mnt/data/dayhoff/home/u6798856/tmp.storage/all.coral.ID.vcf],
                                    --field [GT], --geno-error [0.01],
                                    --min-mac [1], --min-callrate [0.50],
                                    --sm [01, 02, 03], --sm-list
                   Output Options : --out [/mnt/data/dayhoff/home/u6798856/demuxlet.result.di/Sca_GEX_1.demuxlet],
                                    --alpha, --write-pair,
                                    --doublet-prior [0.50],
                                    --sam-verbose [1000000],
                                    --vcf-verbose [10000]
           Read filtering Options : --cap-BQ [40], --min-BQ [13],
                                    --min-MQ [20], --min-TD,
                                    --excl-flag [3844]
   Cell/droplet filtering options : --group-list [/mnt/data/dayhoff/home/u6798856/tmp.storage/Group1_0min.tsv],
                                    --min-total, --min-uniq, --min-snp

Run with --help for more detailed help messages of each argument.

NOTICE [2024/08/21 18:40:50] - Finished loading 2499 droplet/cell barcodes to consider
NOTICE [2024/08/21 18:40:50] - Finished identifying 3 samples to load from VCF/BCF

FATAL ERROR -
[E:int32_t main(int32_t, char**) Cannot read any single variant from /mnt/data/dayhoff/home/u6798856/tmp.storage/all.coral.ID.vcf]

terminate called after throwing an instance of 'pException'
  what():  Exception was thrown
Aborted (core dumped)

The error says that it cannot read any variant in the vcf file. I also tried replacing --sm with --sm-list: any two sample IDs work, but more than two will produce the same error as above.

The package's sample dataset, which has 13 sample IDs, works fine on my server. I don't understand why this error occurs in my dataset and how to avoid it. If you know where the problem is, please let me know. Thanks.

SNP variant demultiplexing Demuxlet scRNAseq • 244 views
ADD COMMENT
7
Entering edit mode
3 months ago
Chilly ★ 1.3k

Question solved. just add --min-callrate 0 and everything would work. Cheers.

ADD COMMENT

Login before adding your answer.

Traffic: 1494 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6