Hi al,
I'm looking to parallelise my variant-calling using mpileup on a per-chromosome basis as to run in serial for a whole genome takes many, many days.
However, I've noticed that I get different variants called in the two separate analyses. Which is a bit disconcerting!
As a test, I ran the following two commands on chromosome Y for a pair of sample;
samtools mpileup -r chrY:50000000-60000000 -u -q 0 -f homo_sapiens.fa tumour.bam normal.bam | bcftools/bcftools view -bcvg - > chrY.sub.bcf
bcftools/bcftools view chrY.sub.bcf > chrY.sub.vcf
samtools mpileup -r chrY -u -q 0 -f homo_sapiens.fa tumour.bam normal.bam | bcftools view -bcg - > norecal.chrY.full.allPos.bcf
bcftools view chrY.full.allPos.bcf > chrY.full.allPos.vcf
There are 25 variants in the subset that are not called in the full-chromosome analysis. The calls look to be poor quality, so I guess I'm not too worried about them being lost in the analysis of the whole chromosome. But I still would like to get to the bottom of the differences.
They all seem to be quite high-depth, so I'm wondering if mpileup is doing down-sampling in the variant calling, which leads to different reads being used in the analysis each time? Would be BAQ calculation also depend on the total number of reads that you had.
Here I show output for two discrepant calls.
Variant call in the subset
chrY 58828570 . C A 4.15 . DP=1626;VDB=0.0394;AF1=1;AC1=4;DP4=149,125,620,594;MQ=0;FQ=-52.1;PV4=0.35,0,1,1 GT:PL:GQ 1/1:15,255,0:19 1/1:23,255,0:27
Output in the full-chromosome example
chrY 58828570 . C . 3.52 . DP=1626;VDB=0.0394;AF1=1;AC1=4;DP4=149,125,620,594;MQ=0;FQ=-52;PV4=0.35,0,1,1 PL 19 16
Variant call in the subset
chrY 58839518 . G T 3.56 . DP=886;VDB=0.0020;AF1=1;AC1=4;DP4=90,97,289,343;MQ=0;FQ=-52;PV4=0.62,1.8e-287,1,0.0042 GT:PL:GQ 1/1:22,238,0:24 1/1:15,202,0:17
Output in the full-chromosome example
chrY 58839518 . G . 5.41 . DP=886;VDB=0.0020;AF1=1;AC1=4;DP4=90,97,289,343;MQ=0;FQ=-50.7;PV4=0.62,1.8e-287,1,0.0042 PL 17 15
can you post the actual commands you used? in the first command, you're sending a .bcf file in after the -f flag. that should be a fasta.