Dear all,
Today I'm trying to clarify my understanding of --max-depth
argument from bcftools mpileup
command, which is currently described as:
At a position, read maximally INT reads per input file. Note that the original samtools mpileup command had a minimum value of 8000/n where n was the number of input files given to mpileup. This means that in samtools mpileup the default was highly likely to be increased and the -d parameter would have an effect only once above the cross-sample minimum of 8000. This behavior was problematic when working with a combination of single- and multi-sample bams, therefore in bcftools mpileup the user is given the full control (and responsibility), and an informative message is printed instead [250]
My question is: I get that this argument sets the maximum number of reads used to study each position per input file, but is there any minimum threshold (e.g. 100) below which user-specified value has no effects?
For example, I'm currently using bcftools' mpileup
+ call
pipeline for variant calling of a 30x WGS sample:
$ fasta=/Users/sbslee/OneDrive/TestFiles/genome.fa
$ bam=/Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bam
$ region=chr2:234663648-234663651
$ bcftools mpileup -Ou -f $fasta -r $region $bam | bcftools call -mv -Ov
Which gives:
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[W::hts_idx_load3] The index file is older than the data file: /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bai
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 250
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15+htslib-1.15
##bcftoolsCommand=mpileup -Ou -f /Users/sbslee/OneDrive/TestFiles/genome.fa -r chr2:234663648-234663651 /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bam
##reference=file:///Users/sbslee/OneDrive/TestFiles/genome.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.15+htslib-1.15
##bcftools_callCommand=call -mv -Ov; Date=Sat Apr 2 16:16:56 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA18519_PyPGx
chr2 234663649 . C G 225.417 . DP=35;VDB=0.241828;SGB=-0.693136;MQSBZ=0;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,17,18;MQ=60 GT:PL 1/1:255,105,0
Because I didn't change --max-depth
, mpileup
used the default value of 250 and the sample had a total of 35 reads for chr2-234663649-C-G
(i.e. DP=35
).
Now, by setting --max-depth 10
I would expect the variant to be still called but with DP=10
instead of DP=35
. Let's run this:
$ bcftools mpileup -Ou -f $fasta -r $region --max-depth 10 $bam | bcftools call -mv -Ov
Which gives:
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[W::hts_idx_load3] The index file is older than the data file: /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bai
[mpileup] 1 samples in 1 input files
[mpileup] maximum number of reads per input file set to -d 10
Note: The maximum per-sample depth with -d 10 is 10.0x
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.15+htslib-1.15
##bcftoolsCommand=mpileup -Ou -f /Users/sbslee/OneDrive/TestFiles/genome.fa -r chr2:234663648-234663651 -d 10 /Users/sbslee/OneDrive/PyPGx/GeT-RM/grch37-bam/NA18519_PyPGx.sorted.markdup.recal.bam
##reference=file:///Users/sbslee/OneDrive/TestFiles/genome.fa
##contig=<ID=chrM,length=16571>
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##ALT=<ID=*,Description="Represents allele(s) other than observed.">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=IDV,Number=1,Type=Integer,Description="Maximum number of raw reads supporting an indel">
##INFO=<ID=IMF,Number=1,Type=Float,Description="Maximum fraction of raw reads supporting an indel">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias for filtering splice-site artefacts in RNA-seq data (bigger is better)",Version="3">
##INFO=<ID=RPBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Read Position Bias (closer to 0 is better)">
##INFO=<ID=MQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality Bias (closer to 0 is better)">
##INFO=<ID=BQBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Base Quality Bias (closer to 0 is better)">
##INFO=<ID=MQSBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Mapping Quality vs Strand Bias (closer to 0 is better)">
##INFO=<ID=SCBZ,Number=1,Type=Float,Description="Mann-Whitney U-z test of Soft-Clip Length Bias (closer to 0 is better)">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=SGB,Number=1,Type=Float,Description="Segregation based metric.">
##INFO=<ID=MQ0F,Number=1,Type=Float,Description="Fraction of MQ0 reads (smaller is better)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Average mapping quality">
##bcftools_callVersion=1.15+htslib-1.15
##bcftools_callCommand=call -mv -Ov; Date=Sat Apr 2 16:30:10 2022
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA18519_PyPGx
chr2 234663649 . C G 225.417 . DP=33;VDB=0.184924;SGB=-0.693127;MQSBZ=0;FS=0;MQ0F=0;AC=2;AN=2;DP4=0,0,16,17;MQ=60 GT:PL 1/1:255,99,0
As you can see, the total depth decreased from 35 to 33 but it's nowhere close to our expected 10. Can someone please enlighten me what I'm missing here?
FYI, I'm using the latest version of bcftools available in bioconda (1.15):
$ conda list | grep "bcftools"
bcftools 1.15 hd1821b6_2 bioconda
P.S. This question has been cross-posted onto bcftools GitHub repository issue.
In general, please don't crosspost — it just leads to duplication of effort as different people answer in different forums.
In particular, please don't crosspost between this Q&A forum and the tool's GitHub repository. Unless you are sure you have encountered a specific bug in the tool, if you are considering whether to post a question here or as an issue in the tool's repository, it is always more beneficial to post as a question here in a Q&A forum with a wider audience.