Question

Mpileup -D And -S

1

Entering edit mode

12.0 years ago

GPR ▴ 390

I am calling out SNPs from a high coverage (> 20fold) 75bp paired-end, > 400 million reads data set. I have used a standard BWA>MARK/REMOVE DUPLICARES (PICARD)>SAMTools flowthrough. My samtools mpileup command line is: samtools mpileup -uf genome.fa input-1.bam input-2.bam | bcftools view -bvcg -.> output.bcf bcftools view output.bcf | vcfutils.pl varFilter -D100 > varflt.vcf I did however realize that for high coverage,I should use the variable -S. My question is, is -D100 an ideal threshold? Regarding -S, what's an appropriate/standard value? Many thanks.

mpileup vcf • 3.1k views

ADD COMMENT • link updated 12.0 years ago by swbarnes2 14k • written 12.0 years ago by GPR ▴ 390

Istvan Albert · Answer 1 · 2012-11-12

0

Entering edit mode

12.0 years ago

Ian 6.1k

In regards to vcfutil.pl varfilter -D I also asked a question about this:

How to interpret -D file in samtools variant calling pipeline?

The thing to be clear about is that any position that exceeds the coverage of '-D' will be excluded from the results. When i run variant detection i usually set '-D' to be 3-5x the mean depth of coverage for the genome. This seems to work well, but still excludes areas where there is artificially high coverage.

Not sure what you mean about '-S'? For which, program?

ADD COMMENT • link updated 12.0 years ago by Istvan Albert 101k • written 12.0 years ago by Ian 6.1k

0

Entering edit mode

Thanks, got a clear idea regarding -D. -S is strand bias p-value in bcftools. I want to add this flag in: bcftools view *.bcf | vcfutils.pl varFilter -D1000(or another value -S (which value???) > output.vcf Since -S is a p-value, I assume 0.05-0.1 would be reasonable, but want to make sure. Thanks.

ADD REPLY • link 12.0 years ago by GPR ▴ 390

score 0 · Answer 2 · 2012-11-12

As Ian answered I would set -D to be 3-5 times of the mean coverage. I usually keep it 3X. This way you would discard false postitive SNPs resulting in the CNV regions OR highly similiar regions like paralogs. -S is a p-value but I usually dont care a much about it and use a very low cut off .0001. There have been many papers those have used the same p-value for strand bias.

score 0 · Answer 3 · 2012-11-12

0

Entering edit mode

12.0 years ago

swbarnes2 14k

I wouldn't call 20x high coverage. I'd say high coverage is 100x, at least.

ADD COMMENT • link 12.0 years ago by swbarnes2 14k