Question

Maximum Read Depth

1

Entering edit mode

12.9 years ago

Anjali ▴ 60

Hi,

I wish to get the variants of my sequencing data using Samtools (vcfutils). For this I need to specify the maximum read depth in VarFilter function. I have no information of this from my Data. Can anybody tell me, how should I approach to its calculations, or considering default values to obtain SNPs in vcf format??

Thanking you

vcf read samtools • 5.6k views

ADD COMMENT • link updated 18 months ago by chrisgr ▴ 20 • written 12.9 years ago by Anjali ▴ 60

score 1 · Answer 1 · 2012-01-16

1

Entering edit mode

12.9 years ago

Doctoroots ▴ 800

generally, maximal coverage for variant calling is set out of memory considerations, for this purpose i think the practice recommended in the GATK Unified Genotyper should be appropriate:

"...When running on projects with many samples at low coverage (e.g. 1000 Genomes with 4x coverage per sample) we usually lower this [the maximal coverage] value to about 10 times the average coverage (40x)..."

good luck

ADD COMMENT • link 12.9 years ago by Doctoroots ▴ 800

0

Entering edit mode

Is the average coverage the average of the depths of all the variants in a VCF file? Or the average coverage of every mapped base in the exome/genome?

ADD REPLY • link 18 months ago by chrisgr ▴ 20

Ram · Answer 2 · 2012-01-16

I have previously considered this. The '-D' flag allows the program to avoid regions that have abnormally high coverage, e.g. PCR amplification errors. I calculated the mean/median coverage for the sample using BEDTools, which was also covered a previous Biostars question.

genomeCoverageBed -d -ibam file_sorted.bam -g genome_seq.fasta > file_sorted.1bp_coverage_inc0

This method does includes 0 counts for regions not covered by reads.

You can put the output in R or i used the Perl script at the bottom of the above link by Heikki.

To set '-D' i used a value 3 to 5 x that of the mean coverage, you'll to play with this to get the best result for your samples. I saw very little difference in the number of SNPs using any multiplier >3.

Bear in mind SNPs will not be reported for those regions with coverage > '-D'. Setting the value too high will result in too many false positives.

Also it is hightly recommended to locally realign reads to avoid false positive SNPs due to the presence of INDELs, e.g. SRMA or use of the GATK suite.

score 0 · Answer 3 · 2012-01-16

0

Entering edit mode

12.9 years ago

Pascal ★ 1.5k

Why don't you just use it without setting this parameter. There is a default parameter value set to 1000.

ADD COMMENT • link 12.9 years ago by Pascal ★ 1.5k