Hi everyone,
I have BAM files of some ancient genomes (for example: 17X), authors mentioned in the paper that they genotyped all genomes individually using samtools (v1.3.1) mpileup –C50 and bcftools (v1.3.1) using the consensus caller. Briefly, the steps are as follows:
- Run samtools mpileup with the -C parameter set to 50 and only included sites with a depth of coverage greater than 1/3 the average depth and lower than 2 times the average depth.
- Filtered out variant calls with a phred posterior probability lower than 30.
- Filtered out variants with significant strand and/or end distance bias (p<1e-4) or located within 5 bp of each other.
- Discarded heterozygote calls if allelic balance for the minor allele was less than 0.25.
My initial command-line is:
samtools mpileup -C50 -uf ref.fa aln.bam | bcftools call -cO v -o study.vcf
I have little idea to design the simple samtools command-line with bcftools piping consensus caller but don't know how to apply filters.. can anyone please help me in command-line design.
Thanks a lot!
-sohail
Thanks @kevin for your kind response..
it's really helpful. however, i have some confusions. The steps that you mentioned, have you tested while analyzing ancient genomes data?
After reading your guidelines i designed my command-line (based on the criteria that i mentioned in question previously) for ancient BAM samples as follows:
STEP1: I am interested in calling all sites variants/non-variant omitting --variants-only option in BCFTOOLS Since i am using single sample BAM omitting --per-sample-mF
STEP2 and STEP3 as you suggested.
what does this -m-any used for in step 2 and what is --pval-threshold 1.0 in STEP 1 ?
In STEP4 i am still not sure how apply the following filters:
Why you set most of the options 0.05 in vcfutils.pl command? can you please guide me in applying above mentioned filters? or if you have tested your set parameters performed better in ancient DNA analysis, please share your experience.
Thanks a lot!
Hey SOHAIL, this has not been tested on ancient DNA, however, it has been tested rigorously in a clinical genetics laboratory, so, on varying tissue types and of varying qualities. It is tailoured for detecting germline calls, but it can still identify somatic and low frequency variants (like one may expect from circulating free DNA, FFPE DNA, or ancient DNA) when the parameters are altered.
In STEP 2 / 3,
-m-any
left-aligns indels and splits multi-allelic calls.--pval-threshold
is used with--consensus-caller
and is, to put it simply, the probabiltiy threshold in order to break the null hypothesis that the alternate allele is the reference allele. When it is set to 1.0, the threshold is relaxed, therefore, more variants will be called (but these will be less reliable calls). Sometimes this is necessary, and I have detected even Sanger-confirmed variants this way.For STEP4,
I believe that this is mentioned here: SAMtools/BCFtools specific information (please \read through the various flags)
-1 FLOAT min P-value for strand bias (given PV4) [0.0001]
You can just do this with BCFtools (
bcftools view
, I believe), by filtering on the AF (allele frequency) tag.Thanks!