I am trying different things in Samtools to get a reasonable .vcf file for 2 human exomes I have sequenced.
Whatever I do I keep getting a file of ~600Mb which has millions of rows and can't fully open in Excel. I would expect ~22,000 variant calls per person so I assume this file is filled with sequencing blips that aren't genuine variants.
The method I have tried are making a pileup and converting a bcf to vcf via the varFilter command:
samtools mpileup -g -f hg19.fa aln.srt.bam > raw.bcf
bcftools view -vcg raw.bcf > var.bcf
bcftools view var.bcf | perl vcfutils.pl varFilter > var-final.vcf
I have also tried this:
samtools mpileup -uf hg19.fa srt.bam | bcftools call -mv - > file.vcf
Can anyone tell me how to best call rare variants in Samtools?
Many thanks
Just because you got more than expected doesn't mean they are fake. Look into the data and see what is present! Where did you get the idea people carry 22,000 variants?
Your samtools command is fine, other tools will make other lists of variants.
Samtools doesn't know what a rare variant is, you'll have to define rare and do some post-filtering. Maybe you want nonsynonymous exonic variants only?
I think OP needs to follow GATK Best Practices, in that OP needs to annotate variants and then filter by functionality, region, MAF to get to their desired result.
Hi thanks for your reply, I have looked at the data and that is why I think they are fake. I am getting millions of variant calls, the literature suggests and general is that the average exome carries around 22,000 variants (this includes synonymous changes). I also have experience analysing exome data once the variants have been called (by someone else). This is why I was hoping for something similar. I thought using varFilter would help select genuine variants, and I have also used
-D
to change the depth of the call I want to keep but nothing changed in my output.I was putting off using GATK because I didn't find it user friendly, not that any of these tools are to someone who has no experience at this! You're right, I do want variants that are nonsyn and exonic. I also want to scream into a pillow.
If the sample was good and the sequencing was done to a sufficient depth, variants can rarely be fake. Your target will need filtering to be found. Try picking only exonic variants - and remember, samtools is really sensitive and good for low coverage data, but GATK is better for high coverage human data. Maybe an expert on samtools can give you better inputs on optimizing it for your case.
GATK might look like it is a bit more complicated, but it is definitely so for a reason. It has a ton of downstream analysis tools and the best documentation I've seen yet. You'd benefit by giving it a go. Most problems you ever face in GATK will have a straightforward solution already discussed on their forums.
Are these millions of sites really variant from reference? Maybe you're seeing all callable sites including those homozygous-reference.
The other possibility is that the BAM is not aligned to this hg19.fa, so you will get basically every site called as mutant. You have to be sure the BAM is aligned to the same reference file as you call variants with. Maybe you can post some of the VCF you think is wrong?
(a lot of GATK's unfriendliness is to prevent these mistakes)