How to filter SNPs after Freebayes, if at all, in a fungal haploid genome alignment?
0
0
Entering edit mode
7.2 years ago
Lina F ▴ 200

Hi all,

I aligned Illumina reads of a haploid fungal genome against a reference sequence and then called SNPs with Freebayes:

/home/ubuntu/freebayes/bin/freebayes -f reference.fasta alignment.bam --ploidy=1 --min-alternate-qsum 30 -F 0.05 > snps.vcf

I then followed this tutorial to filter the output of freebayes further using these commands with vcffilter:

  • -f “SRP > 20” (Strand balance probability for the reference allele)
  • -f “SAP > 20” (Strand balance probability for the alternate allele)
  • -f “EPP > 20” (End Placement Probability)
  • -f “QUAL > 30” (phred scaled variant quality)
  • -f “DP > 100” (depth)

This reduced the SNP count from about 1,700 to about 15.

My questions are as follows:

  1. Is there a set of filtering techniques you start with that you then tweak for different applications?
  2. This filtering strategy is obviously somewhat stringent, since the SNP calls were significantly reduced -- is this too aggressive?

Thanks for any advice!

freebayes SNP filtering • 4.9k views
ADD COMMENT
1
Entering edit mode

That doesn't really make much sense to me... why are you requiring depth > 100, which seems very extreme? Do you even have 100x average coverage? Also, why "-F 0.05" - what exactly are you looking for? I assume that flag tells Freebayes to call variations down to 5% allele frequency.

I'd look at your bam in IGV and see if the variations getting filtered out look real or not.

ADD REPLY
0
Entering edit mode

I have >100X coverage, so I should be ok on that front. I will take a look at IGV.

Ultimately, I would like to script this pipeline so that I don't have to look at all genomes manually but maybe there is no way around that.

ADD REPLY
0
Entering edit mode

Coverage is pretty variable so even if the average is, say, 300x I still don't see a good reason to filter out variants with coverage under 100x. Maybe 10x.

The goal of looking at it manually in this case is not to manually curate all genomes you process, just to find settings that work. Once you identify which variations look real to you that are being removed, you can figure out which filter they are failing and adjust that parameter. If you are looking for germline variations in a haploid, your initial frequency cutoff of 0.05 should probably be increased a lot, perhaps to 0.2, to avoid false positives.

ADD REPLY

Login before adding your answer.

Traffic: 1918 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6