Filtering multi-allelic sites in VCF files
4.0 years ago
ozankiratli ▴ 150

I am trying to filter multi-allelic sites in a VCF file to find the true multi-allelic sites. Some of these sites have an allele supported by only 1 read in 200, which I want to get rid of; the other alternate allele looks good. However, when I try to filter out alleles below a frequency threshold (0.05) with vcftools, it removes the whole variant, not just the minor allele. Is there a way to filter out only the minor alternate allele by frequency, without dropping the variant completely?

SNP variant calling VCF multiallelic sites • 7.5k views
Check out bcftools view -i/-e expressions. The boolean operators might be helpful in navigating your niche requirements.

I tried, but both bcftools and vcffilter apply a cutoff value and then exclude the site altogether. I want to keep the site and remove only the extra allele. This is for poolseq analysis, which is why the read depths matter.

If it is an uncommon operation, you might need to do some manipulation with awk. You could also look for bcftools plugins, but I'm not sure if there's some sort of a library of plugins you could search.
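
As a rough sketch of the awk route (not a finished tool), the function below reports, for each ALT allele, its read count and frequency and whether it falls under a cutoff. It assumes a single pooled sample in the first sample column and an AD entry in FORMAT; the name flag_minor_alts and the toy record are made up for illustration. It only flags alleles; actually rewriting ALT, GT, AD and the other per-allele fields consistently is better left to bcftools.

```shell
#!/bin/sh
# flag_minor_alts MIN_FREQ (hypothetical helper): read a VCF on stdin and
# print one line per ALT allele: CHROM, POS, ALT, count, frequency, KEEP/DROP.
# Assumes FORMAT contains AD and looks only at the first sample column.
flag_minor_alts() {
    awk -F'\t' -v minfreq="$1" '
        /^#/ { next }                               # skip header lines
        {
            # find the position of AD within the FORMAT column
            nfmt = split($9, fmt, ":")
            ad_idx = 0
            for (i = 1; i <= nfmt; i++) if (fmt[i] == "AD") ad_idx = i
            if (ad_idx == 0) next                   # no AD: nothing to evaluate

            split($10, sample, ":")
            n = split(sample[ad_idx], ad, ",")      # ad[1] = REF, ad[2..n] = ALTs
            split($5, alt, ",")

            total = 0
            for (i = 1; i <= n; i++) total += ad[i]
            if (total == 0) next                    # avoid dividing by zero

            for (i = 2; i <= n; i++) {
                freq = ad[i] / total
                printf "%s\t%s\t%s\t%d\t%.4f\t%s\n", $1, $2, alt[i-1], ad[i], freq, (freq < minfreq ? "DROP" : "KEEP")
            }
        }'
}

# Toy record: the third allele (T) is seen in 1 read out of 200.
printf 'chr1\t100\t.\tA\tC,T\t50\tPASS\t.\tGT:AD\t0/1:120,79,1\n' | flag_minor_alts 0.05
```

On a real file this would be called as flag_minor_alts 0.05 < file.vcf, and the DROP lines give a list of alleles to check before filtering.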

3.9 years ago
ozankiratli ▴ 150

Found the answer!

  1. Convert the multiallelic VCF to biallelic records first

     bcftools norm -m - file.vcf > biallelic.vcf
    
  2. Filter out alternate alleles whose read counts fall below the chosen thresholds

     bcftools view -e "FORMAT/AD[:1]<2 && INFO/AD[1]<5" biallelic.vcf > biallelic-filtered.vcf
    
  3. Convert the filtered biallelic VCF back to multiallelic

     bcftools norm -m + biallelic-filtered.vcf > multiallelic-filtered.vcf
    
Note that this effectively replaces all observations of rare alleles with observations of reference alleles, when it's usually more appropriate to replace the affected genotypes with "./.".

That's correct, but it depends on the application! I use this for poolseq, so I really don't care about the genotypes. All I need is the allele frequencies. And for my data, I'm pretty confident that those are sequencing errors. I already remove anything below 5%. This was another step to do it.

You almost certainly should be excluding the affected samples from your allele frequency denominators. Your approach does not do that.

Can you explain why?
The filtering step can be different for different people. In my case I have 5 samples, and I exclude an alternate allele if its count is lower than 2 in any sample and its total count across all samples is lower than 5. The rest of my filtering is not included here.
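
To make that rule concrete, here is a small awk sketch of the same logic applied to already-split biallelic records. It is an illustration, not a replacement for the bcftools expression: the function name apply_rule and the toy records are made up, ALT depth summed over the FORMAT/AD values stands in for INFO/AD (an assumption), and samples with a missing AD are simply skipped.

```shell
#!/bin/sh
# apply_rule (hypothetical helper): read biallelic VCF records on stdin and
# print the ones that survive. A record is dropped when at least one sample
# has ALT depth < 2 AND the ALT depth summed over all samples is < 5.
apply_rule() {
    awk -F'\t' '
        /^#/ { print; next }                        # pass header lines through
        {
            # find the position of AD within the FORMAT column
            nfmt = split($9, fmt, ":")
            ad_idx = 0
            for (i = 1; i <= nfmt; i++) if (fmt[i] == "AD") ad_idx = i
            if (ad_idx == 0) { print; next }        # no AD: leave record alone

            any_low = 0
            alt_total = 0
            for (s = 10; s <= NF; s++) {
                split($s, fields, ":")
                split(fields[ad_idx], ad, ",")      # ad[1] = REF, ad[2] = ALT
                if (ad[2] == "" || ad[2] == ".") continue   # missing depth
                if (ad[2] + 0 < 2) any_low = 1
                alt_total += ad[2]
            }
            if (!(any_low && alt_total < 5)) print
        }'
}

# Toy records with three samples: the first is dropped (ALT depths 1,0,2),
# the second is kept (ALT depths 1,30,25).
{
    printf 'chr1\t100\t.\tA\tT\t50\tPASS\t.\tGT:AD\t0/0:50,1\t0/0:60,0\t0/0:55,2\n'
    printf 'chr1\t200\t.\tG\tA\t50\tPASS\t.\tGT:AD\t0/0:50,1\t0/1:40,30\t0/1:45,25\n'
} | apply_rule
```

The second record survives even though one sample has only a single ALT read, because both conditions have to hold for the allele to be removed.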

Ok, sorry, I should have looked more carefully at the rest of the post. If these thrown-out ALT alleles really are just sequencing errors, this approach is ok. My comments were directed at the case where the ALT alleles were rare but not errors.
