I am working on VCF dorument and there are many indels and multiallelic in my VCF. However, it is not necessary for me to includ them in our analysis and I would like to remove them. Is there a simple way to do that with vcftools or bcftools?
You can filter anything you want using bcftools view. In your case, if you want to filter out indels and multiallelic, you would need something like this:
be careful with this command since it gives biallelic SNPs only. if you want to extract all SNPs with AT MOST 2 alleles, (which is not strictly the same as BIALLELIC, then you need to remove the -m2 option.
strictly speaking all snps are at least biallelic, since the variant detection implies that there's other allele apart from the reference one, so you shouldn't worry if you're working with variants only. the vcf format allows you to define positions where you may have a reference allele but not a alternative allele, and those ones would be removed on the first code. this is indeed not common at all, but you always have to keep in mind what you're filtering out, so removing the -m2 is a way not to worry about that possible issue.
so if it isn't working it must be either a problem with bcftools (that'd be odd) or either a problem with gnomad data.
if you query gnomad data for that particular position you'll see that there are 2 entries instead of 1, each one with 2 alleles maximum, so the -M2 filter from bcftools won't be able to filter them:
note that the merging process when normalizing can be quite intense on such large files as gnomad's, so if you are interested in particular regions do not forget to indicate this in your normalization command. I've edited my answer to reflect this.
I have always used vcf.gz, although I've seen vcf.bgz probably to indicate that the compression has been performed with the bgzip algorithm rather than simply gzip. but this is simply informative for the human user, not for the software.
if($5 !~ /,/ && length($5)==1 && length($4)==1){print} ---> If ALT column (5 column) does not contain a comma (there aren't ALT alleles), and REF ($4) and ALT ($5) columns have only one letter (SNP) print it.
Another suggestion is bcftools. bcftools has function of 'norm'.
It can be used with indel.
==
bcftools norm [OPTIONS] file.vcf.gz
Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows.
I was trying to do the same with all the 22 VCFs of the 1000 genome phase 3 datasets (~15G compressed) concatenated in a single VCF, but with bcftools was taking hours (more than 10 hours and still running). This is the command I was trying:
Really great and I used the second one since biallelic snps are we need.
be careful with this command since it gives biallelic SNPs only. if you want to extract all SNPs with AT MOST 2 alleles, (which is not strictly the same as BIALLELIC, then you need to remove the
-m2
option.Sorry. Do you mind to give me a small example of between biallelic snps with snps at most 2 alleles? Thanks and with best regards, Xinhui
strictly speaking all snps are at least biallelic, since the variant detection implies that there's other allele apart from the reference one, so you shouldn't worry if you're working with variants only. the vcf format allows you to define positions where you may have a reference allele but not a alternative allele, and those ones would be removed on the first code. this is indeed not common at all, but you always have to keep in mind what you're filtering out, so removing the
-m2
is a way not to worry about that possible issue.Not works well for me.
rs1799966
is tri-allelic, however,rs1799966
still left after-m2 -M2 -v snps
bcftools' documentation is very clear about this
so if it isn't working it must be either a problem with
bcftools
(that'd be odd) or either a problem with gnomad data.if you query gnomad data for that particular position you'll see that there are 2 entries instead of 1, each one with 2 alleles maximum, so the -M2 filter from bcftools won't be able to filter them:
if you want to do so, you'll have to join biallelic sites into multiallelic records in advance, using
bcftools norm -m+
for instance:and then try to filter your region of interest (I'm using 17:41223090-41223100 here as an example):
EDIT: first
bcftools view
is not really needed, sincebcftools norm
is already capable of slicing the data by region of interest:Yes. You are exactly right. That's the point I figure out later.
Another thing should be mentioned should be sort ? anyway, now I will sort the vcf first and then use -m2 -M2
note that the merging process when normalizing can be quite intense on such large files as gnomad's, so if you are interested in particular regions do not forget to indicate this in your normalization command. I've edited my answer to reflect this.
Hi Jorge, What kind of suffix would be preferred when we use bcftools view -Ou -o ? vcf.gz ? or bcf or vcf.bgz?
I have always used vcf.gz, although I've seen vcf.bgz probably to indicate that the compression has been performed with the
bgzip
algorithm rather than simplygzip
. but this is simply informative for the human user, not for the software.