Edit (12th February 2020):
The word 'duplicate' can mean different things:
- duplicate variant ID
- duplicate position
- duplicate ref > alt change
It is important that you clarify what it is that you are regarding as duplicated. In most situations, it will relate to duplicate IDs, so, that is the main focus of this answer.
Also note that you can control the output format from most BCFtools commands with -O
/ --output-type
(you have not specified this in your commands):
-O, --output-type <type> 'b' compressed BCF; 'u' uncompressed BCF;
'z' compressed VCF; 'v' uncompressed VCF [v]
Solution 1 - split-multi-alleles, left-align indels, and resetting the ID to something unique
Before removing duplicates based on anything, it may be advisable to first normalise your file via:
- splitting multi-allelic calls
- left-aligning indels
.
bcftools norm -m-any myfile.vcf.gz | \
bcftools norm --check-ref w -f human_g1k_v37.fasta -Ob > out.bcf ;
bcftools index out.bcf ;
Note that you require a reference FASTA for the second command (which left-aligns indels).
Ironically, splitting multi-allelic sites through normalisation can result in new duplicate IDs, on top of those that may already have been there. You can therefore set the ID field to something unique, like this:
bcftools annotate -Ob -x 'ID' -I +'CHROM:%POS:%REF:%ALT'
This should mitigate most issues, but you lose the original ID; however, you get to retain / keep all variants that were in the original file.
---------------------------------------
Solution 2 - create / permit multi-allelic sites
If your initial problem is duplicate IDs, a much safer solution is to simply merge all multi-allelic 'duplicate' calls together. This also preserves all information that you had in your previous file but may be problematic if you don't want multi-allelic calls (some programs don't like working with multi-alleles):
bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf
bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs10152211"
15 27221345 rs10152211 C G,T 100 PASS
bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs115550614"
15 34708366 rs115550614 G A,C 100 PASS
---------------------------------------
Solution 3 - awk
If you literally want to remove records with a duplicate ID, then this works; however, it will only keep the first match, which may be dangerous if it's not the variant allele of interest:
bcftools view myfile.bcf | awk '/^#/{print}; !/^#/{if (!uniq[$3]++) print}'
---------------------------------------
Solution 4 - --rm-dup
Just use the --rm-dup
flag from bcftools norm
and many other BCFtools commands. I don't know how this functions, exactly, but it may have the same effect as the awk
command.
With that, hopefully you can see my original point about first clarifying what it is that you are regarding as duplicated.
Kevin
Thank you.
What is the purpose of normalisation? I am working on it. I download the human_g1k_v37.fasta.gz file from the 1000G FTP. But I cannot uncompress it to .fasta, my Mac return Error 32 - Broken pipe.
It ensures that multi-allelic sites are split into multiple rows in the VCF, and also that your ref alleles match the sequence in the reference genome FASTA. It may or may not be important for your work.
The reference FASTA has to be uncompressed, not as gzip.
I successfully normalize my file and got the bcf file. But when I apply the below command to remove duplicate, it returns error. "
-D
is functional but deprecated, replaced by-d both
."So I tried this, but it does not remove duplicate snps
Can you possibly paste an example of your VCF here so that I can test?
The original VCF is huge. I share this bcf file I got in the below link. And also where the original VCF is from in the word documents. Thank you so much.
https://www.dropbox.com/sh/sw781hk4n7t1ntv/AABYfT80cBMSFuzKooE_1rKea?dl=0
Thanks for sharing! I'm now beginning to understand why it's not removing these 'duplicate' rs IDs. It's due to the behaviour of
BCFtools
. There is an indirect way to do it withBCFtools
, but also another solution withawk
.As you mentioned, many of the records have the same rs ID but alternate bases, which is perfectly reasonable (a rs ID can have multiple variant alleles). If we take a look at some of these:
I did some exploring and found the duplicate IDs with this command:
Solution 1
as above in edited original answer
Solution 2
as above in edited original answer
Thank you so much!
I now use the below command to normalize and merge the multi-allelic, and then save as bcf. An updated bcf file is available here: https://www.dropbox.com/sh/sw781hk4n7t1ntv/AABYfT80cBMSFuzKooE_1rKea?dl=0
But when I check the duplicate SNPs using the below command. There are a few duplicate SNPs remained. I think these were not merged because they have different bp? or they have same alleles?
bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'
Therefore, for these SNPs I would like to use the Solution 1 you mentioned to keep only one call. I made some revision because I want to save the result as rmvdup.vcf, while your command above applies "bcftools view" to print the result. But then I got stuck here, I tried to place "-Ou" at different position, but the command cannot run, and no vcf was saved.
Okay, this situation is bizarre! I have never seen this before where the same rs ID has different positions in a 1000 Genomes file - I think that these are errors but I cannot be sure.
I used this command to find the SNPs that behave this way:
The way to combine my 2 solutions together is as follows (note that the
bcftools annotate
command is not necessary):The only problem is that, of the 3 duplicate SNPs that remained, we now have rs542232278 at the incorrect position. I checked each of these SNPs in the UCSC genome browser and the correct positions are:
If you want to ensure that you have no duplicates and also have the correct SNP position for the 3 strange duplicates, then the following works:
It may be very annoying doing this for each chromosome but, hey, this is the way that bioinformatics is!
Note that the
--check-ref
parameter should warn when a reference allele in the VCF is not compatible with the reference FASTA. However, the 3 strange duplicates avoid this because the reference base at each coincidentally matches the base in the reference genome too.Regarding
bcftools annotate
, it is useful when you want to change the value in the ID field. For example, rs IDs can be difficult to work with, so, to make variant truly unique could could change the VCF ID field to chr:pos:ref:alt with:Here is a possible explanation of Multiple Mapping Positions for a Single SNP
I think I should consider changing the ID to '%CHROM:%POS:%ID'. So I used this command to merge multi-allelic, to normalize, and to replace the ID with chr:pos:id.
chr15_1000Gphase3_EUR_snp_maf.bcf is available here
And then I used this command to check if any duplicate exists in the new ID column
And this command to check the duplicate rs145926341, and now if I use the new ID column (chr:pos:rsid), these two records are different.
Thank you so much again.
Looks great! - thanks. No problem.
Thanks regarding the link to - very interesting!
hello, i want to ask a question about duplicate SNP if they in different rows with the same ID ,POS and REF, but alt and corresponding AD and AF is different, they are multiallelic variants?