Remove duplicate SNPs only based on SNP ID in bcftools
1
3
Entering edit mode
7.3 years ago
janhuang.cn ▴ 230

I want to remove duplicate SNPs only according to SNP ID (e.g. rs123), from a vcf file.

I used the below command, but it duplicate SNPs still exist. Because they have different bp or allele. (I want to keep only the unique SNP ID)

bcftools norm --remove-duplicates --output file_rmvdup.vcf.gz file.vcf.gz

Then I tried the below command. But It says my vcf is not compressed.

bcftools concat --rm-dup all -a --output file_rmvdup.vcf.gz  file.vcf.gz

So I generate a compressed vcf file, using the below command.

bcftools view file.vcf.gz -Oz -o compressed_file.vcf.gz

But when I run the below command, it returns "could not load index"

bcftools concat --rm-dup all -a --output file_rmvdup.vcf.gz  compressed_file.vcf.gz
SNP duplicate bcftools • 12k views
ADD COMMENT
12
Entering edit mode
7.3 years ago

Edit (12th February 2020):

The word 'duplicate' can mean different things:

  • duplicate variant ID
  • duplicate position
  • duplicate ref > alt change

It is important that you clarify what it is that you are regarding as duplicated. In most situations, it will relate to duplicate IDs, so, that is the main focus of this answer.

Also note that you can control the output format from most BCFtools commands with -O / --output-type (you have not specified this in your commands):

-O, --output-type <type>    'b' compressed BCF; 'u' uncompressed BCF; 
                            'z' compressed VCF; 'v' uncompressed VCF [v]

Solution 1 - split-multi-alleles, left-align indels, and resetting the ID to something unique

Before removing duplicates based on anything, it may be advisable to first normalise your file via:

  1. splitting multi-allelic calls
  2. left-aligning indels

.

bcftools norm -m-any myfile.vcf.gz | \
  bcftools norm --check-ref w -f human_g1k_v37.fasta -Ob > out.bcf ;

bcftools index out.bcf ;

Note that you require a reference FASTA for the second command (which left-aligns indels).

Ironically, splitting multi-allelic sites through normalisation can result in new duplicate IDs, on top of those that may already have been there. You can therefore set the ID field to something unique, like this:

bcftools annotate -Ob -x 'ID' -I +'CHROM:%POS:%REF:%ALT'

This should mitigate most issues, but you lose the original ID; however, you get to retain / keep all variants that were in the original file.

---------------------------------------

Solution 2 - create / permit multi-allelic sites

If your initial problem is duplicate IDs, a much safer solution is to simply merge all multi-allelic 'duplicate' calls together. This also preserves all information that you had in your previous file but may be problematic if you don't want multi-allelic calls (some programs don't like working with multi-alleles):

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs10152211"
15  27221345    rs10152211  C   G,T 100 PASS

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs115550614"
15  34708366    rs115550614 G   A,C 100 PASS

---------------------------------------

Solution 3 - awk

If you literally want to remove records with a duplicate ID, then this works; however, it will only keep the first match, which may be dangerous if it's not the variant allele of interest:

bcftools view myfile.bcf | awk '/^#/{print}; !/^#/{if (!uniq[$3]++) print}'

---------------------------------------

Solution 4 - --rm-dup

Just use the --rm-dup flag from bcftools norm and many other BCFtools commands. I don't know how this functions, exactly, but it may have the same effect as the awk command.


With that, hopefully you can see my original point about first clarifying what it is that you are regarding as duplicated.

Kevin

ADD COMMENT
0
Entering edit mode

Thank you.

What is the purpose of normalisation? I am working on it. I download the human_g1k_v37.fasta.gz file from the 1000G FTP. But I cannot uncompress it to .fasta, my Mac return Error 32 - Broken pipe.

ADD REPLY
0
Entering edit mode

It ensures that multi-allelic sites are split into multiple rows in the VCF, and also that your ref alleles match the sequence in the reference genome FASTA. It may or may not be important for your work.

The reference FASTA has to be uncompressed, not as gzip.

ADD REPLY
0
Entering edit mode

I successfully normalize my file and got the bcf file. But when I apply the below command to remove duplicate, it returns error. "-D is functional but deprecated, replaced by -d both."

bcftools norm  --remove-duplicates --output file_rmvdup.vcf normalized.bcf

So I tried this, but it does not remove duplicate snps

bcftools norm --rm-dup both --output file_rmvdup.vcf normalized.bcf
ADD REPLY
0
Entering edit mode

Can you possibly paste an example of your VCF here so that I can test?

ADD REPLY
0
Entering edit mode

The original VCF is huge. I share this bcf file I got in the below link. And also where the original VCF is from in the word documents. Thank you so much.

https://www.dropbox.com/sh/sw781hk4n7t1ntv/AABYfT80cBMSFuzKooE_1rKea?dl=0

ADD REPLY
1
Entering edit mode

Thanks for sharing! I'm now beginning to understand why it's not removing these 'duplicate' rs IDs. It's due to the behaviour of BCFtools. There is an indirect way to do it with BCFtools, but also another solution with awk.

As you mentioned, many of the records have the same rs ID but alternate bases, which is perfectly reasonable (a rs ID can have multiple variant alleles). If we take a look at some of these:

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs10152211" | cut -f1-7
15  27221345    rs10152211  C   G   100 PASS
15  27221345    rs10152211  C   T   100 PASS
bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs115550614" | cut -f1-7
15  34708366    rs115550614 G   A   100 PASS
15  34708366    rs115550614 G   C   100 PASS

I did some exploring and found the duplicate IDs with this command:

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'

Solution 1

as above in edited original answer

Solution 2

as above in edited original answer

ADD REPLY
1
Entering edit mode

Thank you so much!

I now use the below command to normalize and merge the multi-allelic, and then save as bcf. An updated bcf file is available here: https://www.dropbox.com/sh/sw781hk4n7t1ntv/AABYfT80cBMSFuzKooE_1rKea?dl=0

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.vcf.gz  | bcftools norm -Ou --fasta-ref human_g1k_v37.fasta | bcftools annotate -Ob -I +'%ID' > chr15_1000Gphase3_EUR_snp_maf.bcf

But when I check the duplicate SNPs using the below command. There are a few duplicate SNPs remained. I think these were not merged because they have different bp? or they have same alleles?

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'

# example of duplicate snp
bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs1610794" | cut -f1-7

15      82635194        rs1610794       T       C       100     PASS
15      82932518        rs1610794       T       C       100     PASS

Therefore, for these SNPs I would like to use the Solution 1 you mentioned to keep only one call. I made some revision because I want to save the result as rmvdup.vcf, while your command above applies "bcftools view" to print the result. But then I got stuck here, I tried to place "-Ou" at different position, but the command cannot run, and no vcf was saved.

bcftools norm -Ou chr15_1000Gphase3_EUR_snp_maf_rmvdup.vcf chr15_1000Gphase3_EUR_snp_maf.bcf | awk '!uniq[$3]++'
ADD REPLY
0
Entering edit mode

Okay, this situation is bizarre! I have never seen this before where the same rs ID has different positions in a 1000 Genomes file - I think that these are errors but I cannot be sure.

I used this command to find the SNPs that behave this way:

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | bcftools norm -Ov --check-ref w --fasta-ref human_g1k_v37.fasta | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'
rs145926341
rs145926341
rs542232278
rs542232278
rs1610794
rs1610794

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs145926341" -e "rs542232278" -e "rs1610794" | cut -f1-7
15  82605335    rs145926341 C   A   100 PASS
15  82626078    rs542232278 C   G   100 PASS
15  82635194    rs1610794   T   C   100 PASS
15  82902661    rs145926341 C   A   100 PASS
15  82923405    rs542232278 C   G   100 PASS
15  82932518    rs1610794   T   C   100 PASS

The way to combine my 2 solutions together is as follows (note that the bcftools annotate command is not necessary):

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | bcftools norm -Ov --check-ref w --fasta-ref human_g1k_v37.fasta | awk '!uniq[$3]++'

The only problem is that, of the 3 duplicate SNPs that remained, we now have rs542232278 at the incorrect position. I checked each of these SNPs in the UCSC genome browser and the correct positions are:

15 82605335 rs145926341 C A

15 82923405 rs542232278 C G

15 82635194 rs1610794 T C

If you want to ensure that you have no duplicates and also have the correct SNP position for the 3 strange duplicates, then the following works:

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | bcftools norm -Ov --check-ref w --fasta-ref human_g1k_v37.fasta | awk '{if ($2$3!="82626078rs542232278" && $2$3!="82902661rs145926341" && $2$3!="82932518rs1610794") print $0}'

It may be very annoying doing this for each chromosome but, hey, this is the way that bioinformatics is!

Note that the --check-ref parameter should warn when a reference allele in the VCF is not compatible with the reference FASTA. However, the 3 strange duplicates avoid this because the reference base at each coincidentally matches the base in the reference genome too.

Regarding bcftools annotate, it is useful when you want to change the value in the ID field. For example, rs IDs can be difficult to work with, so, to make variant truly unique could could change the VCF ID field to chr:pos:ref:alt with:

bcftools annotate -Ov -x 'ID' -I +'%CHROM:%POS:%REF:%ALT' MyFile.bcf
ADD REPLY
2
Entering edit mode

Here is a possible explanation of Multiple Mapping Positions for a Single SNP

I think I should consider changing the ID to '%CHROM:%POS:%ID'. So I used this command to merge multi-allelic, to normalize, and to replace the ID with chr:pos:id.

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.bcf  | bcftools norm -Ou --fasta-ref human_g1k_v37.fasta | bcftools annotate --output-type b --output chr15_1000Gphase3_EUR_snp_maf_chrposrsid.bcf  -I '%CHROM:%POS:%ID'

chr15_1000Gphase3_EUR_snp_maf.bcf is available here

And then I used this command to check if any duplicate exists in the new ID column

bcftools view chr15_1000Gphase3_EUR_snp_maf_chrposrsid.bcf | cut -f3  | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'

And this command to check the duplicate rs145926341, and now if I use the new ID column (chr:pos:rsid), these two records are different.

bcftools view  chr15_1000Gphase3_EUR_snp_maf_chrposrsid.bcf  | grep -e "rs145926341" | cut -f1-7

15      82605335        15:82605335:rs145926341 C       A       100     PASS
15      82902661        15:82902661:rs145926341 C       A       100     PASS

Thank you so much again.

ADD REPLY
0
Entering edit mode

Looks great! - thanks. No problem.

Thanks regarding the link to - very interesting!

ADD REPLY
0
Entering edit mode

hello, i want to ask a question about duplicate SNP if they in different rows with the same ID ,POS and REF, but alt and corresponding AD and AF is different, they are multiallelic variants?

ADD REPLY

Login before adding your answer.

Traffic: 1061 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6