How to merge VCF files without duplicates
3.6 years ago

Hey guys, I want to merge VCF files generated by three different variant callers without ending up with duplicate records. The same BAM file was used as input for all three callers.

1. BCFtools

bcftools mpileup -Ou -f hg19.fa.gz NIST7035_BWA_Samtools_sorted_PCR_RG.bam | bcftools call -mv -Ov -o NIST7035_BWA_Samtools_sorted_PCR_RG_bcftools_call.vcf

2. GATK HaplotypeCaller

java -Xmx6G -jar gatk.jar HaplotypeCaller -R /mnt/x/linux/NIST_Garvan/hg19.fa.gz -I /mnt/x/linux/NIST_Garvan/NIST7035_BWA_Samtools_sorted_PCR_RG.bam -O /mnt/x/linux/NIST_Garvan/NIST7035_BWA_Samtools_sorted_PCR_RG_GATK_HaplotypeCaller.vcf

3. Freebayes

freebayes -f hg19_freebayes.fa NIST7035_BWA_Samtools_sorted_PCR_RG.bam > NIST7035_BWA_Samtools_sorted_PCR_RG_Freebayes.vcf

I am using bcftools merge, but I think I am getting duplicate calls.

A suggestion with the command line will be very helpful.

Thank you!!!!

ADD COMMENT

but I think I am getting duplicate calls.

show us an example please

ADD REPLY

I am not sure about that, but BCFtools generates 671010 variants, GATK generates 316799, and Freebayes generates 593455. The merged file goes up to 960578 variants.

ADD REPLY

what is the output of

bcftools view --no-header -G out.vcf | cut -f1,2,4 | sort | uniq -d 
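For context, the pipeline above drops the sample columns (-G), keeps CHROM, POS and REF, and prints every combination that occurs more than once. Because ALT is not part of the key, two records at the same position with different alternate alleles are reported as well. A minimal sketch of that difference using only standard text tools; the four toy records are made up for illustration:

```shell
# Made-up stand-in for the first five VCF columns (CHROM POS ID REF ALT).
toy=$(printf 'chr1\t100\t.\tA\tG\nchr1\t100\t.\tA\tT\nchr1\t200\t.\tC\tT\nchr1\t200\t.\tC\tT\n')

# Same key as the command above: CHROM, POS, REF (columns 1, 2, 4).
dup_by_ref=$(printf '%s\n' "$toy" | cut -f1,2,4 | sort | uniq -d)

# Stricter key that also includes ALT (column 5): only exact repeats remain.
dup_by_alt=$(printf '%s\n' "$toy" | cut -f1,2,4,5 | sort | uniq -d)

printf 'Repeated CHROM/POS/REF:\n%s\n' "$dup_by_ref"
printf 'Repeated CHROM/POS/REF/ALT:\n%s\n' "$dup_by_alt"
```

On the toy data the first key flags both positions, while the stricter key flags only chr1:200, where REF and ALT are identical. On a real file the same idea would be `cut -f1,2,4,5` in place of `cut -f1,2,4`.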
ADD REPLY
    chr1    113196196    TA
    chr1    113202203    TCTCTC
    chr1    115631645    A
    chr1    136644       AG
    chr1    142538258    G
    chr1    143231197    AG
    chr1    143234310    GC
    chr1    143378492    AAG
    chr1    143380641    GT
    chr1    143537385    C
    chr1    145112288    G
    chr1    145370407    TGA
    chr1    149716171    A
    chr1    152448195    G
    chr1    155404294    TAA
    chr1    159896887    C
    chr1    160345283    T
    chr1    161953015    A
    chr1    16360053     CT
    chr1    1650942      A
    chr1    17086315     TC
    chr1    210824620    T
    chr1    2120770      C
    chr1    235715063    TA

And some more.

ADD REPLY

What was the command used to merge?

ADD REPLY

bcftools merge --force-samples file1.vcf file2.vcf file3.vcf >file123.vcf

ADD REPLY

Hi, what, in your definition, is a duplicate?

  • any variant called at the same POS?
  • one or more variants that have the same ID?
  • samples with the same ID?

Please take a look at the --merge flag with bcftools merge. Also, prior to merging these files, I would normalise them by using bcftools norm -m-any -f ref.fasta
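As a rough illustration of the --merge (-m) flag: it controls which records at the same position are joined into one multiallelic line (for example none, snps, indels, both or all). The sketch below wraps the merge command from earlier in the thread so the mode can be changed easily. The function name merge_callers and all file names are hypothetical, and the inputs are assumed to be bgzip-compressed and indexed.

```shell
# Hypothetical wrapper around bcftools merge for per-caller VCFs.
# $1 = merge mode (e.g. none, both, all)
# $2 = output file
# remaining arguments = bgzipped, indexed input VCFs (assumption)
merge_callers() {
    mode=$1
    out=$2
    shift 2
    bcftools merge --force-samples -m "$mode" -Oz -o "$out" "$@"
}
```

For example, `merge_callers none merged.vcf.gz bcftools.vcf.gz gatk.vcf.gz freebayes.vcf.gz` would keep records with different ALT alleles on separate lines instead of creating new multiallelic records.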

ADD REPLY

I think there are some duplicates with the same ID, but that's not my main issue here. I am calling variants with 3 different variant callers on the same sample (NA12878 GIAB Garvan data).

BCFtools generates 671010 variants, GATK generates 316799, and Freebayes generates 593455. The merged file goes up to 960578 variants.

I also have a VCF file from the same project to use as a reference, but it has only 416818 variants. I don't know why there is so much difference!
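One way to sanity-check the counts quoted above: a merge behaves roughly like a union of the three call sets, so the merged total should fall between the largest single file and the sum of all three. A small sketch of that arithmetic, using only the numbers from the post (the variable names are illustrative):

```shell
# Record counts as reported in the post.
bcftools_n=671010
gatk_n=316799
freebayes_n=593455
merged_n=960578

# Total records if no site were shared between callers.
total=$((bcftools_n + gatk_n + freebayes_n))

# Per-caller records that were folded into a record shared with another caller.
collapsed=$((total - merged_n))

echo "sum of per-caller records: $total"
echo "records folded together by the merge: $collapsed"
```

A merged count above the largest single file but well below the sum is what overlapping call sets would produce, so the total alone does not show that records were duplicated.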

ADD REPLY

In each file from each variant caller, please normalise the variants and set the IDs to be unique. Then do the merge.
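A minimal sketch of that per-file preparation, assuming bcftools is installed and the reference FASTA matches the one used for calling. The helper name normalise_and_tag and the ID pattern CHROM:POS:REF:ALT are illustrative choices, not a fixed recipe.

```shell
# Hypothetical helper: normalise one caller's VCF and give every record a unique ID.
# $1 = input VCF, $2 = reference FASTA, $3 = output .vcf.gz
normalise_and_tag() {
    # Split multiallelic sites and left-align indels, then overwrite the ID column.
    bcftools norm -m-any -f "$2" "$1" |
        bcftools annotate --set-id '%CHROM:%POS:%REF:%ALT' -Oz -o "$3" - &&
        # Index the result so it can be passed to bcftools merge.
        bcftools index "$3"
}
```

It would be run once per caller, for example `normalise_and_tag gatk.vcf hg19.fa gatk.norm.vcf.gz` (file names are placeholders), before merging the three normalised files.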

Please see what I am doing in Step 4, here: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2

ADD REPLY

I did exactly as you described in Step 4, but it did not resolve the merging issue. What should I do now? Please help.

ADD REPLY

What is the "merging issue", exactly? You tabulated some numbers and think that they are incorrect? Please show records from your individual VCFs, and then the merged VCF, that highlight the issue. Thanks.

ADD REPLY

Sir, I am new to bioinformatics and am developing a simple pipeline that includes variant calling with 3 different variant callers on the same sample (NA12878 GIAB Garvan data).

After merging with bcftools merge, I get a VCF file with a header like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878 2:NA12878   3:NA12878

Here you can see it shows three NA12878 samples, because the calls came from 3 different variant callers. That's why I think I am having a merging issue. Thank you.
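For reference, bcftools merge adds one sample column per input file, and --force-samples prefixes repeated sample names with the file index, which is where 2:NA12878 and 3:NA12878 come from. Those columns can be used to see how many callers support each record. A rough sketch with awk on made-up merged records, assuming a missing call is written as "./." or ".":

```shell
# Made-up merged records: 9 fixed VCF columns followed by one genotype column per caller.
merged=$(printf '%s\n' \
    '#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA12878 2:NA12878 3:NA12878' \
    'chr1 100 . A G 50 . . GT 0/1 0/1 0/1' \
    'chr1 200 . C T 30 . . GT ./. 1/1 ./.' \
    'chr1 300 . G A 40 . . GT 0/1 ./. 0/1' | tr ' ' '\t')

# For each record, count the sample columns whose genotype is not missing.
support=$(printf '%s\n' "$merged" | awk -F'\t' '
    /^#/ { next }                          # skip header lines
    {
        n = 0
        for (i = 10; i <= NF; i++) {
            split($i, f, ":")              # GT is the first FORMAT field
            if (f[1] != "./." && f[1] != ".") n++
        }
        print $1 ":" $2, n
    }')

printf '%s\n' "$support"
```

Each output line gives a position and the number of callers with a call there. If caller names in the header would be easier to read, the sample columns can be renamed before merging, for instance with bcftools reheader --samples.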

ADD REPLY

You didn't answer Kevin's question:

Please show records from your individual VCFs, and then the merged VCF, that highlight the issue. Thanks.

We want to see some variants, not the samples.

ADD REPLY

Maybe you need concat, not merge
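To make the distinction concrete: merge places each input in its own sample column, whereas concat stacks records from files that share the same sample columns into a single column. A hedged sketch of the concat route, which fits here only if all three VCFs carry the same sample name; the function name concat_callers and the file names are hypothetical, and the inputs are assumed to be bgzip-compressed and indexed.

```shell
# Hypothetical wrapper: stack per-caller VCFs into a single-sample file.
# $1 = output file, remaining arguments = bgzipped, indexed input VCFs (assumption)
concat_callers() {
    out=$1
    shift
    # --allow-overlaps lets the inputs cover the same regions;
    # --remove-duplicates writes a record present in several files only once.
    bcftools concat --allow-overlaps --remove-duplicates -Oz -o "$out" "$@"
}
```

For example, `concat_callers combined.vcf.gz bcftools.vcf.gz gatk.vcf.gz freebayes.vcf.gz`. The callers write different INFO and FORMAT tags, so the combined file would mix annotation styles; whether that is acceptable depends on what the downstream steps need.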

ADD REPLY

I am sharing screenshots of the VCF files generated by the 3 variant callers; the last one is the merged VCF file. The files were opened in Notepad++.

[Screenshots: Variants from BCFTools; Variants from GATK; Variants from GATK; Merged VCF]

ADD REPLY
