How to merge two VCF files that contain the same variant but represent it differently
8.2 years ago

Hi.

Thank you for your continued help. I have an additional problem.

I would like to merge two VCF files (a.vcf, b.vcf) into one VCF file (c.vcf) using GATK CombineVariants. a.vcf and b.vcf contain the same variant, but the two records are not recognized as the same variant. Specifically, a.vcf and b.vcf look like this:

$ cat a.vcf

CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SampleA SampleB
chr1    897460  v5_202  A   <*:DEL> .   PASS    .   GT:AD:DP    0/0:20,0:20 0/1:14,14:28

$ cat b.vcf

CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SampleC SampleD
chr1    897459  v6_202  CA  C   2068.83 PASS    .   GT:AD:DP    0/0:43,0:43 0/1:40,6:46

As you can see, both files contain data for the same variant, but the coordinates differ. I want to merge the two files into one and combine these two records into a single variant record using GATK CombineVariants.

How should I merge the files?

genome snp sequence • 6.1k views

How can the same variant have different coordinates in different samples? From the IDs (v5_202, v6_202), it looks like the files were processed using different versions of "something". So you can't really merge them, or the variant will be represented twice in your VCF file as separate variants.


Thank you for your comment.

Each VCF file was called separately, using samples from different capture kits (V5, V6).


Even so, the same variant cannot have different coordinates. Essentially, you can't merge these two records unless they have the same coordinates.


In that case, either keep them as separate variants or manually correct one of the coordinates. I do not know, though, whether that will have any downstream effects.

8.2 years ago

What you are looking for is called variant normalization, or parsimonious variant representation. There is no need for manual work, though ;-)

When merging variants, I employ bcftools norm first. In fact, I found the following pipeline to work best:

bcftools norm --multiallelics '-any' a.vcf | bcftools norm -f '/path/to/genome.fa' > a.normed.vcf

After doing this for both files, I merge them using bcftools merge.

If your workflow is GATK-based, the appropriate tool chain might be VariantsToAllelicPrimitives, LeftAlignAndTrimVariants, and finally CombineVariants.
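
For the bcftools route, the whole sequence might be sketched roughly as below. This is only an outline under some assumptions: the function names are made up for illustration, the reference path and file names are placeholders, and the compressed output plus index step is there because bcftools merge expects bgzip-compressed, indexed inputs.

```shell
# Sketch only: function names, file names and the reference path are placeholders.

normalize_vcf() {
    # $1 = input VCF, $2 = reference FASTA, $3 = output .vcf.gz
    # First pass splits multiallelic records; second pass left-aligns
    # and trims indels against the reference, writing bgzipped output.
    bcftools norm --multiallelics -any "$1" \
        | bcftools norm -f "$2" -O z -o "$3" \
        && bcftools index --tbi "$3"
}

merge_two_vcfs() {
    # $1 = reference FASTA, $2 = first VCF, $3 = second VCF, $4 = merged output
    # Each input x.vcf is normalized to x.normed.vcf.gz before merging.
    normalize_vcf "$2" "$1" "${2%.vcf}.normed.vcf.gz" \
        && normalize_vcf "$3" "$1" "${3%.vcf}.normed.vcf.gz" \
        && bcftools merge -O z -o "$4" \
               "${2%.vcf}.normed.vcf.gz" "${3%.vcf}.normed.vcf.gz"
}

# Example call (paths are placeholders):
# merge_two_vcfs /path/to/genome.fa a.vcf b.vcf c.vcf.gz
```

Note that, as far as I know, bcftools norm leaves records with symbolic ALT alleles such as `<*:DEL>` untouched, so it is worth checking the normalized files before merging.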


Thank you for your great comment.

I applied the command you wrote to a.vcf and b.vcf, but neither file changed. Why is that? It seems that the REF and ALT alleles in b.vcf were not left-trimmed. How can I solve the problem?


First, please do not blindly apply commands someone posts somewhere! Instead, read the manuals and documentation of the commands, try to understand what they are doing, and then use them appropriately! If you had done this, and had a decent understanding of the Linux command line, you would have realized that a.vcf is not supposed to change, but that.... (this is left as an exercise - please read the link above)

Yes, the problem is the left alignment and trimming of the variants. In fact, your a.vcf is somewhat wrong because ALT must contain some nucleic acid letters. The representation of the variants used in b.vcf is the correct one.

You can solve the problem by reading my answer, learning the usage of the mentioned tools, and applying the tools to your files - again this is left as an exercise ;-)


Thank you for your advice! I'll study the VCF format.
