Question

Combine VCF files

21 months ago

Sowmya Pulapet ▴ 70

Hi,

I have 3 VCF files and I want to merge them into the following format:

#CHROM  POS   REF  ALT-1   ALT-2   ALT-3

I used the following command for this purpose:

bcftools merge -m all file1.vcf file2.vcf file3.vcf | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT{0}\t%ALT{1}\t%ALT{2}\n' > combined.txt

These are the numbers of variants in each file:

file1.vcf = 665,848
file2.vcf = 741,666
file3.vcf = 825,351
combined.txt = 705,445

combined.txt looks like this:

[image: screenshot of combined.txt]

I read that bcftools merge writes a record whenever at least one input file has a variant at that position. In combined.txt, however, only the ALT-1 column has entries; the ALT-2 and ALT-3 columns contain '.' throughout.

If a record is written whenever at least one file has a variant at a position, why are the other two columns empty? I don't understand how it decides which variants go into the file.

Also, if there are other ways to do this, please suggest them. I hope someone can help me with this.

Thank you

vcf bcftools • 1.1k views

ADD COMMENT • link updated 21 months ago by Istvan Albert 101k • written 21 months ago by Sowmya Pulapet ▴ 70

score 0 · Answer 1 · 2023-03-02

21 months ago

Istvan Albert 101k

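%ALT{N} selects an alternative allele by its position in the ALT column of the merged record, not by the file it came from. The index is 0-based, so %ALT{1} is the second alternative allele at that position and %ALT{2} is the third; neither is "the allele in file 2" or "the allele in file 3".

If a position has no second alternative allele, %ALT{1} has nothing to print and you get '.'; the same holds for %ALT{2} when there is no third.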
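
To illustrate the subscripting, here is a small shell sketch that mimics what %ALT{0}, %ALT{1} and %ALT{2} pick out of a single merged record. It does not call bcftools: the toy record and the awk emulation are made up for illustration, and the '.' placeholder simply follows what the question reports for missing alleles.

```shell
# Toy merged record (tab-separated): CHROM POS REF ALT, with two ALT alleles.
record=$(printf 'chr1\t100\tA\tG,T')

# Emulate %ALT{0}, %ALT{1}, %ALT{2}: split ALT on commas and
# print '.' wherever the record has no allele at that index.
alts=$(printf '%s\n' "$record" | awk -F'\t' '{
    n = split($4, a, ",")
    out = ""
    for (i = 1; i <= 3; i++)
        out = out (i > 1 ? " " : "") (i <= n ? a[i] : ".")
    print out
}')

echo "$alts"   # G T .
```

A record with a single ALT allele would give an allele followed by two dots, which matches the pattern seen in combined.txt.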

ADD COMMENT • link 21 months ago by Istvan Albert 101k

Oh, that does make sense. So how can I get the alternate alleles from the second and third files?

ADD REPLY • link 21 months ago by Sowmya Pulapet ▴ 70

I think there is a gap in your understanding of the concepts.

Are your VCFs single-sample VCFs?
Are you looking to obtain the genotypes of each sample at each location?

The alternate allele at a biallelic locus is the same in every sample that carries it; only a multi-allelic locus has more than one alternate allele. So if you want the genotype of each sample, run bcftools query with [\t%TGT] in the format string after you're done merging.
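
A minimal sketch of that suggestion, assuming single-sample inputs with distinct sample names that are bgzip-compressed and indexed (bcftools merge needs indexed files). The file names and the per_sample_genotypes helper are placeholders, not something from this thread.

```shell
# Format string: fixed columns, then one tab-separated translated
# genotype (%TGT, e.g. A/G) per sample; [ ] loops over the samples.
fmt='%CHROM\t%POS\t%REF\t%ALT[\t%TGT]\n'

# Hypothetical helper: merge the given VCFs and print per-sample genotypes.
per_sample_genotypes() {
    bcftools merge -m all "$@" | bcftools query -f "$fmt"
}

# Usage (placeholder file names):
# per_sample_genotypes file1.vcf.gz file2.vcf.gz file3.vcf.gz > combined.txt
```

The output should then have one genotype column per sample, so each sample's alleles are read from its own column rather than from %ALT{N}. Running bcftools query -l on the merged file lists the samples in column order.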