I have VCFs with some overlapping samples. Is there a tool that can merge them like this?
###VCF1:
CHR POS ID ALT REF QUAL INFO FILTER FORMAT Sample1 Sample2 Sample3
SNP1...
SNP2...
SNP3...
###VCF2:
CHR POS ID ALT REF QUAL INFO FILTER FORMAT Sample2 Sample3 Sample4
SNP2...
SNP3...
SNP4...
I want this:
###VCF1+VCF2:
CHR POS ID ALT REF QUAL INFO FILTER FORMAT Sample1 Sample2 Sample3 Sample4
SNP1... (missing for Sample4)
SNP2...
SNP3...
SNP4... (missing for Sample1)
not this:
###VCF1+VCF2:
CHR POS ID ALT REF QUAL INFO FILTER FORMAT Sample1 Sample2 Sample3 Sample2_2 Sample3_2 Sample4
SNP1... (missing for Sample2_2, Sample3_2, and Sample4)
SNP2...
SNP3...
SNP4... (missing for Sample1, Sample2, and Sample3)
In this example of what I do not want, Sample2 and Sample3 would only have SNP1, SNP2, and SNP3, while Sample2_2 and Sample3_2 would have SNP2, SNP3, and SNP4.
Is there a tool that can merge VCFs and keep only one copy of each sample?
On face value, all that you require is `bcftools merge`. Pay close attention to the `-m` parameter, too. Missing genotypes will be represented as `./.`
`bcftools merge` expects sample names to be unique across the VCFs. We could use `--force-samples`, but then the duplicated samples are renamed (bcftools prepends the file index, e.g. `2:Sample2`), which the OP doesn't want.
Yeah, that is exactly my problem. vcf-merge and bcftools merge do not merge identical samples. Unfortunately, they create a new entry for each repeated sample.
It would be easier to split these back into individual per-sample VCFs, run `bcftools concat --allow-overlaps --remove-duplicates` to concatenate the files for each sample into a single VCF, and then merge everything with `bcftools merge`. This works; I have done it before in this type of situation.
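The split, concat, and merge workflow described above could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the function name `merge_shared_samples` and all file names are hypothetical, and it assumes bgzipped input VCFs, sample names without whitespace or slashes, and a reasonably recent bcftools on the `PATH`.

```shell
# Hypothetical helper (not part of bcftools): merge VCFs that share samples,
# keeping a single column per sample.
# Usage: merge_shared_samples OUT.vcf.gz IN1.vcf.gz IN2.vcf.gz ...
merge_shared_samples() (
    set -eu
    out=$1; shift
    work=$(mktemp -d)
    trap 'rm -rf "$work"' EXIT

    # 1) Split every input into single-sample VCFs, tagged by input index.
    i=0
    for vcf in "$@"; do
        i=$((i + 1))
        for sample in $(bcftools query -l "$vcf"); do
            mkdir -p "$work/$sample"
            bcftools view -s "$sample" -Oz -o "$work/$sample/$i.vcf.gz" "$vcf"
            bcftools index "$work/$sample/$i.vcf.gz"
        done
    done

    # 2) Per sample, concatenate its pieces and drop duplicated records.
    for dir in "$work"/*/; do
        sample=$(basename "$dir")
        bcftools concat --allow-overlaps --remove-duplicates \
            -Oz -o "$work/$sample.vcf.gz" "$dir"*.vcf.gz
        bcftools index "$work/$sample.vcf.gz"
    done

    # 3) Merge the per-sample files; every sample name is now unique.
    bcftools merge -Oz -o "$out" "$work"/*.vcf.gz
    bcftools index "$out"
)
```

A call such as `merge_shared_samples merged.vcf.gz vcf1.vcf.gz vcf2.vcf.gz` would then produce one column per sample. Two caveats with this sketch: the sample columns come out in glob (alphabetical) order rather than the original order, and when a shared sample has the same site in both inputs, only one copy of the record is kept, so it assumes the genotypes agree between files.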