I have 6 VCF files that contain SNPs only and were produced by GATK. Each VCF represents one individual animal from breed X, so they are biological replicates. I also have another 6 files from breed Y.
I merged them using bcftools merge:
bcftools merge -Oz L10A2_SNP.vcf.gz L10A_SNP.vcf.gz L10B_SNP.vcf.gz L10C_SNP.vcf.gz L10D_SNP.vcf.gz L10E_SNP.vcf.gz -o merged_L10_SNPs.vcf.gz --threads 16
I checked the number of SNPs in each of the 6 files using
bcftools view -v snps RossA2_SNP.vcf.gz | grep -v -c '^#'
bcftools view -v snps RossA_SNP.vcf.gz | grep -v -c '^#'
bcftools view -v snps RossB_SNP.vcf.gz | grep -v -c '^#'
bcftools view -v snps RossC_SNP.vcf.gz | grep -v -c '^#'
bcftools view -v snps RossD_SNP.vcf.gz | grep -v -c '^#'
bcftools view -v snps RossE_SNP.vcf.gz | grep -v -c '^#'
The numbers of SNPs were:
RossA2_SNP.vcf.gz: 221337
RossA_SNP.vcf.gz: 225504
RossB_SNP.vcf.gz: 280209
RossC_SNP.vcf.gz: 426710
RossD_SNP.vcf.gz: 271401
RossE_SNP.vcf.gz: 306445
and for the merged file
bcftools view -v snps merged_Ross_SNP.vcf.gz | grep -v -c '^#'
The number is as follows: 715116
Since this number is smaller than the sum of the six individual counts, I assume that the merged file contains only unique records that are common to all 6 VCF files?
First question: Does the merged file contain the non-duplicated SNPs from the 6 files? Second question: If I use bcftools isec, can I use the merged VCF files from breeds X and Y to get the intersection, which would represent all 6 replicates within each breed?
Thanks
You can look into your merged file and individual files to see what's happening.
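As a rough sketch of that check (not a definitive recipe): bcftools merge keeps every site that appears in at least one input file and fills in missing genotypes for the samples that lack it, so the merged file should be the union of the six files rather than their intersection. That fits your numbers, since 715116 is larger than any single file but smaller than the sum of the six counts. The helper names sites and sharing_histogram below are my own, and keying sites on CHROM:POS assumes merge's default behaviour of putting SNPs at the same position into one record.

```shell
# Hypothetical helper: print one CHROM:POS key per site in a VCF.
# Assumes bcftools is on PATH and the file holds SNPs only;
# sort -u drops repeated positions within the same file.
sites() {
    bcftools query -f '%CHROM:%POS\n' "$1" | sort -u
}

# Hypothetical helper: given one site list per animal, print "k n" rows,
# meaning n sites were seen in exactly k of the lists.
sharing_histogram() {
    cat "$@" | sort | uniq -c \
        | awk '{h[$1]++} END {for (k in h) print k, h[k]}' \
        | sort -n
}

# Intended use (left as comments so nothing runs when the helpers are sourced):
#   for f in Ross*_SNP.vcf.gz; do sites "$f" > "${f%.vcf.gz}.sites"; done
#   sharing_histogram Ross*.sites
```

The n values over all rows should add up to roughly the merged total, and the row with k = 6 is the number of sites found in all six animals. For the within-breed intersection itself, bcftools isec with -n=6 and -p (an output directory of your choice), run on the six individual files rather than on the merged file, should write only the sites present in all six.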