merging VCF files using bcftools - everyone becomes related
0
0
Entering edit mode
9 months ago
MaYo • 0

Hi everyone,

I merged about 3500 VCF files using bcftools [bcftools merge -m none -o merged.vcf.gz my_data/*.vcf.gz]. The merge seemed successful, but before proceeding, I checked for related individuals using PLINK's --genome and KING's kinship. Both indicated that all of my data are first-degree relatives, which seems extreme.

I suspect there might be an issue with the merge, but I'm unsure how to identify and fix it. Any advice would be greatly appreciated!

Thanks, Maayan

VCF PLINK bcftools • 870 views
ADD COMMENT
1
Entering edit mode

Are these gVCF files as mentioned in the title or VCF files? In my understanding, if you merged gVCFs - having an entry for each genomic position, variable or not - the output would be a gVCF again. Possibly, this is not expected by the other tools. bcftools merge has the -g --gvfc to merge the gVCF blocks. Maybe this helps. If that does not work, maybe you could use GATK GenomicsDBImport -> GenotypeGVCFs instead, if the variants were called via HaplotypeCaller.

ADD REPLY
0
Entering edit mode

thanks for the response! i double checked thanks to you - the are regular vcf files, and i still have no idea what is my problem...

ADD REPLY
0
Entering edit mode

Now it gets really difficult to "debug". I sometimes find it hard to predict what will happen when merging, intersecting, or otherwise joining the files. Some ideas:

  • Everything could be fine, just the kinship tools are at fault.
  • Still, the input may be unexpected for those tools. Do they expect only certain types of variants, e.g. only bi-allelic SNPs?
  • Try a smaller random subset of only merging 3-10 files. Do they still come out as siblings? If they do come out as siblings, and they are definitely not, I would just reject the kinship analysis as unreliable.
  • Get bcftools stats of the merged file and plots. Anything unexpected?
    • There are further parameters one could play with, like -m, --merge snps|indels|both|snp-ins-del|all|none|id or also applying bcftools norm to normalize indels.
ADD REPLY
0
Entering edit mode

If you look at the original VCF files, are "0/0" homozygous-REF genotypes present? Or do you have "variant-only" VCF files?

It sure sounds like you have the latter. Unfortunately, that implies you need to backtrack and generate individual VCF files that do contain homozygous-REF genotypes.

ADD REPLY
0
Entering edit mode

Thank you, just checked and you are right! I have variant-only data in my files, no 0/0. I guess there is a way to change ./. values in the merged files into 0/0, but that is problematic since this could be wrong... So I am not sure what to do with it.

Also, can you help me understand why these files show such strong kinship?

Thanks again!

ADD REPLY

Login before adding your answer.

Traffic: 1938 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6