Hi All,
I hope you can lend me a hand here.
I have a multi-sample SNP VCF file (n=259 people) from targeted exome sequencing with ~20x coverage. The file was processed with GATK by a large center, so I am fairly confident in their work. It contains ~70 close relatives (mostly siblings, 1st cousins, parents), so I expect KING to estimate the relatedness accurately. Also, these people come from a fairly homogeneous (and isolated) population, so relatedness should be high.
I have further processed this vcf file using the following commands:
1: normalize:
bcftools norm -m-any x.vcf -Ov > Norm.vcf
2: left align:
bcftools norm -f genome.fa -o Norm.Aligned.vcf Norm.vcf
Then in plink/1.9:
plink --vcf Norm.Aligned.vcf --make-bed --out binary --allow-no-sex
Then in king:
./king -b binary.bed --kinship
Output:
Between-family kinship data saved in file king.kin0
Note --kinship --degree <n> can filter & speed up the kinship computing.
X-chromosome analysis...
X-chromosome genotypes stored in 777 64-bit words for each of 259 individuals.
Within-family kinship data saved in file kingX.kin
Relationship inference across families starts at Thu Apr 13 18:08:43 2023
ends at Thu Apr 13 18:08:43 2023
Between-family kinship data saved in file kingX.kin0
KING ends at Thu Apr 13 18:08:43 2023
This is what I obtained using --related.
I have also repeated the same processing without left-aligning (just normalizing), and with/without Plink2. I always obtained the same result.
Any thoughts on what I am missing?
Edited
The file contains 2.9 million SNPs, and I have run quantitative trait association analyses with these data that have been replicated by others. So I may be doing the vcf --> plink conversion incorrectly, or missing something else.
One thing is that the relationships should be apparent on an MDS plot; have you taken a look? That should tell you if vcf -> plink is broken.
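A minimal sketch of how that MDS check could be run in plink 1.9, assuming the fileset prefix "binary" from the question. The output prefix "mds" and the choice of 4 dimensions are arbitrary assumptions for illustration:

```shell
# "binary" is the PLINK fileset prefix from the question;
# "mds" is a hypothetical output prefix.
IN=binary
OUT=mds

# Cluster on pairwise IBS and write the first 4 MDS dimensions to $OUT.mds.
MDS_CMD="plink --bfile $IN --cluster --mds-plot 4 --out $OUT"
echo "$MDS_CMD"

# Run only if plink is on PATH and the input fileset exists.
if command -v plink >/dev/null 2>&1 && [ -e "$IN.bed" ]; then
  $MDS_CMD
fi
```

Plotting the first two dimensions from the resulting .mds file should show known relatives lying close together if the conversion worked.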
Typically this is not applied to WGS data but to microarray data, where sites are known to be polymorphic, which means that rare/private variants are for the most part excluded. I don't know if KING has a filter on frequency, but it is possible that private variants are driving this.
What happens if you subset to those variants with a MAF of say 10% or higher (0.1 < freq < 0.9) in this cohort?
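One way to try this, sketched under the assumption that the fileset prefix is "binary" as in the question and that the king binary sits in the current directory. The output prefix "binary.maf10" is a made-up name:

```shell
# "binary" is the fileset from the question; "binary.maf10" is a hypothetical
# prefix for the filtered copy.
IN=binary
OUT=binary.maf10

# Keep only variants with minor allele frequency >= 0.1 in this cohort,
# which corresponds to an allele frequency between 0.1 and 0.9.
FILTER_CMD="plink --bfile $IN --maf 0.1 --make-bed --allow-no-sex --out $OUT"
# Re-run the kinship estimation on the filtered fileset.
KING_CMD="./king -b $OUT.bed --kinship"
echo "$FILTER_CMD"
echo "$KING_CMD"

# Run only if plink is on PATH and the input fileset exists;
# KING runs only if the filter step succeeded and ./king is executable.
if command -v plink >/dev/null 2>&1 && [ -e "$IN.bed" ]; then
  if $FILTER_CMD && [ -x ./king ]; then
    $KING_CMD
  fi
fi
```

Comparing the variant count in the plink log before and after the filter shows how much of the call set is rare variation.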
Thanks for answering. The VCF file seems ok. It seems KING is not adequate for WES studies where SNPs are not called across most of the samples.
I tried KING on a GWAS dataset from the same samples and it worked fine.
Thanks