Hello Biostars community!
I am currently working on a case/control study in which we ran NGS sequencing for a panel of ~198 SNPs. I started by using VCFtools (through tabix, parallel and vcf-merge) to index and merge the VCFs. After that I used VCFtools to generate my PLINK files (.ped and .map) so that I could run association statistics in PLINK 1.9 (Linux command line). I was surprised to see 356 variants, because my SNP panel was designed to cover only 198 target SNPs. I also got a low "Total genotyping rate" of 0.520599 after running a --missing analysis on my PLINK files, which I believe is due to that overcounting of variants. Would anyone be able to give me some advice on solving this problem? This is what my commands and their output look like:
(VCF_MERGE) elielson@elielson-VirtualBox:~/bioinfo/arbovirose_all_vcf$ vcftools --vcf merged_vcfs.vcf --plink --out myplink
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf merged_vcfs.vcf
--out myplink
--plink
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in INFO entry: ID=TYPE,Number=A,Type=String,Description="The type of allele, either snp, mnp, ins, del, or complex.">
Warning: Expected at least 2 parts in FORMAT entry: ID=GQ,Number=1,Type=Integer,Description="Genotype Quality, the Phred-scaled marginal (or unconditional) probability of the called genotype">
After filtering, kept 6 out of 6 Individuals
Writing PLINK PED and MAP files ...
PLINK: Only outputting biallelic loci.
Done.
After filtering, kept 367 out of a possible 367 Sites
Run Time = 0.00 seconds
(base) elielson@elielson-VirtualBox:~/bioinfo/arbovirose_all_vcf$ plink --file myplink --missing --allow-no-sex
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to plink.log.
Options in effect:
--allow-no-sex
--file myplink
--missing
2910 MB RAM detected; reserving 1455 MB for main workspace.
Possibly irregular .ped line. Restarting scan, assuming multichar alleles.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (356 variants, 6 people).
--file: plink-temporary.bed + plink-temporary.bim + plink-temporary.fam
written.
356 variants loaded from .bim file.
6 people (0 males, 0 females, 6 ambiguous) loaded from .fam.
Ambiguous sex IDs written to plink.nosex .
6 phenotype values loaded from .fam.
Using 1 thread.
Before main variant filters, 6 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.520599.
.map and .ped files are an extremely old and inefficient format. plink2 and its associated pgen files are a much better choice now.
Similarly, vcftools is deprecated and shouldn’t be used. Use bcftools instead.
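As a rough illustration of that suggestion, here is a minimal sketch of a bcftools + plink2 version of the conversion. It assumes a hypothetical tab-separated file targets.tsv (CHROM and POS of the panel SNPs) and an output prefix called panel; neither appears in the original post. The run helper and the DRY_RUN variable are only there so the commands can be inspected before anything is executed.

```shell
# Sketch only: targets.tsv and the "panel" prefix are assumptions, not files from the post.

# Print each command; execute it unless DRY_RUN=1.
run() {
    printf '%s\n' "$*"
    if [ "${DRY_RUN:-0}" != "1" ]; then "$@"; fi
}

# Keep only biallelic SNPs at the panel's target positions, import them
# into plink2's pgen format, then report missingness.
#   $1 = merged VCF
#   $2 = targets file, CHROM<TAB>POS per line (hypothetical)
#   $3 = output prefix
panel_to_pgen() {
    local vcf=$1 targets=$2 out=$3
    run bcftools view -T "$targets" -m2 -M2 -v snps -Oz -o "$out.targets.vcf.gz" "$vcf"
    run plink2 --vcf "$out.targets.vcf.gz" --make-pgen --out "$out"
    run plink2 --pfile "$out" --missing --out "$out"
}
```

Calling it as `DRY_RUN=1 panel_to_pgen merged_vcfs.vcf targets.tsv panel` only prints the three commands; dropping DRY_RUN=1 runs them. Depending on how the contigs are named in the VCF, plink2 may need extra options, so treat this as a starting point rather than a finished pipeline.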
Please also format your code with the code-formatting button (the one with 1s and 0s). Your post is hard to read at the moment.
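On the variant count itself: the logs above suggest the merged VCF holds 367 sites and that 356 were written after vcftools kept only biallelic loci, so the extra records are probably off-target or non-SNP calls rather than a counting error. PLINK's "assuming multichar alleles" message points the same way. A quick way to check is to summarise the records before converting. The sketch below uses only awk and assumes an uncompressed VCF; the function name vcf_site_summary is made up for this example.

```shell
# Summarise the records in an uncompressed VCF: total sites, multiallelic
# sites, and sites where REF or any ALT allele is longer than one base
# (a rough proxy for indels/MNPs/complex calls).
vcf_site_summary() {
    awk -F'\t' '
        /^#/ { next }                        # skip header lines
        {
            total++
            if ($5 ~ /,/) multi++            # more than one ALT allele
            n = split($5, alts, ",")
            nonsnp_here = (length($4) != 1)  # REF longer than one base
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != 1) nonsnp_here = 1
            if (nonsnp_here) nonsnp++
        }
        END {
            printf "total=%d multiallelic=%d non_snp=%d\n", total, multi, nonsnp
        }
    ' "$1"
}
```

Running `vcf_site_summary merged_vcfs.vcf` prints one line with the three counts. If per-sample VCFs were merged, sites called in only some samples may end up missing in the others, which could also explain part of the low genotyping rate.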