EDIT: I tried the same command with PLINK 1.07 and it works fine, but significantly slower. Another issue is that I cannot get D' values using 1.07, I can only get the r2 values.
NOTE: I am very new to PLINK.
I was trying to use PLINK 1.9 to get D' values between all variants in my $snpfile
(code below). The file contains 2770 variants but for some reason, the PLINK logs show that it is extracting 2771 variants. Due to this, the resultant table returned is not what I am expecting. Does anyone have an idea as to why this might be happening?
The command I used is as follows:
$plink \
--bfile $gfile \
--keep $indfile \
--extract $snpfile \
--r2 dprime\
--ld-window $ldwin \
--ld-window-kb 100000 \
--ld-window-r2 0 \
--out $ldmatrixfile
I extended the --ld-window-kb
to include a very large region so I can get the pairwise-LD value between every SNP in my list. The $ldwin
value equals the number of SNPs.
This is the log file output (I have removed any absolute paths):
PLINK v1.90b6.18 64-bit (16 Jun 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to Chrom19/LD_mat219.log.
Options in effect:
--bfile path/to/bed,bim,fam files
--extract Chrom19/snps19.txt
--keep Chrom19/controls19.txt
--ld-window 2770
--ld-window-kb 100000
--ld-window-r2 0
--out Chrom19/LD_mat219
--r2 dprime
773942 MB RAM detected; reserving 386971 MB for main workspace.
190525 variants loaded from .bim file.
333531 people (0 males, 0 females, 333531 ambiguous) loaded from .fam.
Ambiguous sex IDs written to Chrom19/LD_mat219.nosex .
--extract: 2771 variants remaining.
--keep: 10000 people remaining.
Warning: At least 29994 duplicate IDs in --keep file.
Using up to 23 threads (change this with --threads).
Before main variant filters, 10000 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate in remaining samples is 0.991546.
2771 variants and 10000 people pass filters and QC.
Note: No phenotypes present.
--r2 dprime... done.
Results written to Chrom19/LD_mat219.ld .
Are you sure the $snpfile contains that number of variants? What's the result of
wc -l $snpfile
?Yes I have confirmed this. The file contains 2770 variants.
Are you sure your .bim variant IDs are unique?
There are some duplicates in the .bim file, but the separate list of variants that I am providing with the --extract modifier has no duplicates. So I assumed that PLINK would filter based on the list of SNPs I provide in the --extract command. Do you think this might be an issue?
Furthermore, if there were duplicates, wouldn't I expect a lower number of variants being analysed by PLINK instead of a higher number?
Your second paragraph is backwards. If one of the variant IDs in your —extract files appears twice in the .bim, and the other 2769 variants appear once each, —extract would select 2771 variants.