Create linkage disequilibrium file
6.5 years ago
Dhana ▴ 110

Hi,

I am trying to calculate linkage disequilibrium (LD) values for a group of SNPs, to be used as input for the tool Locus Explorer. I followed the tutorial mentioned in their help section to create an LD file from 1000 Genomes data. When I try to replicate their example (example 2.1.2, "Calculate LD for list of SNP against all SNPs within 1000kb region"), I get the following PLINK error:

Note: No phenotypes present.

Error: Duplicate ID '.'.

I checked for possible solutions, and most of them suggested using the

--list-duplicate-vars

flag of PLINK. However, in my case there are no duplicate IDs, and even using the flag results in the same error. I intend to create a file like this (InputFileFormat - LDfile):

CHR_A - Chromosome on which Index SNP is located (n.b. do not include "chr"). e.g. 2, 23

BP_A - Index SNP start coordinate (Hg19, do not include chromosome or end coordinate for in/del variants). e.g. 104356185

SNP_A - Index SNP ID

CHR_B - Chromosome for SNP in LD with Index SNP (SNP_A)

BP_B - Start coordinate (Hg19, do not include chromosome or end coordinate for in/del variants) of SNP in LD with Index SNP (SNP_A). e.g. 104315667

SNP_B - ID of SNP in LD with Index SNP (SNP_A). e.g. rs10786679, chr10:104329988:D

R2 - LD score between SNP_A and SNP_B (0 to 1). e.g. 1, 0.740917

Note: Lead SNP must be defined relative to itself for plotting purposes, e.g.:

CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2

2 173309618 rs13410475 2 173309618 rs13410475 1

2 173309618 rs13410475 2 172827293 rs148800555 0.0906124

Can anyone suggest a possible solution to this (with or without using PLINK)?

Thanks.

Tags: linkage disequilibrium, plink, SNP, LD, LocusExplorer

Do you mind sharing your SNP file?

Try using the --biallelic-only strict flag to drop SNPs that share the same position?

I have added an issue at GitHub; we will update the tutorial file in the next release, 0.7.1.


Hi, thank you for your reply. I tried the --biallelic-only strict flag as well, but I still get the same error message.

The data is not my own, so I am not at liberty to share it publicly. A sample of the SNP file is given below:

snplist.txt

rs148170422

rs547289895

rs543765287

rs9747082

rs574351215

rs8071558

rs6501437

6.5 years ago
zx8754 12k

Here is a guide on how to get LD files when we have the duplicated-ID issue in the 1000 Genomes VCF files. I am using your example snplist.txt file.

First, we need to get the 1000 Genomes VCF for the region of interest; I usually add a 250 kb flank around the SNPs in snplist.txt.

tabix -fh ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1-71324000 > genotypes.vcf
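Before running PLINK, it can be worth checking how widespread the missing-ID problem is in the downloaded region. This is a quick sketch (not part of the original tutorial) that counts variants whose ID column, the third field of a VCF, is ".":

```shell
# Count variants with no rsID (ID column is '.'), skipping VCF header lines
grep -v '^#' genotypes.vcf | awk '$3 == "."' | wc -l
```

If this prints a non-zero count, PLINK will hit the Duplicate ID '.' error below.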

As you described, this gives us an error:

plink --vcf  genotypes.vcf \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \
--out myLD

# ...
# Error: Duplicate ID '.'.

Some SNP names are actually dots ("."); they don't have an rsID attached to them. So we need to give them unique names.

Convert to PLINK binary format:

plink --vcf  genotypes.vcf \
--make-bed \
--out genotypes

Here is the cause of the problem, dots as SNP IDs:

head -n5 genotypes.bim
# 17      .       0       56      T       C
# 17      .       0       78      C       G
# 17      .       0       355     A       G
# 17      .       0       684     T       C
# 17      rs62053745      0       828     T       C

Run this R script to make unique names:

Rscript makeSNPnamesUnique.R genotypes.bim 
head -n5 genotypes.bim 
# 17      17_56_T_C       0       56      T       C
# 17      17_78_C_G       0       78      C       G
# 17      17_355_A_G      0       355     A       G
# 17      17_684_T_C      0       684     T       C
# 17      rs62053745      0       828     T       C
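If R is not to hand, the same renaming can be sketched directly with awk (my own suggestion, not part of the original tutorial; it assumes the standard 6-column .bim layout: chromosome, ID, cM, position, A1, A2):

```shell
# Replace '.' IDs with chr_pos_A1_A2; existing rsIDs are left untouched
awk 'BEGIN{OFS="\t"} $2 == "." {$2 = $1"_"$4"_"$5"_"$6} {print}' genotypes.bim > genotypes.tmp \
  && mv genotypes.tmp genotypes.bim
```

This produces IDs such as 17_56_T_C, the same shape as in the example above.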

Now that the SNP IDs are fixed, we run the PLINK LD command as usual:

plink --bfile genotypes \
--r2 \
--ld-snp-list snplist.txt \
--ld-window-kb 1000 \
--ld-window 99999 \
--ld-window-r2 0 \
--out myLD

Note: If the output file is too big, consider setting a higher R2 filter value, for example --ld-window-r2 0.2, meaning only pairs with an R2 above 0.2 are output.
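If a run with --ld-window-r2 0 has already finished, the existing output can also be thinned after the fact instead of re-running PLINK. A sketch with awk, keeping the header plus rows whose R2 (the 7th whitespace-delimited column of a PLINK .ld file) exceeds 0.2:

```shell
# Keep the header line plus any pair with R2 > 0.2
awk 'NR == 1 || $7 > 0.2' myLD.ld > myLD.filtered.ld
```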

Check the output

wc -l myLD.ld
# 49172 myLD.ld

head -n5 myLD.ld
#  CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
#     17          834         rs9747082     17           56         17_56_T_C   0.00137027
#     17          834         rs9747082     17          355        17_355_A_G   0.00151223
#     17          834         rs9747082     17          684        17_684_T_C   0.00127126
#     17          834         rs9747082     17          828        rs62053745     0.678207
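As the input format description above notes, LocusExplorer expects each lead SNP to be paired with itself (R2 = 1). A quick sanity check (my own sketch, assuming SNP_A in column 3 and SNP_B in column 6 of the .ld file) lists the self-pairs; any lead SNP missing from this output would need its self-pair row added by hand:

```shell
# List rows where a lead SNP is paired with itself
awk '$3 == $6' myLD.ld
```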

Finally, we can use this file as input for LocusExplorer.

If you have any further questions, please post an issue at the LocusExplorer GitHub repository.


Thank you Tokhir. The solution you suggested worked for generating the file, but the generated file is so huge that it cannot be uploaded into Locus Explorer. For my original file, containing about 34 SNPs, the generated LD file was around 50 MB (559,621 LD values). To cross-check, I tried to create a similar file for the example data provided with Locus Explorer:

https://github.com/oncogenetics/LocusExplorer/blob/master/Data/CustomDataExample/Association.txt

This resulted in an LD file of about 3.5 GB. The original LD file provided as an example with the tool was around 100 kB in size and contained about 1,492 LD values, whereas the newly generated one contains 39,534,086 LD values.

Do you have any suggestions for this scenario?

I would suggest changing --ld-window-r2 0 to --ld-window-r2 0.2, i.e. keeping only pairs with an R2 above 0.2; that should make the output files smaller.

Also, read about the other PLINK LD filters: --ld-window, --ld-window-kb.
