Short way (quick)
If you already have a VCF, you can use VCFtools to run a simple linkage disequilibrium (LD) analysis: http://vcftools.sourceforge.net/documentation.html#ld
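As a rough sketch of the short way, the wrapper below calls VCFtools with --geno-r2, which computes r2 from unphased genotypes. The function name vcftools_ld, the 1 Mb window, and the file names in the usage line are my own placeholders, and it assumes a bgzip-compressed VCF; adjust to your data.

```shell
# vcftools_ld INPUT.vcf.gz OUT_PREFIX
# Pairwise genotype r2 for SNPs within 1 Mb of each other.
# Results are written to OUT_PREFIX.geno.ld
vcftools_ld() {
  # Refuse to run if the input file is missing.
  [ -f "$1" ] || { echo "input not found: $1" >&2; return 1; }
  # Refuse to run if vcftools is not installed.
  command -v vcftools >/dev/null 2>&1 || { echo "vcftools not on PATH" >&2; return 1; }
  vcftools --gzvcf "$1" --geno-r2 --ld-window-bp 1000000 --out "$2"
}

# Usage (hypothetical file names):
#   vcftools_ld My.vcf.gz MyLD
```

If your data are phased, --hap-r2 is the haplotype-based alternative.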
Long way (more flexible and comprehensive)
Another, more roundabout approach is to convert your data from VCF to PLINK format, where you can do a more comprehensive analysis. You could follow my tutorial (Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)), which covers downloading all of the 1000 Genomes Phase III data in VCF format and then converting it to PLINK format.
Here is further information for conducting LD analysis in PLINK:
If you follow my tutorial, you'll have the entire 1000 Genomes Phase III samples in PLINK, and from there you can easily filter in/out your samples of interest. See here for details: https://www.cog-genomics.org/plink/1.9/filter
To use a dataset correctly in PLINK, you should create a custom FAM file that matches your dataset and then specify it when performing the LD analysis with --fam MyCustom.fam. A FAM file contains 6 columns:
- Family ID (FID)
- Individual ID (IID)
- Paternal ID (PID)
- Maternal ID (MID)
- Gender (1, male; 2, female)
- Phenotype/Disease status (1, control; 2, case/disease)
The file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped, which you'll get if you follow my tutorial, already contains this information, so use that and filter out what you don't need.
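One possible way to cut that PED file down to a FAM is sketched here. It assumes the PED is tab-delimited, has a single header row, and that its first six columns are already in FAM order; check your copy before relying on it. The function name ped_to_fam is just a placeholder.

```shell
# ped_to_fam INPUT.ped
# Print the first six columns of a tab-delimited PED file as
# space-separated FAM lines, skipping the header row.
ped_to_fam() {
  awk -F'\t' 'NR > 1 { print $1, $2, $3, $4, $5, $6 }' "$1"
}

# Usage:
#   ped_to_fam 20130606_g1k.ped > MyCustom.fam
```

You would still need to drop any samples that are not in your dataset and make sure the rows are in the same order as the samples that PLINK reads.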
Also, when reading your data from VCF/BCF into PLINK, it is critical that you specify a sample order file so that PLINK reads the samples in the order that you want and in the order that matches the samples as listed in your custom FAM. A sample command is:
plink --noweb \
--bcf My.bcf \
--keep-allele-order \
--indiv-sort file SampleSort.list \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out PlinkDataForLD
The file mentioned in this command after the --indiv-sort file command-line parameter, SampleSort.list, contains 2 columns (FID IID), like this:
0 NA0165
0 NA0169
et cetera
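If you already have the custom FAM, one way to keep the two files in step is to generate the sort list from it. This is only a sketch: fam_to_sort_list is a made-up helper name, and it assumes the data are read with --const-fid as in the command above, so the FID is written as 0 and the IID is taken from column 2 of the FAM.

```shell
# fam_to_sort_list INPUT.fam
# Emit "0 IID" for every sample, in the same order as the FAM file.
fam_to_sort_list() {
  awk 'NF >= 2 { print 0, $2 }' "$1"
}

# Usage:
#   fam_to_sort_list MyCustom.fam > SampleSort.list
```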
Then, to do LD analysis in PLINK:
plink --bfile PlinkDataForLD \
--r2 --ld-window-kb 1000 \
--ld-window 100000 \
--ld-window-r2 0 \
--fam MyCustom.fam
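With no --out given, the results go to plink.ld, whose columns are CHR_A, BP_A, SNP_A, CHR_B, BP_B, SNP_B and R2. As a small post-processing sketch, the helper below (strong_ld_pairs, my own name, with an arbitrary default cut-off of 0.8) lists the SNP pairs at or above a chosen r2.

```shell
# strong_ld_pairs FILE.ld [MIN_R2]
# Print SNP_A, SNP_B and R2 for pairs with R2 >= MIN_R2 (default 0.8).
# Assumes R2 is the 7th column, as in the default --r2 table output.
strong_ld_pairs() {
  awk -v min="${2:-0.8}" 'NR > 1 && $7 + 0 >= min + 0 { print $3, $6, $7 }' "$1"
}

# Usage:
#   strong_ld_pairs plink.ld 0.5
```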
Hi Kevin, I have bim, bed, and fam files from UK Biobank. How do I add a phenotype? After I add the phenotype, how can I get ped and map files for gPLINK?
Hi Kevin,
Thanks for your reply. Actually, I tried vcftools, but I got negative values for r2, which of course does not make sense! I am going to try the long way approach, and I will let you know how it goes.
Thanks, T
Hi Tarek,
Okay, on reflection, you may not require the complex part of creating the custom FAM, considering that all of your samples will be 'healthy' 1000 Genomes samples. The LD analysis will just look at all samples in the dataset and not use information on phenotype, gender, etc.
In that case, you possibly just need to do this:
PLINK is a very good and comprehensive analysis tool, though.
Respond here if you need help or want me to look at anything.
Kevin
Hi Kevin, I was able to do the analysis using a combination of bedtools, vcftools, and plink. I had 9000 SNPs distributed across 4 different genes, which I did LD analysis for. Now I have the LD analysis result file (plink.ld). I want to view the analysis graphically; which tool do you recommend?
Thanks, Tarek
Sorry, if I have bed/bim/fam files from 16S rRNA data and a phenotypes file like the one in the screenshot, does that mean that I should correlate these files with the phenotypes file as OTUs?
Yes, this is an OTU; I should first convert that to PLINK format.
Hi Kevin,
Is your answer the same if I have WES vcf files processed by GATK from individuals from a large family - with both affected and unaffected individuals?
Would I simply need to combine these files together and then convert the vcf file into .ped format with plink?
What do you recommend doing after that? I'm having trouble figuring out how to place markers within the file to use for linkage downstream.
Any help would be greatly appreciated.
Hey, yes, you adopt the same general approach that I mention in my answer. I think that you'd need bcftools merge to merge the files. PLINK, then, has family-specific tests that you could use.
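A minimal sketch of that merge step, assuming one bgzip-compressed single-sample VCF per family member. The function name merge_family_vcfs and the file names in the usage lines are placeholders; bcftools merge needs each input to be indexed, so the helper indexes them first.

```shell
# merge_family_vcfs OUT.vcf.gz IN1.vcf.gz IN2.vcf.gz [...]
# Index each per-sample VCF, then merge them into one multi-sample VCF.
merge_family_vcfs() {
  # Need an output name plus at least two inputs.
  [ "$#" -ge 3 ] || { echo "need an output and at least two input VCFs" >&2; return 2; }
  command -v bcftools >/dev/null 2>&1 || { echo "bcftools not on PATH" >&2; return 1; }
  out="$1"
  shift
  for f in "$@"; do
    # -t writes a tabix (.tbi) index, -f overwrites an existing one.
    bcftools index -f -t "$f" || return 1
  done
  bcftools merge -Oz -o "$out" "$@"
}

# Usage (hypothetical file names):
#   merge_family_vcfs family.vcf.gz father.vcf.gz mother.vcf.gz child.vcf.gz
#   plink --vcf family.vcf.gz --double-id --make-bed --out family
```

After the conversion you would still supply the real pedigree and affection status through a custom FAM, as described in the answer above.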