The contents of those files are explained in the README; however, I'm not sure how well supported that format specification is (if it is even defined as a formal specification).
I have an imputed 1000 Genomes Phase III dataset on my computer that I utilise for various projects, including LD analyses, when needed. If you follow this tutorial, you'll also have this imputed dataset: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
Once you have all of those files, generating an LD heatmap for any region is as simple as this, taken from my notes:
1, extract variants overlapping the region of interest +/- 1Kbp from the 1000 Genomes Phase III data
bcftools view chr3.1kg.phase3.v5.vcf.gz -R Gene_hg19.bed | \
bcftools annotate -Ob -I +'%CHROM:%POS:%REF:%ALT' > Region.bcf ;
bcftools index Region.bcf ;
Here, the co-ordinates of the region of interest are stored in the BED file. I also set the VCF ID field to CHROM:POS:REF:ALT if it is not already set.
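For reference, a minimal sketch of what Gene_hg19.bed could look like, assuming the chr3 region used in step 4 with a 1 Kbp flank. BED co-ordinates are 0-based and half-open, and the b37 1000 Genomes VCFs typically name the chromosome '3' rather than 'chr3', so adjust both to your own data:
# hypothetical BED entry: chr3 region from step 4, padded by 1 Kbp either side
printf "3\t135363583\t135400507\n" > Gene_hg19.bed ;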
2, extract different populations
bcftools view -Ob --samples-file EAS.list Region.bcf > EAS.bcf ;
bcftools index EAS.bcf ;
Here, EAS.list comprises a list of sample IDs representing the East Asian (EAS) super-population. I identified these from 20130606_g1k.ped; one way to derive such a list is sketched below.
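This sketch assumes that the population code sits in column 7 of 20130606_g1k.ped and that the EAS super-population comprises CHB, JPT, CHS, CDX and KHV; please verify both against your copy of the file:
# pull sample IDs (column 2) whose population code (assumed to be column 7) is an EAS population
awk -F '\t' '$7 ~ /^(CHB|JPT|CHS|CDX|KHV)$/ {print $2}' 20130606_g1k.ped > EAS.list ;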
3, convert the BCF file to PLINK format (requires PLINK >= 1.9)
plink --noweb --bcf EAS.bcf \
--keep-allele-order \
--vcf-idspace-to _ \
--const-fid \
--allow-extra-chr 0 \
--split-x b37 no-fail \
--make-bed \
--out EAS ;
4, export data to HaploView format
plink --noweb --bfile EAS \
--chr 3 --from-bp 135364584 --to-bp 135399507 \
--snps-only no-DI \
--recodeHV \
--out EAS_Haploview ;
5, determine haplotype blocks in HaploView
I have written the following notes:
Start HaploView. In the ‘Welcome to HaploView’ window, select the ‘LINKAGE Format’ tab. Then, click on the ‘browse’ button and select, as the ‘Data File’, the PED file that you generated with PLINK.
Click ‘OK’.
Select the ‘LD Plot’ tab to generate an LD heatmap. To save the plot as a Portable Network Graphics (PNG) file, go to File > Export current tab to PNG
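As an aside, HaploView can also be run without the GUI. A rough sketch, assuming the Haploview 4.2 command-line flags (check java -jar Haploview.jar -help against your own install before relying on these):
# headless run on the PLINK-exported files; -blockoutput GAB writes Gabriel-method blocks
java -jar Haploview.jar -nogui \
-pedfile EAS_Haploview.ped \
-info EAS_Haploview.info \
-blockoutput GAB \
-png ;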
6, LD heatmap in R
Then, you can read that HaploView-exported PNG back into R to mess around with it:
require(png)
# 4-row layout with no margins; the heatmap occupies the first panel
par(mar=c(0,0,0,0), mfrow=c(4,1))
# read the HaploView-exported PNG and draw it into an empty plotting region
haploblocks <- readPNG("EAS_Haploblocks.png")
plot(1:2, type='n', main="", xlab="", ylab="", xaxt="n", yaxt="n", axes=FALSE)
lim <- par()
rasterImage(haploblocks, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4]-0.2)
# add a grey-scale gradient legend for r^2 (0 to 1) at the given corner co-ordinates
require(SDMTools)
require(RColorBrewer)
points <- cbind(x=c(1.925, 1.96, 1.96, 1.925), y=c(1.5, 1.5, 1.1, 1.1))
legend.gradient(points, cols=brewer.pal(n=9, name="Greys"), title=bquote(R^2), limits=c(0,1))
NB - I also added the gene track via R, again with readPNG(); I had previously downloaded it from UCSC and saved it as a separate PNG.
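Purely as a sketch of how that panel could be drawn into the same layout (the file name here is hypothetical, standing in for whatever you exported from UCSC):
genetrack <- readPNG("UCSC_GeneTrack.png") # hypothetical file name
plot(1:2, type='n', main="", xlab="", ylab="", xaxt="n", yaxt="n", axes=FALSE)
lim <- par()
rasterImage(genetrack, lim$usr[1], lim$usr[3], lim$usr[2], lim$usr[4])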
Kevin
Thank you for your reply. I should have been more explicit; I wanted BED files with the LD blocks to use in my analyses.
Then, first, you have to identify the LD blocks, which are calculated at the SNP level and not at the base-pair level, per se.
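For what it's worth, PLINK 1.9 can also estimate Gabriel-style haplotype blocks directly and report their base-pair spans, which you could then convert to BED. A rough sketch, assuming the EAS fileset from step 3 (check the .blocks.det column layout on your own output before trusting the awk step):
# estimate haplotype blocks; EAS.blocks.det lists CHR, BP1, BP2 per block
plink --bfile EAS --blocks no-pheno-req --out EAS ;
# convert the block spans to a (0-based) BED file, skipping the header line
awk 'NR > 1 {print $1"\t"$2-1"\t"$3}' EAS.blocks.det > EAS_LD_blocks.bed ;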
Ah, I had no idea. That makes sense :)
Thank you for this tutorial! It helps immensely in my work. May I ask what imputation is? My beginner understanding is that we use reference datasets like 1000 Genomes to fill in missing genotypes in our data. So then how and why did you impute the 1000 Genomes data? Do we need to do it before all analyses?
Yes, that is the general procedure in imputation. We don't have to do it for the data that I used above, as it is already imputed against some other reference panel.
You don't technically have to do any imputation with your own data. We would do it if some of our SNPs failed and/or there was low SNP density / coverage across the region over which we wished to derive haplotypes and haplotype blocks.
Thank you for clarifying! I am getting lots of insights from you.