Fast LD computation from a VCF file using vcftools, bcftools and tomahawk. Install tomahawk from here.
Following commands are run on the Linux command-line.
For LD computation, pairwise comparisons are performed between all SNPs, therefore it is essential to thin down the number of SNPs. My dataset has about 18 million SNPs. My aim is to bring it down to 20,000 or so. Use vcftools to thin down a VCF file.
vcftools --vcf snp.vcf --recode --recode-INFO-all --thin 50000 --out "snp-thin"
VCFtools - 0.1.15
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf snp.vcf
--recode-INFO-all
--thin 50000
--out snp-thin
--recode
After filtering, kept 64 out of 64 Individuals
Outputting VCF file...
After filtering, kept 26205 out of a possible 18793942 Sites
Run Time = 135.00 seconds
mv "snp-thin.recode.vcf" "snp-thin.vcf"
The argument --thin 50000
means to keep 1 SNP every 50,000 bases. Depending on the number of SNPs in the VCF file, you want to increase or decrease this. I am aiming for about 20,000 or so SNPs in the thinned file.
Convert VCF to BCF.
bcftools view snp.vcf -O b -o snp.bcf
Instructions on using tomahawk is detailed here. Convert BCF to the tomahawk format. Import filters filter out sites with missingness > 20% and HW p-value < 0.001.
tomahawk import -i snp.bcf -o snp -m 0.2 -h 0.001
Tomahawk is used to compute LD.
Arguments: -u
(unphased), -d
(progress), -f
(fast computation), -i
(input file), -o
(output file name prefix), -r
(min R2 cut-off). See tomahawk help for more descriptions.
tomahawk calc -udfi snp.twk -o snp -a 5 -r 0.1 -P 0.1 -C 1 -t 1
The .two
output is binary and needs to be exported to a text file. The -I
arguments restricts the output to a specified chromosome. So, -I chr1
(UCSC) of -I 1
(Ensembl) exports only chromosome 1. It is probably better to have separate files for each chromosome as the files are a more managable size to read into R. The LD metrics are computed between SNPs within chromosomes anyway.
tomahawk view -i snp.two > snp.ld
tomahawk view -i snp.two -I 1 > snp-1.ld
Now we move into R. First we load some libraries and define some plotting functions to be used later.
# load libraries
library(dplyr)
library(ggplot2)
# define plotting functions
#' @title plotPairwiseLD
#' @description Plots R2 heatmap across the chromosome (like Haploview)
#' @param dfr A data.frame with minimum CHROM_A, POS_A, CHROM_B, POS_B and R2.
#' An output from tomahawk works.
#' @param chr A chromosome name.
#' @param xlim A two number vector specifying min and max x-axis limits. Any one or both can be defaulted by specifying NA.
#' @param ylim A two number vector specifying min and max y-axis limits. Any one or both can be defaulted by specifying NA.
#' @param minr2 A value between 0 and 1. All SNPs with R2 value below this
#' value is excluded from plot.
#'
plotPairwiseLD <- function(dfr,chr,xlim=c(NA,NA),ylim=c(NA,NA),minr2) {
if(missing(dfr)) stop("Input data.frame 'dfr' missing.")
if(!missing(chr)) {
ld <- filter(ld,CHROM_A==get("chr") & CHROM_B==get("chr"))
}
ld <- filter(ld,POS_A<POS_B)
if(!missing(minr2)) {
ld <- filter(ld,R2>get("minr2"))
}
ld <- ld %>% arrange(R2)
ld$x <- ld$POS_A+((ld$POS_B-ld$POS_A)/2)
ld$y <- ld$POS_B-ld$POS_A
ld$r2c <- cut(ld$R2,breaks=seq(0,1,0.2),labels=c("0-0 - 0.2","0.2 - 0.4",
"0.4 - 0.6","0.6 - 0.8",
"0.8 - 1.0"))
ld$r2c <- factor(ld$r2c,levels=rev(c("0-0 - 0.2","0.2 - 0.4",
"0.4 - 0.6","0.6 - 0.8",
"0.8 - 1.0")))
ggplot(ld,aes(x=x,y=y,col=r2c))+
geom_point(shape=20,size=0.1,alpha=0.8)+
scale_color_manual(values=c("#ca0020","#f4a582","#d1e5f0","#67a9cf","#2166ac"))+
scale_x_continuous(limits=xlim)+
scale_y_continuous(limits=ylim)+
guides(colour=guide_legend(title="R2",override.aes=list(shape=20,size=8)))+
labs(x="Chromosome (Bases)",y="")+
theme_bw(base_size=14)+
theme(panel.border=element_blank(),
axis.ticks=element_blank()) %>%
return()
}
#' @title plotDecayLD
#' @description Plots R2 heatmap across the chromosome (like Haploview)
#' @param dfr A data.frame with minimum CHROM_A, POS_A, CHROM_B, POS_B and R2.
#' An output from tomahawk works.
#' @param chr A chromosome name.
#' @param xlim A two number vector specifying min and max x-axis limits. Any one or both can be defaulted by specifying NA.
#' @param ylim A two number vector specifying min and max y-axis limits. Any one or both can be defaulted by specifying NA.
#' @param avgwin An integer specifying window size. Mean R2 is computed within windows.
#' @param minr2 A value between 0 and 1. All SNPs with R2 value below this
#' value is excluded from plot.
#'
plotDecayLD <- function(dfr,chr,xlim=c(NA,NA),ylim=c(NA,NA),avgwin=0,minr2) {
if(missing(dfr)) stop("Input data.frame 'dfr' missing.")
if(!missing(chr)) {
ld <- filter(ld,CHROM_A==get("chr") & CHROM_B==get("chr"))
}
ld <- filter(ld,POS_A<POS_B)
if(!missing(minr2)) {
ld <- filter(ld,R2>get("minr2"))
}
ld <- ld %>% arrange(R2)
ld$dist <- ld$POS_B-ld$POS_A
if(avgwin>0) {
ld$distc <- cut(ld$dist,breaks=seq(from=min(ld$dist),to=max(ld$dist),by=avgwin))
ld <- ld %>% group_by(distc) %>% summarise(dist=mean(dist),R2=mean(R2))
}
ggplot(ld,aes(x=dist,y=R2))+
geom_point(shape=20,size=0.15,alpha=0.7)+
geom_smooth()+
scale_x_continuous(limits=xlim)+
scale_y_continuous(limits=ylim)+
labs(x="Distance (Bases)",y=expression(LD~(r^{2})))+
theme_bw(base_size=14)+
theme(panel.border=element_blank(),
axis.ticks=element_blank()) %>%
return()
}
Then we read the file.
# read file
ld <- read.delim("snp.ld",sep="\t",comment.char="#")
The output looks like below:
head(ld)
FLAG CHROM_A POS_A CHROM_B POS_B REF_REF REF_ALT ALT_REF
1 10 1 359 1 100777 103.00000 8.526513e-14 5.000000
2 10 1 100777 1 359 103.00000 8.526513e-14 5.000000
3 2 1 359 1 452615 97.48765 5.512354e+00 10.512354
4 2 1 452615 1 359 97.48765 5.512354e+00 10.512354
5 6 1 359 1 1454410 49.07769 5.392231e+01 1.922308
6 6 1 1454410 1 359 49.07769 5.392231e+01 1.922308
ALT_ALT D Dprime R R2 P ChiSqModel
1 20.00000 0.12573242 1.0000000 0.8734775 0.7629629 4.440203e-19 2.296495
2 20.00000 0.12573242 1.0000000 0.8734775 0.7629629 4.440203e-19 2.296495
3 14.48765 0.08266716 0.6574848 0.5742982 0.3298184 5.573234e-08 1.993300
4 14.48765 0.08266716 0.6574848 0.5742982 0.3298184 5.573234e-08 1.993300
5 23.07769 0.06280179 0.8070154 0.3235735 0.1046998 1.876711e-04 7.703488
6 23.07769 0.06280179 0.8070154 0.3235735 0.1046998 1.876711e-04 7.703488
ChiSqTable
1 97.65926
2 97.65926
3 42.21675
4 42.21675
5 13.40157
6 13.40157
Plot pairwise LD plot.
plotPairwiseLD(ld)
Fig: Pairwise LD plot across a whole chromosome.
plotPairwiseLD(ld,minr2=0.2)
Fig: Pairwise LD plot with a min R2 limit of 0.2.
plotPairwiseLD(ld,minr2=0.2,ylim=c(0,10^7))
Fig: Pairwise LD plot with a min R2 limit of 0.2 and y-axis max limit set to 10^7.
plotDecayLD(ld,minr2=0.25)
Fig: LD decay plot with a min R2 limit of 0.25.
plotDecayLD(ld,minr2=0.25,avgwin=100000)
Fig: LD decay plot with a min R2 limit of 0.25 and R2 values averaged within 10 Kb windows.
End of document
Thanks a lot for the great tutorial. Let me a few questions about
plotPairwiseLD
function please? 1.Could you please provide links to theget()
,filter()
andarrange()
functions documentation (or may be add a short explanations for them in comments)? 2. Why we extract only elemets withPOS_A<POS_B
? 3. Why during distance calculation for the x axis we useld$x <- ld$POS_A+((ld$POS_B-ld$POS_A)/2)
formula, but for y axis justld$y <- ld$POS_B-ld$POS_A
? Thank you again!include.lowest = T
helped me a lot, otherwise i would get:So i used the code:
I don't understand why but a lot of the flags in the tomahawk (version-0.7.0;) program are not working anymore. I've tried this tutorial and I had to change a lot of them.
In the end I had that
Also, the column name was not "POS_A", but "posA" for me.
When I head the data:
Can someone please tell me if there is a paper about tomahawk?