I have two sets of genomic data, one a list of SNPs and another a list of ChIP-seq-type peaks, each drawn from the same set of multiple samples. I'd like to be able to ask, for each sample, which SNPs overlap with the peaks in the same sample - and to do this fast enough that I can run a permutation test, randomizing the sample assignment of each SNP 100 or 1,000 times and counting the resulting overlaps, to get a statistical idea of which samples are enriched for SNP-peak overlaps. The data is organized as GRanges
objects, and I'm using plyranges
functions for analysis - but any other R package that could do it would be fine.
The problem is that none of the find_overlaps
-style functions can use metadata (e.g. sample ID) as part of their search process, so if I have 100 samples, then each SNP in each sample is checked against the peaks from all 100 samples, which makes produces a huge output table that is clogged with irrelevant comparisons, plus takes a lot of time. My inelegant workaround, as shown in the dummy code below, is to filter the overlap table to keep only hits where the SNV and peak came from the same sample (sample_id.x==sample_id.y
), and use that for counting.
The dummy code reproduces the issue: three samples, 100 SNPs each, randomly assigned from nt 1-2500 except for 25 SNPs of one sample (s03
) that are scattered in a window around position 1000; and four 20 bp peaks, of which one, detected in s03
, overlaps position 1000. On my laptop, it took 46 seconds to complete with only 100 permutations (note that the progress counter for the permutation loop adds essentially no time). It gets much, much longer if you scale up to 100+ samples with realistic numbers of SNPs and peaks! Any advice for a overlap-searching approach that can be constrained based on sample ID-type metadata would be appreciated.
Code:
library(tidyverse)
library(plyranges)
length <- 2500
ids <- c(rep('s01', 100), rep('s02', 100), rep('s03', 100))
snps <- GRanges(Rle('chr1'), IRanges(c(sample((1:length), 100, replace=T),
sample((1:length), 100, replace=T),
sample((980:1040), 25, replace=T),
sample((1:length), 75, replace=T)), width=1),
sample_id=ids)
peaks <- GRanges(Rle('chr1'), IRanges(c(500, 1000, 1000, 2000), width=20),
sample_id=c('s01', 's01', 's03', 's02'))
ov <- find_overlaps(snps, peaks) %>% print() # finds all overlaps, including between mismatched samples
ov <- filter(ov, sample_id.x==sample_id.y) %>% print() # only matched samples
n_permut <- 100 # number of times to permute sample IDs
startTime <- Sys.time()
for(s in unique(snps$sample_id)) {
print(s)
sample_overlaps <- length(filter(ov, sample_id.x==s))
permut_overlaps <- rep(0, n_permut) # vector for counting overlaps of permuted SNPs
for(i in 1:n_permut) {
# progress counter
if(i %% (n_permut/10) == 0) print(paste0((i/n_permut)*100, '% complete'))
snps_permut <- mutate(snps, sample_id=sample(ids, 300, replace=F))
ov_permut <- find_overlaps(snps_permut, peaks) %>% filter(sample_id.x==sample_id.y)
permut_overlaps[i] <- length(filter(ov_permut, sample_id.x==s))
}
p_permut <- length(which(permut_overlaps >= sample_overlaps))/length(permut_overlaps)
print(paste0('sample ', s, ': fraction of permuted SNP distributions with higher overlap = ', p_permut))
}
endTime <- Sys.time()
print(paste0('Time elapsed: ', endTime-startTime))