Subsetting large list of SNPs with R
3
3
Entering edit mode
6.2 years ago
paolo002 ▴ 160

Hi I have a data frame with multiple columns indicating SNPs ID, chromosome number and position of the SNP (BP):

    df1        SNP     CHR     BP

    2   2   rs9391724   6   31320795    5.41323 2   3.83E-07    6.417103
    3   3   rs147949474 4   100738958   5.38602 3   5.74E-07    6.241012
    4   4   rs3819285   6   31322742    5.32917 4   7.65E-07    6.116073
    5   5   rs116548975 6   31320833    5.29763 5   9.57E-07    6.019163
    6   6   rs138441284 4   100338276   5.26365 6   1.15E-06    5.939982
    7   7   rs9391848   6   31320145    5.24588 7   1.34E-06    5.873035
    8   8   rs181964803 6   31319900    5.1553  8   1.53E-06    5.815043
    9   9   rs1377457   11  61144652    5.06155 9   1.72E-06    5.763891
    10  10  rs9266070   6   31319618    5.03007 10  1.91E-06    5.718133
    11  11  rs28835675  19  6807762     4.99524 11  2.11E-06    5.676741
    12  12  rs9266066   6   31319525    4.98046 12  2.30E-06    5.638952
    13  13  rs79215153  4   100398008   4.97751 13  2.49E-06    5.60419
    14  14  rs368752130 6   31320017    4.90517 14  2.68E-06    5.572005
    15  15  rs28873729  19  6807814     4.86929 15  2.87E-06    5.542042
    16  16  rs5771860   22  49203073    4.8598  16  3.06E-06    5.514013
    17  17  6:32451822  6   32451822    4.85825 17  3.25E-06    5.487684
    18  18  rs12791961  11  61152028    4.80564 18  3.44E-06    5.462861
    19  19  rs35957957  6   31320064    4.77551 19  3.64E-06    5.43938
    20  20  rs10897155  11  61141164    4.77384 20  3.83E-06    5.417103
    ........

Then I have another data frame with various gene names with indicated the start and end of that particular gene:

df2      gene         CHR.    start        end

    2   ADCY9           16  4003388     4166186
    3   ADORA2B         17  15848231    15879060
    4   ATP2B4           1  203595689   203713209
    5   C6               5  41142336    41261540
    6   CD36             7  79998891    80308593
    7   CD40LG           X  135730352   135742549
    8   CDH13           16  82660408    83830204
    9   CFTR             7  117105838   117356025
    10  CR1              1  207669492   207813992
    .......

I want to subset the first data frame and get only the SNPs that fall into the gene coordinates based on the rows values from the second data frame (that has to be done for each gene) I wrote something like this but the process is taking long time so I suppose there is something wrong:

for(i in 1:nrow(df1))
  for(i in 1:nrow(df2)){
    subset(df1, df1$BP > df2$start_position & df1$BP < df2$end_position)
  }

Any help highly appreciated, thanks

R range overlap range-overlap • 2.7k views
ADD COMMENT
1
Entering edit mode

You should have a look at the GenomicRanges package from Bioconductor. Transform both the DFs to GRanges objects and then have a look at subsetByOverlaps()

ADD REPLY
1
Entering edit mode

I agree with @ATpoint; the GenomicRanges package provides an excellent data structure to keep positional genomic information in R, and the accompanying methods (including subsetByOverlaps()) are very versatile.

ADD REPLY
0
Entering edit mode

thank you very much, I am trying it out

ADD REPLY
3
Entering edit mode
6.2 years ago
zx8754 12k

We can use data.table::foverlaps, see example:

# example input
snp <- read.table(text = "snp   chr bp
snp1    10  111
snp2    11  222
snp3    12  333
snp4    Y   444
", header = TRUE, stringsAsFactors = FALSE)

gene <- read.table(text = "gene chr start   end
gene1   1   10  200
gene2   11  100 300
gene3   12  300 350
gene4   X   1   100
", header = TRUE, stringsAsFactors = FALSE)

library(data.table)

# both genes and snps must have the same keys. Adding bp as start and end columns.
snp$start <- snp$end <- snp$bp

# now convert to data.table object with the same keys.
setDT(snp, key = c("chr", "start", "end"))
setDT(gene, key = c("chr", "start", "end"))

# and merge on overlap
foverlaps(snp, gene, type = "within", nomatch = 0L)
#    chr  gene start end  snp  bp i.end i.start
# 1:  11 gene2   100 300 snp2 222   222     222
# 2:  12 gene3   300 350 snp3 333   333     333
ADD COMMENT
0
Entering edit mode

Thank you very much for your help. It worked. And it is very fast.

ADD REPLY
2
Entering edit mode
6.2 years ago

Use write.table to write out the data frames to BED files snps.unsorted.bed and genes.unsorted.bed.

Use the chromosome/start/stop/ID convention and sort the BED files with BEDOPS sort-bed to snps.bed and genes.bed.

Then use BEDOPS bedmap to get the SNPs that overlap the genes:

$ bedmap --echo --echo-map --delim '\t' genes.bed snps.bed > answer.bed

The BED file answer.bed can be read back into R.

ADD COMMENT
1
Entering edit mode
6.2 years ago

First let's reuse some code from zx8754 in creating data frames:

snp <- read.table(text = "snp   chr bp
snp1    10  111
snp2    11  222
snp3    12  333
snp4    Y   444
", header = TRUE, stringsAsFactors = FALSE)

gene <- read.table(text = "gene chr start   end
gene1   1   10  200
gene2   11  100 300
gene3   12  300 350
gene4   X   1   100
", header = TRUE, stringsAsFactors = FALSE)

solution with sqldf ( this is inner join, if you want all SNP file data, do a left join):

> library(sqldf)
> sqldf("select * from snp s inner join gene g on s.chr= g.chr and s.bp between g.start and g.end")
   snp chr  bp  gene chr..5 start end
1 snp2  11 222 gene2     11   100 300
2 snp3  12 333 gene3     12   300 350
ADD COMMENT
0
Entering edit mode

thanks a lot to you too, it is also working

ADD REPLY

Login before adding your answer.

Traffic: 2640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6