Hi I have a data frame with multiple columns indicating SNPs ID, chromosome number and position of the SNP (BP):
df1 SNP CHR BP
2 2 rs9391724 6 31320795 5.41323 2 3.83E-07 6.417103
3 3 rs147949474 4 100738958 5.38602 3 5.74E-07 6.241012
4 4 rs3819285 6 31322742 5.32917 4 7.65E-07 6.116073
5 5 rs116548975 6 31320833 5.29763 5 9.57E-07 6.019163
6 6 rs138441284 4 100338276 5.26365 6 1.15E-06 5.939982
7 7 rs9391848 6 31320145 5.24588 7 1.34E-06 5.873035
8 8 rs181964803 6 31319900 5.1553 8 1.53E-06 5.815043
9 9 rs1377457 11 61144652 5.06155 9 1.72E-06 5.763891
10 10 rs9266070 6 31319618 5.03007 10 1.91E-06 5.718133
11 11 rs28835675 19 6807762 4.99524 11 2.11E-06 5.676741
12 12 rs9266066 6 31319525 4.98046 12 2.30E-06 5.638952
13 13 rs79215153 4 100398008 4.97751 13 2.49E-06 5.60419
14 14 rs368752130 6 31320017 4.90517 14 2.68E-06 5.572005
15 15 rs28873729 19 6807814 4.86929 15 2.87E-06 5.542042
16 16 rs5771860 22 49203073 4.8598 16 3.06E-06 5.514013
17 17 6:32451822 6 32451822 4.85825 17 3.25E-06 5.487684
18 18 rs12791961 11 61152028 4.80564 18 3.44E-06 5.462861
19 19 rs35957957 6 31320064 4.77551 19 3.64E-06 5.43938
20 20 rs10897155 11 61141164 4.77384 20 3.83E-06 5.417103
........
Then I have another data frame with various gene names with indicated the start and end of that particular gene:
df2 gene CHR. start end
2 ADCY9 16 4003388 4166186
3 ADORA2B 17 15848231 15879060
4 ATP2B4 1 203595689 203713209
5 C6 5 41142336 41261540
6 CD36 7 79998891 80308593
7 CD40LG X 135730352 135742549
8 CDH13 16 82660408 83830204
9 CFTR 7 117105838 117356025
10 CR1 1 207669492 207813992
.......
I want to subset the first data frame and get only the SNPs that fall into the gene coordinates based on the rows values from the second data frame (that has to be done for each gene) I wrote something like this but the process is taking long time so I suppose there is something wrong:
for(i in 1:nrow(df1))
for(i in 1:nrow(df2)){
subset(df1, df1$BP > df2$start_position & df1$BP < df2$end_position)
}
Any help highly appreciated, thanks
You should have a look at the
GenomicRanges
package from Bioconductor. Transform both the DFs toGRanges
objects and then have a look atsubsetByOverlaps()
I agree with @ATpoint; the
GenomicRanges
package provides an excellent data structure to keep positional genomic information in R, and the accompanying methods (includingsubsetByOverlaps()
) are very versatile.thank you very much, I am trying it out