Question

R Find which gene a SNP position is in

0

Entering edit mode

2.1 years ago

marit.hetland ▴ 50

Hi, I have a dataframe df1 where the first column SNP shows the position of a SNP in a reference genome. I have another dataframe df2 which lists the name of the genes in the reference genome, and the start and end positions of each gene. See examples below.

df1:

SNP,Ref,Alt
18,A,C
24,T,G
1489,G,C

df2:

gene,gene_start,gene_stop
gyrA,1,1289
parC,1290,2890

I want to compare the dataframes to find out which gene a SNP is occurring in, by seeing if the number in the SNP column is within the range (gene_start to gene_stop) of a gene. Desired output:

snp,ref,alt,gene
18,A,C,gyrA
24,T,G,gyrA
1489,G,C,parC

So far, I have managed to do this with a slow loop:

append_df <- setNames(data.frame(matrix(ncol = 4, nrow = 0)), c("snp","ref","alt","gene"))
for (snp_row in 1:nrow(df1)){
      SNP <- df1[snp_row, "SNP"]
      for (row in 1:nrow(df2)) {
        gene <- df2[row, "gene"]
        gene_start <- df2[row, "gene_start"]
        gene_stop <- df2[row, "gene_stop"]
        if(SNP >= gene_start & SNP <= gene_stop) {
          df <- data.frame(SNP,gene,gene_start,gene_stop)
          append_df <- rbind(append_df, df) 
          break 
        }
      }
    }

Can anyone suggest a better (prettier) and ideally faster way to achieve this? Thanks!

SNP R gene • 1.1k views

ADD COMMENT • link updated 2.1 years ago by dariober 15k • written 2.1 years ago by marit.hetland ▴ 50

1

Entering edit mode

Are you sure you don't want to use a proper variant annotation tools like VEP? If your genes have introns you may want to know whether a SNP hits an intron or an exon and what the expected consequence is going to be...?

ADD REPLY • link 2.1 years ago by dariober 15k

0

Entering edit mode

Look into GenomicRanges and findOverlaps() (Bioconductor).

ADD REPLY • link 2.1 years ago by Ram 44k

score 0 · Answer 1 · 2022-10-19

Using fuzzyjoinpackage :

library(fuzzyjoin)
library(dplyr)

## create column gene_start and gene_stop for df1

df1=data.frame(SNP=c(18,24,1489),Ref=c("A","T","G"),Alt=c("C","G","C"))
df1$gene_start=df1$SNP
df1$gene_stop=df1$SNP

## df2
df2=data.frame(gene=c("gyrA","parC"),gene_start=c(1,1290),gene_stop=c(1289,2890))

interval_left_join(df1, df2, by = c("gene_start", "gene_stop"))%>%
      select(SNP,Ref,Alt,gene)

   SNP Ref Alt gene
1   18   A   C gyrA
2   24   T   G gyrA
3 1489   G   C parC