R Find which gene a SNP position is in
1
0
Entering edit mode
2.1 years ago

Hi, I have a dataframe df1 where the first column SNP shows the position of a SNP in a reference genome. I have another dataframe df2 which lists the name of the genes in the reference genome, and the start and end positions of each gene. See examples below.

df1:

SNP,Ref,Alt
18,A,C
24,T,G
1489,G,C

df2:

gene,gene_start,gene_stop
gyrA,1,1289
parC,1290,2890

I want to compare the dataframes to find out which gene a SNP is occurring in, by seeing if the number in the SNP column is within the range (gene_start to gene_stop) of a gene. Desired output:

snp,ref,alt,gene
18,A,C,gyrA
24,T,G,gyrA
1489,G,C,parC

So far, I have managed to do this with a slow loop:

append_df <- setNames(data.frame(matrix(ncol = 4, nrow = 0)), c("snp","ref","alt","gene"))
for (snp_row in 1:nrow(df1)){
      SNP <- df1[snp_row, "SNP"]
      for (row in 1:nrow(df2)) {
        gene <- df2[row, "gene"]
        gene_start <- df2[row, "gene_start"]
        gene_stop <- df2[row, "gene_stop"]
        if(SNP >= gene_start & SNP <= gene_stop) {
          df <- data.frame(SNP,gene,gene_start,gene_stop)
          append_df <- rbind(append_df, df) 
          break 
        }
      }
    }

Can anyone suggest a better (prettier) and ideally faster way to achieve this? Thanks!

SNP R gene • 1.1k views
ADD COMMENT
1
Entering edit mode

Are you sure you don't want to use a proper variant annotation tools like VEP? If your genes have introns you may want to know whether a SNP hits an intron or an exon and what the expected consequence is going to be...?

ADD REPLY
0
Entering edit mode

Look into GenomicRanges and findOverlaps() (Bioconductor).

ADD REPLY
0
Entering edit mode
2.1 years ago
Basti ★ 2.0k

Using fuzzyjoinpackage :

library(fuzzyjoin)
library(dplyr)

## create column gene_start and gene_stop for df1

df1=data.frame(SNP=c(18,24,1489),Ref=c("A","T","G"),Alt=c("C","G","C"))
df1$gene_start=df1$SNP
df1$gene_stop=df1$SNP

## df2
df2=data.frame(gene=c("gyrA","parC"),gene_start=c(1,1290),gene_stop=c(1289,2890))

interval_left_join(df1, df2, by = c("gene_start", "gene_stop"))%>%
      select(SNP,Ref,Alt,gene)

   SNP Ref Alt gene
1   18   A   C gyrA
2   24   T   G gyrA
3 1489   G   C parC
ADD COMMENT

Login before adding your answer.

Traffic: 1873 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6