Hi, I have a dataframe df1 where the first column SNP shows the position of a SNP in a reference genome. I have another dataframe df2 which lists the name of the genes in the reference genome, and the start and end positions of each gene. See examples below.
df1:
SNP,Ref,Alt
18,A,C
24,T,G
1489,G,C
df2:
gene,gene_start,gene_stop
gyrA,1,1289
parC,1290,2890
I want to compare the dataframes to find out which gene a SNP is occurring in, by seeing if the number in the SNP column is within the range (gene_start to gene_stop) of a gene. Desired output:
snp,ref,alt,gene
18,A,C,gyrA
24,T,G,gyrA
1489,G,C,parC
So far, I have managed to do this with a slow loop:
append_df <- setNames(data.frame(matrix(ncol = 4, nrow = 0)), c("snp","ref","alt","gene"))
for (snp_row in 1:nrow(df1)){
SNP <- df1[snp_row, "SNP"]
for (row in 1:nrow(df2)) {
gene <- df2[row, "gene"]
gene_start <- df2[row, "gene_start"]
gene_stop <- df2[row, "gene_stop"]
if(SNP >= gene_start & SNP <= gene_stop) {
df <- data.frame(SNP,gene,gene_start,gene_stop)
append_df <- rbind(append_df, df)
break
}
}
}
Can anyone suggest a better (prettier) and ideally faster way to achieve this? Thanks!
Are you sure you don't want to use a proper variant annotation tools like VEP? If your genes have introns you may want to know whether a SNP hits an intron or an exon and what the expected consequence is going to be...?
Look into
GenomicRanges
andfindOverlaps()
(Bioconductor).