I have two data frames: df1 has a list of gene variants from a VCF file, and df2 has a list of predicted genes in a genome assembly. Each variant in df1 occurs within one of the predicted genes in df2. I want to associate each variant with the gene it occurs in.
Here are the first few lines of each:
df1 <- data.frame(contig = rep('contig_0', 6),
                  pos = c(899983, 937283, 951771, 991102, 1034215, 1063818))
df2 <- data.frame(pred_gene = c('g1', 'g2', 'g3', 'g4', 'g5', 'g6'),
                  contig = c('contig_0', 'contig_0', 'contig_0', 'contig_0', 'contig_2', 'contig_2'),
                  start = c(355079, 446820, 700794, 887159, 110971, 156060),
                  stop = c(355336, 462604, 707341, 1236478, 112320, 284753))
What I want is to make a third column in df1 with the associated pred_gene from df2. In this case, each variant is in pred_gene g4. There are thousands of contigs represented in df2, and many predicted genes have more than one variant in them.
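To make the desired result concrete, here is a minimal base R sketch of one possible approach: join on contig, then keep the rows where pos falls between start and stop. It assumes inclusive bounds and non-overlapping genes. The names candidates, hits and result are made up for illustration. A variant that falls in no gene would be dropped, and a variant inside two overlapping genes would appear twice.

```r
# Example data from the question
df1 <- data.frame(contig = rep("contig_0", 6),
                  pos = c(899983, 937283, 951771, 991102, 1034215, 1063818))
df2 <- data.frame(pred_gene = c("g1", "g2", "g3", "g4", "g5", "g6"),
                  contig = c("contig_0", "contig_0", "contig_0", "contig_0",
                             "contig_2", "contig_2"),
                  start = c(355079, 446820, 700794, 887159, 110971, 156060),
                  stop = c(355336, 462604, 707341, 1236478, 112320, 284753))

# Pair every variant with every predicted gene on the same contig
candidates <- merge(df1, df2, by = "contig")

# Keep only the pairs where the variant lies inside the gene
# (assumption: start and stop are both inclusive)
inside <- candidates$pos >= candidates$start & candidates$pos <= candidates$stop
hits <- candidates[inside, c("contig", "pos", "pred_gene")]

# Sort by contig and position, and reset the row names
result <- hits[order(hits$contig, hits$pos), ]
rownames(result) <- NULL
print(result)
```

On the sample data this should assign g4 to all six variants. The intermediate merge holds every variant-gene pair per contig, so on thousands of contigs it may get large; an interval-based join such as data.table::foverlaps or GenomicRanges::findOverlaps is probably a better fit at that scale.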
Seeing all these diverse answers rolling in, it'd be great if you could benchmark them on your large data set!
Please format your code in the future. Makes it a lot easier to read.
You say each variant in df1 occurs within one predicted gene, but contig_0 and contig_2 have pred_gene g1-4 and g5-6, respectively.
Sorry, what? df1 only has contig_0. And you can see all of df1's entries match up to g4, as OP says. All df1$pos values fall between g4's start and stop.
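A quick way to check that claim against the sample data, as a small sketch with the values copied from the question (the variable names are just for illustration, and the bounds are assumed inclusive):

```r
# df1$pos from the question
pos <- c(899983, 937283, 951771, 991102, 1034215, 1063818)

# start and stop of g4 from df2
g4_start <- 887159
g4_stop <- 1236478

# TRUE for each position that lies within g4 (inclusive bounds assumed)
inside_g4 <- pos >= g4_start & pos <= g4_stop
print(all(inside_g4))  # prints TRUE
```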