How to associate a row from one data frame with a value in another data frame based on three columns
5
0
Entering edit mode
7.1 years ago
baizm ▴ 10

I have two data frames, df1 one has a list of gene variants from a vcf file and df2 two has a list of predicted genes in a genome assembly. Each variant in df1 occurs within one of the predicted genes in df2. I want to associate each variant with the gene it occurs in.

Here is the first few lines of what they look like:

df1<-data.frame(contig=c(rep('contig_0', 6)),pos=c(899983,937283,951771,991102,1034215,1063818))
df2<-data.frame(pred_gene=c('g1','g2','g3','g4','g5','g6'),
    contig=c('contig_0','contig_0','contig_0','contig_0','contig_2','contig_2'),
    start=c(355079,446820,700794,887159,110971,156060),
    stop=c(355336,462604,707341,1236478,112320,284753))

What I want is to make a third column in df1 with the associated pred_gene from df2. In this case, each variant is in pred_gene g4. There are thousands of contigs represented in df2 and many predicted genes have more than one variant in them.

R • 2.4k views
ADD COMMENT
2
Entering edit mode

seeing all these diverse answers rolling in it'd be great if you could benchmark them on your large data set!

ADD REPLY
1
Entering edit mode

Please format your code in the future. Makes it a lot easier to read.

ADD REPLY
0
Entering edit mode

You say each variant in df1 occurs within one predicted gene, but the contig_0 and contig_2 have pred_gene g1-4 and g5-6, respectively.

ADD REPLY
1
Entering edit mode

Sorry, what? df1 only has contig_0. And you can see all of df1's entries match up to g4, as OP says. All df1$pos values fall between g4's start and stop

ADD REPLY
2
Entering edit mode
7.1 years ago

Here is the solution. They do indeed all map to G4 in this particular example.

require("GenomicRanges")
df1.GR <- makeGRangesFromDataFrame(df1, seqnames.field="contig", start.field="pos", end.field="pos", keep.extra.columns=TRUE)
df2.GR <- makeGRangesFromDataFrame(df2, seqnames.field="contig", start.field="start", end.field="stop", keep.extra.columns=TRUE)

overlaps <- findOverlaps( df1.GR, df2.GR, type="within" )
data.frame(df1[queryHits(overlaps),], df2[subjectHits(overlaps),])

    contig     pos pred_gene contig.1  start    stop
1 contig_0  899983        g4 contig_0 887159 1236478
2 contig_0  937283        g4 contig_0 887159 1236478
3 contig_0  951771        g4 contig_0 887159 1236478
4 contig_0  991102        g4 contig_0 887159 1236478
5 contig_0 1034215        g4 contig_0 887159 1236478
6 contig_0 1063818        g4 contig_0 887159 1236478
ADD COMMENT
1
Entering edit mode

Works like a charm, thank you!

ADD REPLY
0
Entering edit mode

Please try the other solutions, if you have time, and then upvote and/or accept. This way people coming here in the future will instantly know which solutions worked.

Thanks!

Kevin

ADD REPLY
0
Entering edit mode

Is this faster than the sapply I've used below? It's definitely more bioinformatics-y, but what are the pros and cons of our approaches?

ADD REPLY
2
Entering edit mode

Possibly just as quick. It's good to have different solutions. I answer a lot of these types of questions just to keep my programming skills updated

ADD REPLY
0
Entering edit mode

It's doing the parentheses thing again!

findOperlaps(df1.. won't be rendered as

findOverlaps(df1,...
ADD REPLY
0
Entering edit mode

Good spot, Ram! I have edited it. Strange quirk in the formatting

ADD REPLY
1
Entering edit mode
7.1 years ago
Ram 44k

You need apply or one of its variants. I'm using sapply here

df1$gene=sapply(X=df1$pos, FUN=function(x) { df2[df2$start<x & df2$stop>x,"pred_gene"]})

Also, one of my options is set as follows (globally):

options(stringsAsFactors=FALSE)

I'm mentioning that in case you encounter problems - you might need to recreate data frame with that option enabled.

ADD COMMENT
1
Entering edit mode
7.1 years ago
>library(sqldf)    
> sqldf("select * from df1 left join df2 on  df1.pos between df2.start and df2.stop")
        contig     pos pred_gene   contig  start    stop
    1 contig_0  899983        g4 contig_0 887159 1236478
    2 contig_0  937283        g4 contig_0 887159 1236478
    3 contig_0  951771        g4 contig_0 887159 1236478
    4 contig_0  991102        g4 contig_0 887159 1236478
    5 contig_0 1034215        g4 contig_0 887159 1236478
    6 contig_0 1063818        g4 contig_0 887159 1236478
ADD COMMENT
1
Entering edit mode

ZOMG there's something that lets me do SQL with data frames?!?!?! R is insanity, I tell you!

ADD REPLY
1
Entering edit mode
7.1 years ago

just to add to the diversity of answers, here's a solution using the data.table package, which tends to have blazingly fast operations.

library(data.table)
dt1 <- data.table(contig = c(rep('contig_0', 6)),
                             start = c(899983,937283,951771,991102,1034215,1063818),
                            stop = c(899983,937283,951771,991102,1034215,1063818))
dt2 <- data.table(pred_gene = c('g1','g2','g3','g4','g5','g6'),
                           contig = c('contig_0','contig_0','contig_0','contig_0','contig_2','contig_2'),
                           start = c(355079,446820,700794,887159,110971,156060),
                            stop = c(355336,462604,707341,1236478,112320,284753))
setkey(dt2, start, stop) # this assumes that dt2 has fewer, but larger ranges than dt1
foverlaps(dt1, dt2)

result:

   pred_gene   contig  start    stop i.contig i.start  i.stop
1:        g4 contig_0 887159 1236478 contig_0  899983  899983
2:        g4 contig_0 887159 1236478 contig_0  937283  937283
3:        g4 contig_0 887159 1236478 contig_0  951771  951771
4:        g4 contig_0 887159 1236478 contig_0  991102  991102
5:        g4 contig_0 887159 1236478 contig_0 1034215 1034215
6:        g4 contig_0 887159 1236478 contig_0 1063818 1063818
ADD COMMENT
0
Entering edit mode

I'm assuming that most of the solutions will follow the one-based tradition of R, but I'm not sure. You may want to check that.

ADD REPLY
1
Entering edit mode
7.1 years ago

Another solution:

> library(fuzzyjoin)
> fuzzy_join(df1, df2, by = c("pos" = "start", "pos" = "stop"), match_fun = list(`>=`, `<=`))

output:

> fuzzy_join(df1, df2, by = c("pos" = "start", "pos" = "stop"), match_fun = list(`>=`, `<=`))
  contig.x     pos pred_gene contig.y  start    stop
1 contig_0  899983        g4 contig_0 887159 1236478
2 contig_0  937283        g4 contig_0 887159 1236478
3 contig_0  951771        g4 contig_0 887159 1236478
4 contig_0  991102        g4 contig_0 887159 1236478
5 contig_0 1034215        g4 contig_0 887159 1236478
6 contig_0 1063818        g4 contig_0 887159 1236478
ADD COMMENT

Login before adding your answer.

Traffic: 2491 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6