Merge data rich in NA
Ram 44k • 7.9 years ago

Hello everyone!

Here are the disclaimers first:

  • I am not sure this is a pure bioinformatics question, but then I've been working on this for 2 weeks now and I'm a bioinformatician so I guess it qualifies
  • It's going to be a (kinda) long question, a typical Ram style rant, so strap in.

I'm trying to integrate data on biological samples from ~35 tabular data files. These could be TXT, CSV, or XLS* files. I extracted the content of each file into its own data.frame and then set about merging these data frames to get one huge data frame.

All data frames have an ID column, and that's the column I'd use if I were SQL-JOINing them. However, they can also share other columns. So, to give an extremely simplified example, consider these two files:

File1.xlsx

id    col1    col2    col3
1     A1      B1      C1
2     A2      B2      <NA>
3     <NA>    B3      C3
3     <NA>    <NA>    C4

File2.txt

id    col1    col4    col3
1     A1      D1      C1
2     A2      D2      C2
3     <NA>    D3      <NA>

My desired output is:

id    col1    col2    col3    col4
1     A1      B1      C1      D1
2     A2      B2      C2      D2
3     <NA>    B3      C3      D3
3     <NA>    <NA>    C4      D3

Essentially, for each column, among matching rows, pick the non-NA value if only one such value exists. If multiple non-NA values exist, create a new row for each.
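
For concreteness, here are the two toy tables as R data frames (a sketch; the values are taken from the example above, id is numeric and the other columns are plain character vectors):

df1 <- data.frame(id   = c(1, 2, 3, 3),
                  col1 = c("A1", "A2", NA, NA),
                  col2 = c("B1", "B2", "B3", NA),
                  col3 = c("C1", NA, "C3", "C4"),
                  stringsAsFactors = FALSE)

df2 <- data.frame(id   = c(1, 2, 3),
                  col1 = c("A1", "A2", NA),
                  col4 = c("D1", "D2", "D3"),
                  col3 = c("C1", "C2", NA),
                  stringsAsFactors = FALSE)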

I'm using merge with by=intersect(colnames(df1), colnames(df2)), all=T. Is there any way I can use a custom match function in merge? Am I misunderstanding something about merge and being overly cautious with the by=intersect? Should I abandon R and tackle this with a different programming language?
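
To make that concrete, the call being described looks roughly like this on the toy frames above (a sketch of the approach, not the actual code):

shared <- intersect(colnames(df1), colnames(df2))   # c("id", "col1", "col3")
merged <- merge(df1, df2, by = shared, all = TRUE)
# merge() only pairs rows whose shared columns agree, and an NA on one side
# never matches a concrete value on the other, so the id-3 rows of the two
# files come through as separate, partially filled rows instead of combining.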

What do you recommend?
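
One dplyr-based direction to compare against (a sketch using the toy frames above; it happens to reproduce the desired output for this example, but coalesce() silently prefers the left-hand value when both sides are non-NA, so it may not cover every conflict):

library(dplyr)

merged <- full_join(df1, df2, by = "id") %>%   # shared columns become col1.x/col1.y, col3.x/col3.y
  mutate(col1 = coalesce(col1.x, col1.y),
         col3 = coalesce(col3.x, col3.y)) %>%
  select(id, col1, col2, col3, col4)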

Tags: r python data

Answer • 7.9 years ago

Should I abandon R

Yes. All of us should.

In all seriousness, I would tackle this with a python script. Perhaps because I tackle everything with a python script, but that's not relevant right now :) Probably good R solutions exist but I usually can't wrap my head around stuff like this in R. This sounds like quite a mess too.

In Python I would create a dictionary keyed by sample identifier, where each value is itself a dictionary keyed by column name, and each of those values is a list of the non-NA values collected while looping over the files one by one. After getting the data out of all the files, I would iterate over the dictionary and write the output line(s).

Pseudocode:

samplesdict = {}
for filename in allmyfiles:
    with open(filename) as infile:
        header = infile.readline().rstrip('\n').split('\t')
        for line in infile:
            linelist = line.rstrip('\n').split('\t')
            sample = linelist[0]  # assuming the identifier is always the first field
            if sample not in samplesdict:
                samplesdict[sample] = {}  # create an entry for this sample
            # add the non-NA values to samplesdict[sample]['col1'], ['col2'], ...
            for col, value in zip(header[1:], linelist[1:]):
                if value not in ('', 'NA', '<NA>'):
                    samplesdict[sample].setdefault(col, []).append(value)

# get which columns are available across all samples
allfields = []
for sample in samplesdict:
    allfields.extend(samplesdict[sample].keys())

# after getting all information out of the files
for sample in samplesdict:
    if max(len(values) for values in samplesdict[sample].values()) > 1:
        pass  # this sample has a column with several values: write one line per combination
    # otherwise write a single line, filling missing columns with NA

Ouch.

Comment:

Rant: Thanks for the primer. I invested a TON of time learning R and I hate Python because of its white space thing - why would I let a programming language dictate my white space usage? I'll probably end up using it anyway or maybe go to Perl - who knows!

Comment:

If you feel more comfortable putting curly braces around everything you do that's fine :-) And yes, Perl is great if you want your code to look the same before and after RSA encryption.

Comment:

I grew up around C-based languages. Python's whitespacing makes my line wrapping techniques go haywire. Plus it claims to be object oriented while it's a very filthy implementation of OO, much like (albeit better than) Perl. One too many mismatching quirks for me :)

Off topic rant: Best object orientation I've ever seen is C#, Java comes really really close, except for when it allows static members to be accessed by object references. Static members belong to the class, dammit, not to an object! Although, Java's extends BASE_CLASS implements INTERFACE1,INTERFACE2 is definitely more detailed than C#'s class Derived : Base, Isomething1,Isomething2

Comment:

I'll use Python to read in the files with their headers into a Python (pandas?) data frame - does that sound OK? Would I be a terrible person if I ran a "cursor" through a Python (pandas?) data frame? In other words, which would make me more acceptable to society - iterating through a python structure that holds a collection of the tabular data manually or iterating through an R DF manually?

Comment:

I'm not sure if a python/pandas data frame would make things easier. I'm kinda fond of nested dictionaries for stuff like this, but that's probably also frowned upon by the average man on the street.

Answer • 7.9 years ago • Liun ▴ 30

library(dplyr)

# merge two data frames that share columns, filling the NAs of one join result
# with values from the other; unicolname is the name of the shared ID column
mergeDIY<-function(df1,df2,unicolname){
  ldf<-left_join(df1,df2,by=intersect(names(df1),names(df2)))
  rdf<-right_join(df1,df2,by=intersect(names(df1),names(df2)))
  # replace the NAs in v1 with the corresponding non-NA values from v2
  mergevector<-function(v1,v2){
    naindex<-which(is.na(v1))
    if(length(naindex)==0) return(v1)
    for(i in naindex){
      if(!is.na(v2[i])){
        v1[i]=v2[i]
      }
    }
    return(v1)
  }
  # work row by row on whichever join produced more rows, using the matching
  # row of the other join to fill in the missing values
  if(dim(ldf)[1]>=dim(rdf)[1]){
    for(i in 1:dim(ldf)[1]){
      univalue<-ldf[i,unicolname]
      ldf[i,]<-mergevector(ldf[i,],rdf[which(rdf[,unicolname]==univalue),])
    }
    return(ldf)
  }else{
    for(i in 1:dim(rdf)[1]){
      univalue<-rdf[i,unicolname]
      rdf[i,]<-mergevector(rdf[i,],ldf[which(ldf[,unicolname]==univalue),])
    }
    return(rdf)
  }
}

Comment:

There may be mistakes in the code if you use it on other input formats, since I wrote it myself, and I am sorry about that.

The function uses the R package dplyr; I agree with Mr Wang's opinion, so I used that package.

The function needs two matrices or data frames as input. Both must have a unique column ('id' in your example), and at most one of them may have duplicated values in that unique column, or the function will break down.

If the function runs into other strange errors, please write to me.
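
A quick way to check that precondition before calling the function (a sketch; df1, df2 and the "id" column are the toy frames from the question):

# mergeDIY() assumes at most one of the two inputs has duplicated IDs
if (any(duplicated(df1$id)) && any(duplicated(df2$id))) {
  stop("both inputs have duplicated values in 'id'; mergeDIY() will not handle this")
}
merged <- mergeDIY(df1, df2, "id")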

Comment:

Thank you - this looks great. I will work on adapting this for my purposes. There are duplicates in the supposed-to-be-unique columns in both data frames, so I may have to work around that somehow - I will figure it out.

Comment:

> df1
  id col1 col2 col3
1  1   A1   B1   C1
2  2   A2   B2 <NA>
3  3 <NA>   B3   C3
4  3 <NA> <NA>   C4
> df2
  id col1 col4 col3
1  1   A1   D1   C1
2  2   A2   D2   C2
3  3 <NA>   D3 <NA>
> mergeDIY(df1,df2,"id")
  id col1 col2 col3 col4
1  1   A1   B1   C1   D1
2  2   A2   B2   C2   D2
3  3 <NA>   B3   C3   D3
4  3 <NA> <NA>   C4   D3

Answer • 7.9 years ago • zjhzwang ▴ 180

library("dplyr")
library("stringr")
file1_df<-tbl_df(read.table("PATH1",header=T,stringsAsFactors = F,na.strings = "<NA>"))
file2_df<-tbl_df(read.table("PATH2",header=T,stringsAsFactors = F,na.strings = "<NA>"))
full_mat<-as.matrix(full_join(file1_df,file2_df,by=c("id","col1","col3")))
rm_na<-function(vec){
  ifallis.na(vec)==T)==T){
    return(vec[1])
  }else {
    return(vec[is.na(vec)==F])
  }
}
new_merge<-function(id){
  find_mat<-full_mat[grep(id,full_mat[,1]),-1]
  if(class(find_mat)=="matrix"){
    temp_df<-data.frame(id,apply(find_mat,2,rm_na))
    temp_str<-matrix(apply(as.matrix(temp_df),1,paste,collapse="\t"),ncol=1)
    return(unique(temp_str))
  }else{
    temp_df<-data.frame(id,t(find_mat))
    temp_str<-matrix(apply(as.matrix(temp_df),1,paste,collapse="\t"),ncol=1)
    return(temp_str)
  }
}
id_list<-matrix(unique(full_mat[,1]),ncol=1)
result<-apply(id_list,1,new_merge)
final_result<-t(apply(matrix(unlist(result),ncol=1),1,function(x){return(str_split(x,"\t")[[1]])}))
colnames(final_result)<-colnames(full_mat)

And the result is:

> final_result
     id  col1 col2 col3 col4
[1,] "1" "A1" "B1" "C1" "D1"
[2,] "2" "A2" "B2" "C2" "D2"
[3,] "3" "NA" "B3" "C3" "D3"
[4,] "3" "NA" "B3" "C4" "D3"

Comment:

Thank you. I've been using dplyr where it's appropriate, but did not mention it here because it leads to the same problem I'm having right now, which is that it gives me the data frame you show above instead of the data frame I'm looking for.

Pro-tip: The "Your requirement is wrong" response on these kinds of posts doesn't make sense. I know what I want, so this is a case of you misunderstanding my requirement, not me being "wrong".

Comment:

Sorry, maybe I misunderstood your meaning. I will check my answer again.

Comment:

zjhzwang edited the comment above. I'm adding this line for anyone confused about the conversation flow. The comment above used to read:

There is no such a line like this : 3 <na> B3 C3 D3


My response to it:

There is no such row in your data frame (and my current one), good sir (/madam). There is definitely one such row in my requirement.

I'm saying "I have Y, I want X". You're saying "No, you want X". Either you're incredibly foresightful (which can definitely be the case, for a lot of people including myself are often blinded by immediate details), or there's a communication gap.

Comment:

I have updated my answer, and if my words made you uncomfortable, I apologize.

Comment:

Not at all - I like having my basic assumptions questioned. If that was your intention, you have my undivided attention :)
