Hello everyone!
Here are the disclaimers first:
- I am not sure this is a pure bioinformatics question, but then I've been working on this for 2 weeks now and I'm a bioinformatician so I guess it qualifies
- It's going to be a (kinda) long question, a typical Ram style rant, so strap in.
I'm trying to integrate data on biological samples from ~35 tabular data files. These could be TXT, CSV or XLS* files. I extracted the content from each file to its own data.frame and then set out merging these data frames to get one huge data frame.
All data frames have an ID column, and that's the column I'd use if I were SQL-JOIN
ing them. However, they can also share other columns. So, to give an extremely simplified example, consider these two files:
File1.xlsx
id col1 col2 col3
1 A1 B1 C1
2 A2 B2 <NA>
3 <NA> B3 C3
3 <NA> <NA> C4
File2.txt
id col1 col4 col3
1 A1 D1 C1
2 A2 D2 C2
3 <NA> D3 <NA>
My desired output is:
id col1 col2 col3 col4
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 <NA> B3 C3 D3
3 <NA> <NA> C4 D3
Essentially, for each column, among matching rows, pick the non-NA
value if only one such value exists. If multiple non-NA
values exist, create a new row for each.
I'm using a merge function with by=intersect(colnames(df1),colnames(df2)),all=T
. Is there any way I can use a custom match
function in the merge
function? Am I misunderstanding something about merge
and being overly cautious with the by=intersect
? Should I abandon R and try and tackle this with a programming language?
What do you recommend?
Rant: Thanks for the primer. I invested a TON of time learning R and I hate Python because of its white space thing - why would I let a programming language dictate my white space usage? I'll probably end up using it anyway or maybe go to Perl - who knows!
If you feel more comfortable putting curly braces around everything you do that's fine :-) And yes, Perl is great if you want your code to look the same before and after RSA encryption.
I grew up around C-based languages. Python's whitespacing makes my line wrapping techniques go haywire. Plus it claims to be object oriented while it's a very filthy implementation of OO, much like (albeit better than) Perl. One too many mismatching quirks for me :)
Off topic rant: Best object orientation I've ever seen is C#, Java comes really really close, except for when it allows static members to be accessed by object references. Static members belong to the class, dammit, not to an object! Although, Java's
extends BASE_CLASS implements INTERFACE1,INTERFACE2
is definitely more detailed than C#'sclass Derived : Base, Isomething1,Isomething2
I'll use Python to read in the files with their headers into a Python (pandas?) data frame - does that sound OK? Would I be a terrible person if I ran a "cursor" through a Python (pandas?) data frame? In other words, which would make me more acceptable to society - iterating through a python structure that holds a collection of the tabular data manually or iterating through an R DF manually?
I'm not sure if a python/pandas data frame would make things easier. I'm kinda fond of nested dictionaries for stuff like this, but that's probably also frowned upon by the average man on the street.