Disclaimer: I've posted this on Stats.SE as well; I'm posting it here since it hasn't gotten much attention there and it might actually be more relevant here.
I have been reading up on the tidy-data concept and the tidyr package, as well as some tutorials. I started thinking about how I would apply those principles to my real-world data and got a bit stuck.
As things stand I have a tabular file, containing measurements of peptides over a number of samples, that is approximately 20,000 x 66 (this is essentially a trimmed version already). The structure is as follows:
Observations: 19,576
Variables: 66
$ Cluster ID (chr) "1478878045944", "1478878294868", "1478878406996", "147887...
$ Peptide Sequence (chr) "AAAAKPNNLSLVVHGPGDLR 42.0105647@A1", "AAAANLCPGDVILAIDGFG...
$ External IDs (chr) "NX_Q00796-2,NX_Q00796-1", "NX_Q53GG5-2,NX_Q53GG5-1,NX_Q53...
$ Charge (int) 3, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
$ Average RT (dbl) 65.61118, 99.66750, 105.79376, 43.88064, 49.66660, 38.1385...
$ Average m/z (dbl) 681.3753, 1068.4971, 1063.1664, 745.8591, 832.4116, 503.27...
$ N33 (dbl) 10941400, NA, NA, 21923600, NA, 8519010, 6439150, 8220830,...
$ T38 (dbl) 10589700, NA, NA, 26150600, NA, 16326000, 10410400, 170873...
...
As you can see, the first 6 columns are metadata about the peptides, while the following 60 columns contain the values across the samples (i.e. individuals).
My understanding is that this structure isn't tidy, since each sample should be an observation and each peptide should be a variable. But it's not as simple as transposing the data frame (or data.table), since the first 6 columns need to be modified as well.
If I understand this right, in order to be "tidy" each row should be an observation (a sample), 60 in total, with some 19,000+ columns, one per variable (peptide), plus some other metadata, e.g. sample type (T/N).
Wouldn't this produce an extremely wide table, making it very difficult to view the data? Also, I don't think I'll even need to index the table by Peptide Sequence, for example.
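In code, I imagine it would be something along these lines (just a sketch; `peptides` stands for the data frame shown in the glimpse above, and I've used Cluster ID as the peptide key since it looks unique):

library(tidyr)
library(dplyr)

# Gather the 60 sample columns into key/value pairs, then spread the peptides
# back out as columns, so each row is one sample
long <- peptides %>%
  pivot_longer(cols = -c(`Cluster ID`, `Peptide Sequence`, `External IDs`,
                         Charge, `Average RT`, `Average m/z`),
               names_to = "Sample", values_to = "Intensity")

wide_by_peptide <- long %>%
  select(Sample, `Cluster ID`, Intensity) %>%
  pivot_wider(names_from = `Cluster ID`, values_from = Intensity)
# Result: 60 rows (samples) x ~19,577 columns (Sample plus one column per peptide)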
Have I misunderstood something? How should I tidy or organize my data before going into the analysis stage? The goal of the analysis is to join this table with another table that contains clinical data, including survival times, and then try to find variables that are important for survival analysis.
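And the join I'm aiming for afterwards would be roughly this (the `clinical` table and its column names are invented here purely for illustration; `wide_by_peptide` is from the sketch above):

library(dplyr)

# Invented clinical table: one row per sample/individual
clinical <- data.frame(Sample       = c("N33", "T38"),
                       SurvivalTime = c(1200, 340),
                       Event        = c(0, 1),
                       stringsAsFactors = FALSE)

# One row per sample: peptide intensities plus clinical variables
analysis_data <- inner_join(wide_by_peptide, clinical, by = "Sample")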
Hehe, it looks like "tidy data" is basically database normalization for scientists :P I would stick to learning it properly with an ebook/youtube video that goes through each normal form one by one. That's a more logical way to learn it.
Regarding your specific issues: you seem to have column and row the wrong way around. Normalization is all about making your tables 'long', as in narrow with many rows, since rows are very cheap to index compared to columns. So where you might have a wide table with one row per mouse (say, Jeff) and one column per body-part measurement, normalised you have a long table with one row per mouse/body-part pair, so Jeff shows up on three rows, one for each measurement.
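To make that concrete, here's a made-up example in R (the body parts and numbers are invented):

# Un-normalised / wide: one row per mouse, one column per body part
wide <- data.frame(mouse       = "Jeff",
                   leg_weight  = 1.2,
                   arm_weight  = 0.8,
                   tail_weight = 0.5)

# Normalised / long: one row per mouse + body-part pair
long <- data.frame(mouse     = c("Jeff", "Jeff", "Jeff"),
                   body_part = c("leg", "arm", "tail"),
                   weight    = c(1.2, 0.8, 0.5))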
In later normal forms, because Jeff appears more than once (making it a "variable" in tidy-data terminology), you'd have a second table called 'mouse names', of which Jeff would be row 1. The table above would then have 3 links to row 1 in the mouse names table, rather than three "Jeff"s. This way you can change Jeff's name to Jeffery in the mouse names table, and it will take effect on all rows of the table above (where Jeff was used) instantly. It is also fundamentally different, because previously we had 3 rows that all happened to have the same value, Jeff, whereas now they all hold a pointer to the same value - whatever that value may be. This lets the index do some optimizations, and it reduces redundancy/entropy, which is always a good thing. The same could be done with the mouse body_parts, since those will occur more than once across other mice.
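In R terms, that later step might look something like this (again with invented values):

# Lookup table of mouse names: each name is stored once, with a key
mouse_names <- data.frame(mouse_id = 1, name = "Jeff",
                          stringsAsFactors = FALSE)

# Measurements point at the key instead of repeating the name three times
measurements <- data.frame(mouse_id  = c(1, 1, 1),
                           body_part = c("leg", "arm", "tail"),
                           weight    = c(1.2, 0.8, 0.5))

# Rename Jeff to Jeffery in one place; every measurement picks it up on re-join
mouse_names$name[mouse_names$name == "Jeff"] <- "Jeffery"
merge(measurements, mouse_names, by = "mouse_id")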
Sometimes, however, it can be difficult to know when to normalize and when not to. Your NCBI names, for example, if they are unique, could be considered values. However, having variable-length data is always a bad thing. Far better to have a separate table with one row per Cluster ID / External ID pair than a single column holding comma-separated lists of varying length.
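For your External IDs column, for instance, a rough sketch in R of that split (assuming the peptide table is loaded as `peptides`):

library(tidyr)
library(dplyr)

# One row per Cluster ID / External ID pair, instead of one
# comma-separated string of varying length per peptide
external_ids <- peptides %>%
  select(`Cluster ID`, `External IDs`) %>%
  separate_rows(`External IDs`, sep = ",")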
However, when I say far better, I mean far better for a database administrator. Telling scientists to normalize their experimental data is like telling an airline pilot to cut his carbon footprint by recycling plastic. The best way to get a pilot to reduce their carbon footprint is to make a more efficient plane. The best way to get scientists to have "tidy data" is to write programs that don't crap out garbage spaghetti formats. You shouldn't have to learn this to use data :/
Thanks for the reply. Although not in detail, I was already familiar with database normalization. The issue here is that I'm not really dealing with a database, but rather reading experimental data into R and working with it there. I get the gist of what you are suggesting, but I think dividing the data into multiple data frames is just going to make things complicated for me, since, unlike SQL servers, R isn't really optimized to do joins etc., AFAIK.
Right - because data normalisation isn't particularly important for data scientists :) It's not an R thing, it's just a different objective. One is about curating a "living" dataset and keeping it consistent, while the other is just manipulation of some generally static data for a specific goal. Most of what tidy data is, is data normalization. I think I even remember it mentioning 3rd Normal Form at one point.
In informatics, there isn't an ideal data structure. The same data can legitimately be put into many different shapes, and each will make sense in some context. The most obvious trade-off is that some data structures are better for compact storage, while others are better for rapid access. Some are human readable, some not. Some are sorted by position, some by unique ID, etc. There really isn't a one-size-fits-all.