I think this is a really good call.
If you look at how tidygraph is organised, there are ideas in there that could scale to the manipulation of other multi-table data structures.
The only real issues are that multiple tables are relevant (when plotting, manipulating, etc.), that none of these tables (assayData, phenoData and featureData) is tidy, and that naively combining them would make a single, giant, but tidy, data-frame.
That is, I could write you a function this morning that takes this:

```
assayData   = [G x N matrix]
phenoData   = [N x P data-frame]
featureData = [G x M data-frame]
(each component indexed by rownames and colnames)
```

into this:

```
eset_tibble = [(G*N) x (P + M + 3) tbl-dataframe]
```

like this: `eset_tibble <- as_data_frame(eset)`.

In addition to the columns in phenoData and featureData, the result has a `feature_id` column and a `sample_id` column (ie, the dimnames from the original ExpressionSet) and a column for the expression value of feature f in sample s. That would be unwieldy to work with, would have tons of duplicated information, and would not be usable in Bioconductor tools that expect an ExpressionSet. So what value would it have? It would be more compatible with ggplot2, dplyr and tidyr, and so you open up a lot of easy plotting and summarisation tools.
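For concreteness, such a converter could be sketched like this (a rough sketch, assuming Biobase's `exprs()`, `pData()` and `fData()` accessors; `eset_to_tibble` is a hypothetical name, and error handling is omitted):

```r
library(Biobase)

eset_to_tibble <- function(eset) {
  mat <- exprs(eset)  # the G x N assayData matrix
  # One row per (feature, sample) pair; as.vector() unrolls column by column
  long <- data.frame(
    feature_id       = rep(rownames(mat), times = ncol(mat)),
    sample_id        = rep(colnames(mat), each  = nrow(mat)),
    expression_level = as.vector(mat),
    stringsAsFactors = FALSE
  )
  # Duplicate the sample-level (N x P) and feature-level (G x M) annotations
  # onto every matching row, giving the (G*N) x (P + M + 3) result
  long <- merge(long, data.frame(sample_id = colnames(mat), pData(eset)),
                by = "sample_id")
  merge(long, data.frame(feature_id = rownames(mat), fData(eset)),
        by = "feature_id")
}
```

The merges are exactly where the blow-up happens: every sample-level value gets repeated G times and every feature-level value N times.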
That's not a good enough trade-off for me. But if the ExpressionSet is filtered / subset before converting it into a tidy data-frame, then the ggplot-able benefits start to outweigh the possibility of crashing R every time you attempt this.
Given a bit more time, `filter`, `select`, etc. could be written to work with ExpressionSets directly: I can see value in using tidyverse functions on Bioconductor objects even if the latter don't conform to tidy* standards.
Take `filter`, though. How would it work in this setting? If it's written to manipulate ExpressionSets, then it should return an ExpressionSet. But should it modify G (the number of features) or N (the number of samples)? In a typical use case I might want to extract the data for some genes and make a boxplot with respect to some phenotypic factor. In another, I might want to extract a subset of samples and compute a summary over all the features for those samples. In another, I might want to filter on both the samples and the features. So `filter` should be able to modify both G and N (and should modify all tables within an eset to respect the changes). Similarly, `select` should be able to modify both the columns of featureData and the columns of phenoData according to user requirements.
tidygraph deals with a similar issue. It stores graphs as both a nodes data-frame and an edges data-frame. In a workflow, you may want to manipulate both the nodes data-frame (add node-level statistics to it, say) and the edges data-frame (to add or remove edges, or something). To do that you use the same verbs as if you were working with a single data-frame in dplyr. But tidygraph adds the function `activate` so that you can tell it whether to manipulate the edges data-frame or the nodes data-frame.
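For reference, the tidygraph pattern looks something like this (`create_ring()` just builds a small example graph):

```r
library(tidygraph)
library(dplyr)

graph <- create_ring(5) %>%
  activate(nodes) %>%                      # subsequent verbs act on the nodes table
  mutate(degree = centrality_degree()) %>% # add a node-level statistic
  activate(edges) %>%                      # now verbs act on the edges table
  filter(from != 1)                        # drop edges leaving node 1
```

The same `mutate()` and `filter()` verbs are used throughout; `activate()` just switches which table they see.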
I can see how this approach would be pretty valuable for manipulating ExpressionSets. So the hypothetical user interface would look something like:

```r
my_eset %>%
  # Extract a sub-ExpressionSet for one gene - this should reduce the rows
  # in both assayData and featureData
  activate(features) %>%
  filter(feature_id == "ENSG0000001234567") %>%
  # Extract some independent variables for the samples - this should reduce
  # the columns in phenoData
  activate(samples) %>%
  select(my_important_factor, the_covariate_that_makes_it_meaningless) %>%
  # Convert the ExpressionSet to a tibble
  as_data_frame() %>%
  # Plot it out
  ggplot(aes(x = my_important_factor, y = expression_level)) +
    geom_boxplot() +
    facet_wrap(~ the_covariate_that_makes_it_meaningless)
```
The above should give the same results as the memory-killing (but again, hypothetical):

```r
my_eset %>%
  as_data_frame() %>%
  filter(feature_id == "ENSG0000001234567") %>%
  select(feature_id, sample_id, expression_level,
         my_important_factor, the_covariate_that_makes_it_meaningless) %>%
  ggplot(aes(x = my_important_factor, y = expression_level)) +
    geom_boxplot() +
    facet_wrap(~ the_covariate_that_makes_it_meaningless)
```
Benefits other than the fact that a LOT of packages are meant to deal with an ExpressionSet? The only real benefit beyond that is memory, since it's going to be (slightly) more memory efficient than the equivalent tidy representation (due to the `phenoData` being duplicated, though it's presumably a factor, so you're just duplicating an integer index).

As a side note, R (like Fortran and Matlab) uses column-major ordering, meaning that values in a column of a matrix are next to each other in memory. So it's not like all the values needed for a computation are conveniently laid out next to each other in memory.
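A quick illustration of the layout:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)
as.vector(m)  # 1 2 3 4 5 6: columns are stored contiguously
m[2, ]        # 2 4 6: reading a row skips through memory in strides
```

For a typical G x N expression matrix (features in rows), per-gene operations therefore touch non-contiguous memory, which is one reason these matrices often get transposed before such computations.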
I agree a lot of packages use ExpressionSet, but many don't. So you'll still need to convert to or from ExpressionSet eventually.
That's interesting about the column-major ordering. That may explain why many base R approaches are column-centric and all genomic matrices have to be transposed (`prcomp` is probably the most famous example). What kinds of manipulations do you find yourself doing to them?
I can somewhat echo your sentiment, but in my case most of the manipulations are because I want to make use of ggplot, which requires the long, skinny format rather than the matrix format that ExpressionSet has mastered.
I feel the tidyverse has trained people really well to embrace the long format, so people (including myself) tend to forget about many awesome, matrix-based functions (e.g., `rowMeans` and the like). To me tidyverse and ExpressionSet represent two different mindsets, and I agree that the tidyverse should probably come up with a solution for metadata storage, too. I am not sure if any specific manipulations are especially common. Some easy examples: `filter(gene_name %in% some_genes)` or `select(-contains("WT"))`. Yes, you can rewrite them in base R, but that feels more awkward to me. Obviously, your ggplot example is even more valid.

Yes, `rowMeans` is great. But if your data is already in the long format, you can pipe to `summarize` and then apply any function. Similarly, if your data is in the long format, you can add metadata in additional columns. It's arguably not very elegant, but it works.
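To make the equivalence concrete (a sketch; the object and column names are illustrative):

```r
library(dplyr)
library(tidyr)
library(tibble)

mat <- matrix(1:6, nrow = 2,
              dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))

# Matrix mindset: one call
rowMeans(mat)  # gene1 = 3, gene2 = 4

# Long-format mindset: reshape, group, summarise
as.data.frame(mat) %>%
  rownames_to_column("feature_id") %>%
  pivot_longer(-feature_id, names_to = "sample_id",
               values_to = "expression_level") %>%
  group_by(feature_id) %>%
  summarize(mean_expr = mean(expression_level))
```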
"if your data is in the long format, you can add metadata in additional columns. It's arguably not very elegant, but it works." One long-format tibble cannot include everything from an ExpressionSet object. One solution is to pull a table out of the ExpressionSet, do something like statistics or plotting, and then write any revised information back into the corresponding table of the ExpressionSet.
It's not just that storing the metadata in skinny format isn't elegant; it's also more space-consuming, as the others have alluded to. My approach tends to be similar to databases, using data.tables and so on: I keep a small meta-info data.table, similar to what would be stored in phenoData, and merge in the corresponding information when needed. That is nowhere near the neat structure of an ExpressionSet, though. For me, storing data in long format often feels like a waste of memory for those reasons, which is why I try to find the optimal solution for every calculation and generate the skinny data.tables only on the fly for piping into ggplot. That being said, I'm definitely guilty of ending up with way too many skinny data.tables at the end of a day.
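That merge-on-demand pattern might look like this (a sketch; the table and column names are made up):

```r
library(data.table)

expr <- data.table(feature_id = rep(c("g1", "g2"), each = 2),
                   sample_id  = rep(c("s1", "s2"), times = 2),
                   value      = c(1.2, 3.4, 0.7, 2.1))
meta <- data.table(sample_id = c("s1", "s2"),
                   group     = c("WT", "KO"))

# Keep metadata in its own small table; join it in only on the way
# into a plot or summary, instead of storing it long-term in long format
merged <- meta[expr, on = "sample_id"]
```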
Possibly relevant projects: