Forum:ExpressionSet in the age of tidyverse
1
12
Entering edit mode
6.6 years ago
igor 13k

The ExpressionSet class is supposed to be a standardized data structures to represent genomic data. It was introduced many years ago and was an improvement over the standard R data frames and base R data manipulation methods. However, is it still as big of an improvement today? Ever since I discovered dplyr and the rest of tidyverse, I have been much happier with my R experience. If I encounter an ExpressionSet object, I often end up having to extract the relevant parts, manipulating them externally, and then importing them back in. That workflow seems convoluted and feels like completely negating the whole purpose of the ExpressionSet in the first place. Am I doing it wrong? Are there benefits of ExpressionSet that I am not thinking of?

expressionset R bioconductor • 3.8k views
ADD COMMENT
3
Entering edit mode

Benefits other than the fact that a LOT of packages are meant to deal with an ExpressionSet? The only real benefit other than that is memory, since it's going to be (slightly) more memory efficient than the equivalent tidy representation (due to the phenoData being duplicated, though it's presumably a factor so you're just duplicating an integer index).

As a side note, R (like Fortran and matlab) uses column-major ordering, meaning that values in a column of a matrix are next to each other in memory. So it's not like all the values needed for a computation are conveniently layed out next to each other in memory.

ADD REPLY
1
Entering edit mode

I agree a lot of packages use ExpressionSet, but many don't. So you still will need to convert to or from ExpressionSet eventually.

That's interesting about the column-major ordering. That may explain why many base R approaches are column-centric and all genomic matrices have to be transposed (prcomp is probably the most famous example).

ADD REPLY
2
Entering edit mode

what kinds of manipulations do you find yourself doing to them?

I can somewhat echo your sentiment, but in my case most of the manipulations are due because I want to make use of ggplot, which requires the long, skinny format rather than the matrix format which ExpressionSet has mastered.

I feel the tidyverse has trained people really well to embrace the long format so people (including myself) tend to forget about many awesome, matrix-based functions (e.g., rowMeans and the like). To me tidyverse and ExpressionSet represent two different mindsets, and I agree that the tidyverse should probably come up with a solution for metadata storage, too.

ADD REPLY
0
Entering edit mode

I am not sure if any specific manipulations are especially common. Some easy examples: filter(gene_name %in% some_genes) or select(-contains("WT")). Yes, you can rewrite them in base R format, but that feels more awkward to me. Obviously, your ggplot example is even more valid.

Yes, rowMeans is great. But if your data is already in the long format, you can pipe to summarize and then apply any function.

Similarly, if you data is in the long format, you can add metadata in additional columns. It's arguably not very elegant, but it works.

ADD REPLY
0
Entering edit mode

"if you data is in the long format, you can add metadata in additional columns. It's arguably not very elegant, but it works." One long format tibble cannot include everything from an ExpressionSet object. One solution is to get a table from ExpressionSet, do sth. like stat, plotting, then write all revised information into the called table in an ExpressionSet.

ADD REPLY
0
Entering edit mode

it's not just that storing the metadata in skinny format isn't elegant, it's more space-consuming as the others have alluded to, too. My approach tends to be similar to data bases, using data.tables and so on: I have a small metainfo data.table, similar to what would be stored in phenoData and merge the corresponding information when needed. That is nowhere near the neat structure of ExpressionSet though. For me, storing data in long format often feels like a waste of memory for those reasons, which is why I try to find the optimal solution for every calculation and try to generate the skinny data.tables only on the fly for piping them into ggplot. That being said, I'm definitely guilty of ending up with way too many skinny data.tables at the end of a day.

ADD REPLY
2
Entering edit mode
6.6 years ago
russhh 5.7k

I think this is a really good call.

If you look at how tidygraph is organised, there are some ideas in there that could scale to the manipulation of other multi-table datastructures.

The only real issues are that there are multiple tables that are relevant (when plotting, manipulating etc), that none of these tables (assayData, phenoData and featureData) are tidy, and that naively combining these tables would make a single, giant, but tidy, dataframe.

That is, I could write you a function this morning that takes this:

{assayData = [G x N matrix]; phenoData = [N X P data-frame]; featureData = [G X M data-frame]; each component indexed by rownames and colnames}

into this:

eset_tibble = [(GN) x (P + M + 3) tbl-dataframe]

Like this: eset_tibble <- as_data_frame(eset)

Where in addition to the columns in phenoData and featureData, the latter has a column to indicate each of feature_id, sample_id (ie, the dimnames from the original ExpressionSet) and the expression value for feature f in sample s. That would be unwieldy to work with, would have tons of duplicated information, and would not be useable in bioconductor tools that expect an ExpressionSet. So what value would it have? It woulld be more compatible with ggplot2, dplyr and tidyr; and so you open up a lot of easy plotting and summarisation tools.

That's not a good enough trade-off for me. But if the expressionSet is filtered / subset before converting it into a tidy dataframe then the ggplot-able benefits start to outweigh the possibility of crashing R every time you attempt this.

Given a bit more time, filter and select etc could be written to work with ExpressionSet's directly: I can see a value in using tidyverse functions on bioconductor objects even if the latter don't conform to tidy* standards.

Take filter though. How would it work in this setting? If it's written to manipulate ExpressionSets, then it should return an ExpressionSet. But should it modify G (the number of features) or should it modify N (the number of samples). In a typical use case I might want to extract the data for some genes, and make a boxplot with-respect to some phenotypic factor. In another, I might want to extract a subset of samples and compute a summary over all the features for those samples. in another I might want to filter on both the samples and the features. So filter should be able to modify both G and N (and should modify all tables within an eset to respect the changes). Similarly, select should be able to modify both the columns of featureData and the columns of phenoData according to user requirements.

tidygraph deals with a similar issue. It seems to store graphs as both a nodes data-frame and an edges data-frame. In a workflow, you may want to manipulate both the nodes data-frame (add node-level statistics to it, say) and the edges data-frame (to add or remove edges, or something). To do that you use the same verbs as if you were working with a single data-frame in dplyr. But, tidygraph adds the function activate so that you can tell it whether to manipulate the edges data-frame or the nodes data-frame.

I can see how this approach would be pretty valuable for manipulating ExpressionSets. So the hypothetical user interface would look something like:

my_eset %>%
# Extract a sub-expressionSet related to gene - this should reduce the rows in both assayData and featureData
    activate(features) %>%
    filter(feature_id == "ENSG0000001234567") %>%
# Extract some independent variables for the samples - this should reduce the columns in phenoData
    activate(samples) %>%
    select(my_important_factor, the_covariate_that_makes_it_meaningless) %>%
# Convert the ExpressionSet to a tibble
    as_data_frame() %>%
# Plot it out
    ggplot(aes(x = my_important_factor, y = expression_level)) +
    geom_boxplot() +
    facet_wrap(~ the_covariate_that_makes_it_meaningless)

The above should give the same results as the memory killing (but again, hypothetical):

my_eset %>% 
    as_data_frame() %>%
    filter(feature_id =  "ENSG0000001234567") %>%
    select(feature_id, sample_id, my_important_factor, the_covariate_that_makes_it_meaningless) %>%
    ggplot(aes(x = my_important_factor, y = expression_level)) +
    geom_boxplot() +
    facet_wrap(~ the_covariate_that_makes_it_meaningless)
ADD COMMENT
1
Entering edit mode

That's a very interesting approach. Does activate actually work with ExpressionSet already or was that a hypothetical scenario?

ADD REPLY
0
Entering edit mode

No this is hypothetical (I've updated the answer to indicate this more explicitly). tidygraph is the best example I can find of something using tidyverse language to manipulate multitable data structures (so it made sense to try and steal it's method names & approach). I'll have a look into it's source code over the weekend and see how it can be repurposed to ExpressionSets

ADD REPLY
0
Entering edit mode

Sorry I never posted my code for doing this kind of stuff. Will update when I've had a look at my package again

ADD REPLY

Login before adding your answer.

Traffic: 2074 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6