Hi,
I'm trying to do a single-sample pathway enrichment analysis with Kallisto/Sleuth. I have 3 control samples and 3 mutated samples. I have good reasons to believe that each mutated sample has a large number of differentially expressed genes/pathways that are specific to that sample, and that these mask a core set of genes or pathways that are differentially regulated in all 3. I'm interested in both the common set of pathways and the sample-specific ones, so simply comparing 3 control vs 3 mutated won't do it.
I was thinking about comparing the 3 control samples to the mutated samples one by one, to define the differentially expressed genes specific to each mutated sample. I estimated transcript-level expression with Kallisto, and used Sleuth to aggregate data at the gene level and do the usual differential expression with 3 controls vs 1 mutated sample, once per mutated sample. I now have 3 lists of differentially expressed genes. So far so good (even though the results might not be super reliable).
However, I would really like to do a pathway-level analysis with Sleuth instead of the gene-level analysis. As Sleuth works with transcript-level data, I had to supply a transcript -> gene table so it could aggregate transcript-level data into gene-level data. I can generate a transcript -> pathway table, for example with MSigDB/Reactome sets. However, many genes are part of several pathways, and Sleuth fails at the aggregation step:
reading in kallisto results
dropping unused factor levels
....
normalizing est_counts
88212 targets passed the filter
normalizing tpm
merging in metadata
aggregating by column: pathway
15688 genes passed the filter
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 15599004 rows; more than 4701355 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
I'm trying to figure out what to do with this, and I would appreciate any feedback or comments.
- Is it a reasonable approach at all to compare the 3 control replicates to single mutated samples?
- How would you do the aggregation where genes/transcripts belong to multiple pathways?
Thanks,
Endre
You should not aggregate genes to the pathway level (that does not work, since a gene is part of many pathways). Instead, you should use gene-set analysis tools. The easiest to use in R is probably gProfileR.
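To make the idea of a gene-set test concrete, here is a minimal sketch of an over-representation test (upper-tail hypergeometric) in plain Python. It is not gProfileR's implementation, only an illustration of the general technique. The gene names, pathway names and the functions `enrichment_p` and `pathway_enrichment` are hypothetical. Each pathway is tested on its own, so a gene that sits in several pathways is simply counted once in each test and nothing has to be joined or duplicated.

```python
from math import comb


def enrichment_p(n_universe, n_pathway, n_de, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    X is the number of pathway genes seen when n_de genes are drawn
    without replacement from n_universe genes, n_pathway of which
    belong to the pathway.
    """
    upper = min(n_pathway, n_de)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_de - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_de)


def pathway_enrichment(de_genes, pathways, universe):
    """Test every pathway separately; overlapping membership is fine."""
    universe = set(universe)
    de = set(de_genes) & universe
    result = {}
    for name, members in pathways.items():
        members = set(members) & universe
        overlap = len(de & members)
        result[name] = enrichment_p(len(universe), len(members), len(de), overlap)
    return result


# Hypothetical toy data: g3 belongs to both pathways.
universe = {f"g{i}" for i in range(1, 11)}
pathways = {
    "pathway_A": {"g1", "g2", "g3"},
    "pathway_B": {"g3", "g4", "g5", "g6"},
}
de_genes = {"g1", "g2", "g3"}  # e.g. one mutated sample vs the 3 controls

results = pathway_enrichment(de_genes, pathways, universe)
for name in sorted(results):
    print(name, round(results[name], 4))
```

Running this once per mutated-sample gene list would give one set of pathway p-values per sample, which can then be compared to find shared and sample-specific pathways. The p-values would still need multiple-testing correction across pathways.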
Yes, that's one of the questions: how to aggregate when a gene belongs to many pathways? :) Lior Pachter tweeted a while ago that they had used kallisto/sleuth for pathway-level analysis. Later they had a preprint where they aggregated transcript-level data to do GO enrichment analysis, using sleuth p-values, Lancaster p-value aggregation and BH correction. This is similar to what I want to do, but not exactly the same, and it is what motivated me to think about pathway aggregation.
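A rough sketch of that p-value aggregation idea, in plain Python, might look like the following. Note the swap: instead of the weighted Lancaster method it uses Fisher's method, which is the special case of Lancaster with every weight equal to 2 and has a closed-form chi-square tail, so no statistics library is needed. The gene p-values, pathway names and function names are hypothetical, and this is not the preprint's code. Because p-values are combined per pathway rather than joined into one table, a gene in several pathways just contributes its p-value to each of them.

```python
from math import exp, log


def fisher_combine(pvalues):
    """Combine p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom under the null (assuming independent
    p-values). For even degrees of freedom the upper tail is
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    # Floor p-values to avoid log(0).
    half_stat = -sum(log(max(p, 1e-300)) for p in pvalues)
    term, series = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        series += term
    return min(1.0, exp(-half_stat) * series)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from largest to smallest p
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def combine_by_pathway(gene_p, pathways):
    """Return (names, raw combined p-values, BH-adjusted p-values)."""
    names = sorted(pathways)
    raw = [
        fisher_combine([gene_p[g] for g in pathways[n] if g in gene_p])
        for n in names
    ]
    return names, raw, benjamini_hochberg(raw)


# Hypothetical gene-level p-values; g3 is a member of both pathways.
gene_p = {"g1": 0.001, "g2": 0.04, "g3": 0.20, "g4": 0.70}
pathways = {"pathway_A": ["g1", "g2", "g3"], "pathway_B": ["g3", "g4"]}

names, raw, adjusted = combine_by_pathway(gene_p, pathways)
for n, p, q in zip(names, raw, adjusted):
    print(n, p, q)
```

One caveat with this kind of approach: Fisher's method assumes the combined p-values are independent, and pathways that share genes produce correlated tests, so the results are best read as a ranking rather than as exact error rates.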