Hello fellow bioinformaticians,
This may well be an easy and solved problem, but I didn't find a standard solution for this. I'm also extremely new to the field, so please excuse me :)
I have expression data for different transcripts from 386 proteins in 25 different tissues (from GTEx - yes, the one that was getting all the bad rep recently...). I'm trying to find out if there are any proteins that have transcripts that are differentially expressed across tissues. I know that the transcripts themselves will be expressed at very different levels, but I want to find out which transcripts have a different expression pattern.
What I'm doing right now is:
For each protein:
- Get the RPKM values for each transcript in each tissue
- Sort the transcripts based on total RPKM across all tissues (so that the "reference" transcript is the one that's expressed the most)
- Fit a linear model in R: rpkm ~ tissue * transcript
- At this point I wasn't sure exactly what to do to figure out the important ones. I tried just performing ANOVA, but that seems to return that ALL proteins are significant. I tried looking at the summary of the model for each protein and just picking out the coefficients that corresponded to a low p-value for a tissue-transcript term, but that didn't seem to give correct results either.
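One possible reason the ANOVA flags everything: on the raw RPKM scale, a transcript that is always a fixed fraction of the reference already looks like a tissue-by-transcript interaction, because the difference is multiplicative rather than additive. Below is a minimal sketch of that effect, assuming a log transform is acceptable for this data. The toy values and the helper name interaction_residuals are made up for illustration, and the closed-form residuals only apply to a table with one value per tissue/transcript cell.

```python
import math

# Hypothetical toy values (not GTEx data): B is exactly half of A in every tissue.
rpkm = {
    "t1": {"A": 10.0, "B": 5.0},
    "t2": {"A": 20.0, "B": 10.0},
    "t3": {"A": 5.0, "B": 2.5},
}

def interaction_residuals(table, log=True):
    """Residuals left after fitting tissue + transcript main effects.

    For a complete table with one value per cell, the least-squares
    additive fit is row mean + column mean - grand mean.
    """
    tissues = list(table)
    transcripts = list(next(iter(table.values())))
    x = {
        (t, s): math.log2(table[t][s]) if log else table[t][s]
        for t in tissues
        for s in transcripts
    }
    grand = sum(x.values()) / len(x)
    row = {t: sum(x[t, s] for s in transcripts) / len(transcripts) for t in tissues}
    col = {s: sum(x[t, s] for t in tissues) / len(tissues) for s in transcripts}
    return {(t, s): x[t, s] - row[t] - col[s] + grand for (t, s) in x}

raw = interaction_residuals(rpkm, log=False)
logged = interaction_residuals(rpkm, log=True)

print(max(abs(v) for v in raw.values()))     # clearly non-zero on the raw scale
print(max(abs(v) for v in logged.values()))  # essentially zero on the log scale
```

So if the question is "does this transcript keep the same ratio to the others", fitting the model on log-transformed values may be closer to what is being asked than fitting it on raw RPKM.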
So in short, I'm just wondering if there's a standard tool or pipeline for determining whether different transcripts of the same gene have different expression patterns across tissues.
That's an interesting question, although I couldn't manage to figure out exactly what you are after. The notion of "pattern" would require some definition, I think. If the reference transcript of a protein is highly expressed in a single tissue, and not expressed at all in the others, would it be a hit? Since you use the word "pattern", I immediately thought of bi-clustering. Wouldn't that give you groups of transcripts having similar expression profiles across a wide range of samples?
To simplify matters, I would just define the expression of the "reference" transcript as the "pattern" to compare other transcripts against. I realize that this is not perfect, but I figure it's a fair enough starting point. So, given the expression of the reference transcript, which transcripts deviate from that expression pattern in some tissue? I generated heatmaps to see this visually, and it looks like for many proteins, most transcripts follow a very similar expression pattern across tissues, but there are interesting cases where one transcript has a spike in a specific tissue that the other transcripts don't. This is the kind of data I want to find systematically rather than visually.
The word "pattern" is used a bit too liberally for me to grasp what you mean, sorry :( But I think I understood what you're after: not differential expression per se, but rather answering questions of the type "Is transcript A the reference transcript in all tissues?", where the reference transcript is defined as the transcript with the max RPKM level. Similarly, "by reference to transcript A, is transcript B always second across all tissues?". If that's the case, I think I found something for you.
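Those two questions can be checked directly by ranking the transcripts within each tissue. This is just a rough sketch: the table and the tissue and transcript names are hypothetical, and the helper name ranking is made up for illustration.

```python
# Hypothetical RPKM table: outer keys are tissues, inner keys are transcripts.
rpkm = {
    "brain": {"A": 10.0, "B": 5.0, "C": 1.0},
    "liver": {"A": 20.0, "B": 10.0, "C": 2.0},
    "heart": {"A": 8.0, "B": 12.0, "C": 1.0},
}

def ranking(values):
    """Transcripts ordered from highest to lowest RPKM within one tissue."""
    return tuple(sorted(values, key=values.get, reverse=True))

rankings = {tissue: ranking(vals) for tissue, vals in rpkm.items()}
top = {tissue: order[0] for tissue, order in rankings.items()}

# "Is the same transcript on top in all tissues?"
same_reference_everywhere = len(set(top.values())) == 1
# "Is the whole ordering the same in all tissues?"
same_order_everywhere = len(set(rankings.values())) == 1

print(top)
print(same_reference_everywhere, same_order_everywhere)
```

Note that a rank-based check only catches changes big enough to swap the order, so a transcript that drops a lot in one tissue but stays second would go unnoticed.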
Sort of :)
Let's say this is my data (column names are transcripts, row names are tissue types):
I'll define A as the reference transcript simply because it has the highest total RPKM (10+20+5+30).
For transcript C, you can see that while the transcript itself is expressed much less, it follows the same "pattern" - all the values are roughly 1/10 of A. But for B, based on brain, liver and lung it seems like B is expressed at about half the level of A, but kidney doesn't follow that pattern - it is way underexpressed (RPKM of only 3 instead of the expected ~15).
So in this case, I would want to mathematically learn that from this dataset, transcript B at kidney is an interesting observation.
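The example above can be written down as a small sketch: take each transcript's log ratio to the reference in every tissue and flag the tissues where that ratio is far from the transcript's typical (median) ratio. The table is rebuilt from the description, so the tissue order for A and the exact B and C values are assumptions, and flag_deviations and the cutoff of 1 log2 unit (a two-fold change) are arbitrary choices for illustration.

```python
import math
from statistics import median

# Rebuilt from the description: A is 10, 20, 5, 30; B is about half of A
# except in kidney (3); C is about a tenth of A. Exact values are assumed.
rpkm = {
    "brain": {"A": 10.0, "B": 5.0, "C": 1.0},
    "liver": {"A": 20.0, "B": 10.0, "C": 2.0},
    "lung": {"A": 5.0, "B": 2.5, "C": 0.5},
    "kidney": {"A": 30.0, "B": 3.0, "C": 3.0},
}

def flag_deviations(table, threshold=1.0):
    """Return the reference transcript and (transcript, tissue, deviation) hits.

    Assumes every RPKM value is > 0; real data would need a pseudocount
    or filtering before taking logs.
    """
    totals = {}
    for vals in table.values():
        for transcript, value in vals.items():
            totals[transcript] = totals.get(transcript, 0.0) + value
    reference = max(totals, key=totals.get)  # highest total RPKM

    hits = []
    for transcript in totals:
        if transcript == reference:
            continue
        log_ratio = {
            tissue: math.log2(vals[transcript] / vals[reference])
            for tissue, vals in table.items()
        }
        typical = median(log_ratio.values())
        for tissue, value in log_ratio.items():
            if abs(value - typical) >= threshold:
                hits.append((transcript, tissue, round(value - typical, 2)))
    return reference, hits

reference, hits = flag_deviations(rpkm)
print(reference, hits)
```

With these toy values, A comes out as the reference and the only hit is B in kidney, which is the observation I'd want to recover. It says nothing about significance, though; that would need the replicates.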
Hopefully this makes it a little clearer. If not, don't worry too much about it, I'll figure it out :)
Yep it does make things clearer! Do you have replicates? Or can you group the tissues so as to have more degrees of freedom?
There are multiple samples from each transcript (coming from multiple people). However, the number of samples per tissue is not consistent. For example, there are over 300 samples from brain, but 50-100 for most other tissues.
So does the tool you mentioned help with this?
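On the uneven sample counts: one simple option (not a standard pipeline, just a sketch) is to compute the log ratio to the reference per sample and compare each tissue against the rest with Welch's t statistic, which does not require equal group sizes or equal variances. The numbers below are hypothetical per-sample log2(B/A) values, and tissue_vs_rest is a made-up helper name.

```python
import math
from statistics import mean, variance

# Hypothetical per-sample log2(B/A) values; sample counts differ per tissue.
log_ratio = {
    "brain": [-1.1, -0.9, -1.0, -1.2, -0.8, -1.0],
    "liver": [-1.0, -0.9, -1.1],
    "kidney": [-3.2, -3.4, -3.3, -3.1],
}

def welch_t(a, b):
    """Welch's t statistic for two samples with unequal sizes and variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def tissue_vs_rest(groups):
    """Compare each tissue's values against all other tissues pooled together."""
    scores = {}
    for tissue, values in groups.items():
        rest = [v for other, vals in groups.items() if other != tissue for v in vals]
        scores[tissue] = welch_t(values, rest)
    return scores

scores = tissue_vs_rest(log_ratio)
for tissue, t in sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True):
    print(tissue, round(t, 1))
```

Two caveats: pooling "the rest" lets the tissue with the most samples dominate the comparison group, and an outlying tissue inflates the pooled variance when the other tissues are being tested. Turning the statistic into a p-value also needs the Welch degrees of freedom and a multiple-testing correction across all transcripts and tissues.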