I am (sadly) a biologist with minimal bioinformatics (and biostats) experience. I hope my question does not sound too naïve.
I have about 20,000 gene expression values from 5 different conditions (0.5, 1.5, 7, 10 months). I would like to examine gene expression as a function of time, e.g. to identify which genes significantly increase or decrease in expression over time. That is, the expression must rise steadily (e.g. 1, 2, 3, 4, 5), or fall steadily, from month 0.5 to month 10.
What would the best method or tool be to perform this kind of analysis?
You can run 20,000 one-way ANOVAs (one factor, time, explaining the variance in gene expression). You have one independent variable (time) and one dependent variable (the expression of gene x), repeated 20,000 times.
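A minimal sketch of the per-gene one-way ANOVA idea, written in Python with SciPy rather than a dedicated expression package. It assumes you have replicate measurements at each time point, which the question does not state; the toy data, array shape and the function name anova_pvalues are made up for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical toy data: 3 genes, 5 time points, 3 replicates per time point.
# Array shape is (genes, time points, replicates).
rng = np.random.default_rng(0)
n_genes, n_times, n_reps = 3, 5, 3
expr = rng.normal(loc=10.0, scale=0.5, size=(n_genes, n_times, n_reps))
# Make gene 0 rise steadily over time so that it should stand out.
expr[0] += np.arange(n_times)[:, None] * 2.0


def anova_pvalues(expr):
    """Run a one-way ANOVA per gene, with time as the single factor."""
    pvals = []
    for gene in expr:
        # Each row of `gene` holds the replicates of one time point (one group).
        _, p = f_oneway(*gene)
        pvals.append(p)
    return np.array(pvals)


pvals = anova_pvalues(expr)
print(pvals)
```

Note that ANOVA only tells you that at least one time point differs from the others. It does not tell you whether the change is a steady increase or decrease, so you would still need to check the direction separately, and correct the p-values for multiple testing.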
Which genes are expressed incrementally in the 5 different conditions, from 0.5 to 10 months?
You can simplify this question if you study only one gene at a time. So, you can reduce your question to: is the expression of gene A increasing across these 5 conditions?
To answer that, you can fit a linear regression of expression against time for each gene. In R, you can use the lm function (e.g. a tutorial on lm). After calculating a p-value for every gene in the sample, you will have to apply a multiple-testing correction such as Bonferroni to limit false positives, and retain only the genes that significantly increase or decrease. Good luck!
(An example of a linear regression plot in R, taken from this tutorial: a tutorial on lm.)
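For readers who prefer Python, here is a rough equivalent of the per-gene regression with a Bonferroni cutoff, using scipy.stats.linregress instead of R's lm. The time points are the four values listed in the question; the three replicates per time point, the simulated expression values and the function name trend_test are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import linregress

# Time points (months) as listed in the question, with 3 hypothetical
# replicates per time point, giving 12 samples in total.
months = np.repeat([0.5, 1.5, 7.0, 10.0], 3)

# Simulated expression matrix, shape (genes, samples). Illustration only.
rng = np.random.default_rng(1)
n_genes = 4
expr = rng.normal(10.0, 0.3, size=(n_genes, months.size))
expr[0] += 0.8 * months  # gene 0 increases with time
expr[1] -= 0.8 * months  # gene 1 decreases with time


def trend_test(expr, months, alpha=0.05):
    """Regress each gene on time and label it using a Bonferroni cutoff."""
    threshold = alpha / expr.shape[0]  # Bonferroni: alpha / number of tests
    results = []
    for i, y in enumerate(expr):
        fit = linregress(months, y)
        if fit.pvalue < threshold:
            direction = "increasing" if fit.slope > 0 else "decreasing"
        else:
            direction = "no trend"
        results.append((i, fit.slope, fit.pvalue, direction))
    return results


results = trend_test(expr, months)
for gene, slope, pvalue, direction in results:
    print(gene, round(slope, 3), pvalue, direction)
```

With 20,000 genes the Bonferroni cutoff becomes 0.05 / 20,000 = 2.5e-6, which is hard to reach with only a handful of samples per gene, so a less conservative correction (for example Benjamini-Hochberg) may be worth considering.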
This sounds like what I need. I will give it a go, and comment here later on the outcome of my analysis. Thank you (and the other people who responded)!
There are two general approaches to large expression data sets in my experience.
Biological -> Data: Propose a biological hypothesis and see if it is supported by your data. This is susceptible to confirmation bias.
Data -> Biological: Cluster/group/classify your dataset in as unbiased a way as you can and see whether any clusters/groups can be interpreted biologically. This is hard to interpret when the data is noisy (which it almost always is).
Both can be valid approaches depending on what is known about your biological system and how good your data is.
Here are several things you can do with your data:
If you have raw count data, pre-process your counts by filtering out lowly expressed transcripts and outliers. How you do that can be as arbitrary as removing anything below 10 counts or removing any transcript that represents more than 2% of the entire read library. You can try to justify your filter biologically by removing anything below the intron transcription count or removing outlier ribosomal transcripts.
Perform differential expression analysis. edgeR and DESeq are two popular R packages for this.
Set a threshold p-value and fold-change for what you consider differentially expressed. This is also very arbitrary. Most people go with p-value < 0.01 and fold-change > 2.
You can also cluster your expression data using many different unsupervised clustering methods: k-means, hierarchical clustering, self-organizing maps, block clustering, fuzzy c-means. There are plenty of metrics you can use to assess the quality of the clustering, such as the gap statistic, silhouettes and split-silhouettes.
You can also try to de-noise your data a little bit before you attempt clustering. One way to do this is by running a principal component analysis on your data and only taking components representing 80-90% of the variance. Then cluster those components.
You can also look for specific classes of transcripts instead of clustering (increasing/decreasing trends for example). But you have to explicitly state how you want to classify this.
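Two of the steps above, the count filter and the PCA de-noising, can be sketched with plain NumPy. This is only one possible reading of those steps: the function names, the choice to apply the 10-count and 2% thresholds per sample, the log2 transform and the simulated counts are all assumptions, not a standard pipeline.

```python
import numpy as np


def filter_counts(counts, min_count=10, max_fraction=0.02):
    """Keep transcripts that reach min_count in at least one sample and
    never exceed max_fraction of any sample's library.

    counts has shape (transcripts, samples).
    """
    counts = np.asarray(counts, dtype=float)
    library_sizes = counts.sum(axis=0)  # total reads per sample
    fractions = counts / library_sizes  # share of each library
    keep = (counts.max(axis=1) >= min_count) & (fractions.max(axis=1) <= max_fraction)
    return counts[keep], keep


def pca_denoise(expr, variance_kept=0.9):
    """Project transcripts onto the leading principal components that
    together explain at least variance_kept of the variance."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    n_comp = int(np.searchsorted(np.cumsum(explained), variance_kept)) + 1
    n_comp = min(n_comp, len(s))
    scores = u[:, :n_comp] * s[:n_comp]  # one row of scores per transcript
    return scores, explained[:n_comp]


# Simulated counts: 200 transcripts x 5 samples. Illustration only.
rng = np.random.default_rng(2)
counts = rng.poisson(100, size=(200, 5)).astype(float)
counts[0] = 5000.0  # one transcript dominating the library
counts[1] = 3.0     # one lowly expressed transcript

filtered, keep = filter_counts(counts)
scores, explained = pca_denoise(np.log2(filtered + 1.0))
print(filtered.shape, scores.shape, explained.sum())
```

The scores matrix, rather than the full expression matrix, is what you would then pass to a clustering method.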
It sounds like you are going with Data -> Biological, where you want to classify your data in a certain way (increasing/decreasing trend) and see what biological interpretation you can make. One of the most challenging things when analyzing large expression datasets from a biological perspective is to define very explicitly what you are looking for. When you say "significantly increase/decrease over time", what exactly do you want to see? An increasing/decreasing trend with more than a 2-fold change between each pair of consecutive time points? A comparison of differential expression between just 0.5 and 10 months, disregarding any changes in between? Or maybe a certain fold-change between each time point AND significant differential expression between 0.5 and 10 months?
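To show how much the definition matters, here is a small Python sketch that labels a single gene's time course under two of the definitions above. The function name classify_trend, its parameters and the example values are hypothetical; it assumes strictly positive expression values on a linear (not log) scale and ignores statistical significance entirely.

```python
def classify_trend(values, min_fold=2.0, consecutive=False):
    """Label a time course as 'increasing', 'decreasing' or 'other'.

    The trend must be strictly monotonic. If consecutive is False, the
    fold-change is checked between the first and last time point only;
    if True, it is checked between every pair of consecutive time points.
    """
    steps = list(zip(values, values[1:]))
    rising = all(b > a for a, b in steps)
    falling = all(b < a for a, b in steps)

    if consecutive:
        big_up = all(b / a >= min_fold for a, b in steps)
        big_down = all(a / b >= min_fold for a, b in steps)
    else:
        big_up = values[-1] / values[0] >= min_fold
        big_down = values[0] / values[-1] >= min_fold

    if rising and big_up:
        return "increasing"
    if falling and big_down:
        return "decreasing"
    return "other"


print(classify_trend([1, 2, 3, 4, 5]))                     # first vs last only
print(classify_trend([1, 2, 3, 4, 5], consecutive=True))   # every step >= 2-fold
print(classify_trend([1, 2, 4, 8, 16], consecutive=True))
```

The same series 1, 2, 3, 4, 5 passes the first definition but fails the stricter one, because the step from 2 to 3 is only a 1.5-fold change.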