Hello,
I want to get a list of all genes expressed in the adult human brain. My thought is to use the Allen Adult Human Brain Atlas microarray data set, taking any gene expressed in any region at least once in all 3 Caucasian samples. My justification for using the ABA is that, whilst the n is small and other brain atlases have more specimens, nothing is comparable to the breadth of the ABA in terms of whole-brain coverage. That said, perhaps using a data set with a bigger n would make more sense in this case?
For a gene to count as expressed, I would just require that at least 2 probes have a detectable value in any region. This is presumably very noisy, but it doesn't rely on an arbitrary cutoff.
Just wondering what the community's thoughts are, and whether anyone has suggestions for alternative approaches?
Thanks in advance!
I think that what source to choose and how to define a gene as expressed depends a lot on what you want to do downstream with this information.
If you are not too worried about false positives, and you are not concerned about genes/transcripts that might be missed by the arrays used, then the approach you are considering on ABA will likely do. If you want to be more stringent with the same dataset, then you should probably consider a gene as expressed only if it has detectable expression in all three Caucasian samples. And so on...
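As a rough illustration of the two levels of stringency, here is a minimal Python sketch. It assumes the present/absent calls have already been loaded into a nested dict (donor, then probe, then a gene symbol plus one boolean per region). That layout, the function name `expressed_genes`, and the toy gene and probe names are made up for the example and are not the ABA download format. It also assumes "at least 2 probes" means two probes that are each detected in at least one region, not necessarily the same one.

```python
from collections import defaultdict


def expressed_genes(calls, min_probes=2, require_all_donors=True):
    """Return the set of genes called expressed.

    calls: {donor: {probe_id: (gene, [detected_in_region, ...])}}
    (hypothetical layout, chosen only for this sketch).
    """
    per_donor = []
    for probes in calls.values():
        # Count probes per gene that are detected in at least one region.
        probe_count = defaultdict(int)
        for gene, detected in probes.values():
            if any(detected):
                probe_count[gene] += 1
        per_donor.append({g for g, n in probe_count.items() if n >= min_probes})

    if not per_donor:
        return set()
    if require_all_donors:
        # Stringent: the gene must pass in every donor.
        return set.intersection(*per_donor)
    # Lenient: passing in any single donor is enough.
    return set.union(*per_donor)


# Toy data: two donors, two regions, two probes per gene.
toy_calls = {
    "donor1": {
        "p1": ("GENE_A", [True, False]),
        "p2": ("GENE_A", [False, True]),
        "p3": ("GENE_B", [True, False]),
        "p4": ("GENE_B", [False, False]),
    },
    "donor2": {
        "p1": ("GENE_A", [True, True]),
        "p2": ("GENE_A", [True, False]),
        "p3": ("GENE_B", [True, True]),
        "p4": ("GENE_B", [True, False]),
    },
}

strict = expressed_genes(toy_calls)
lenient = expressed_genes(toy_calls, require_all_donors=False)
```

In the toy data GENE_B has only one detected probe in donor1, so it survives the lenient rule but not the stringent one.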
I would look at the NIH Roadmap Epigenomics project. They have RNA-seq for different regions of the brain. As for determining whether genes are expressed, we've used some sort of quantile cutoff (e.g. the top 70% of genes), or come up with a "background signal" and considered anything that is 2 or 3 SDs above that as expressed. For arrays you can use the dark/negative spots; for RNA-seq, maybe use FPKM from intergenic regions...
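To make the "background signal" idea concrete, here is a small Python sketch using only the standard library. It assumes you already have a list of background intensities (negative-control spots for arrays, or intergenic FPKM values for RNA-seq) and one summary value per gene. The function names and toy numbers are invented for illustration, and using mean plus n sample SDs is just one simple way to set the cutoff.

```python
import statistics


def background_cutoff(background_values, n_sd=2.0):
    """Cutoff = mean of the background + n_sd sample standard deviations."""
    mu = statistics.mean(background_values)
    sd = statistics.stdev(background_values)
    return mu + n_sd * sd


def above_background(expression, background_values, n_sd=2.0):
    """Genes whose value is strictly above the background cutoff.

    expression: {gene: value} (hypothetical per-gene summary).
    """
    cutoff = background_cutoff(background_values, n_sd)
    return {gene for gene, value in expression.items() if value > cutoff}


# Toy numbers: background has mean 2.0 and sample SD 1.0.
toy_background = [1.0, 2.0, 3.0]
toy_expression = {"GENE_A": 5.0, "GENE_B": 3.5, "GENE_C": 4.0}

called = above_background(toy_expression, toy_background, n_sd=2.0)
```

With real data you would probably want to work on log-transformed values, since raw intensities and FPKMs are strongly right-skewed and the SD is then dominated by a few large values.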
Thanks, that seems like a promising strategy! I will look into it. Just to give more background, my reason for doing this is to reduce the dimensionality (number of gene sets) for a pathway analysis I am running.
Yes, that is a good idea. We've spent many hours discussing how to correct for tissue-specific expression when doing gene-set enrichment analysis from expression data. As a crude, quick and dirty method, calculate the average expression per gene across all samples and use the top 75% of those as a background gene set. Or just feed those into GO analysis and ignore anything that comes up in both the "background" and "test" datasets.
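The quick and dirty version could look something like the Python sketch below. It assumes the expression values are in a plain dict of gene to a list of per-sample values. The function name `top_fraction_by_mean` and the toy values are made up, and the number of genes kept is simply rounded down.

```python
def top_fraction_by_mean(expr, fraction=0.75):
    """Keep the top `fraction` of genes ranked by mean expression.

    expr: {gene: [value_per_sample, ...]} (hypothetical layout).
    """
    means = {gene: sum(values) / len(values) for gene, values in expr.items()}
    # Rank genes from highest to lowest mean expression.
    ranked = sorted(means, key=means.get, reverse=True)
    n_keep = int(len(ranked) * fraction)  # rounds down
    return set(ranked[:n_keep])


toy_expr = {
    "GENE_A": [10.0, 12.0],
    "GENE_B": [5.0, 7.0],
    "GENE_C": [1.0, 1.0],
    "GENE_D": [8.0, 8.0],
}

background_set = top_fraction_by_mean(toy_expr, fraction=0.75)
```

The resulting set would then be passed as the background (universe) to whatever enrichment tool you use, instead of the whole genome.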