Hello, I’m really lost in the sense that there is no direct guide for what I want to do, but I knew that when I signed up for grad school ☺. My idea is clear; I’m asking here to get a sense of the scope and challenges involved, so I can evaluate the work that needs to be done. I'm hoping that with your help I will be able to sketch a diagram of the major steps and then start working out the details.
My initial dataset is a 2-color microarray control set (250 subjects). The samples were taken from a specific tissue location, and the data include thousands of genes. I have a good age range (5-75).
My condition of interest is not very well understood, but there are both microarray and RNAseq datasets available online. The advantage of my initial dataset is that it comes from a rare and known tissue source.
My list of genes of interest includes 15-25 candidate genes that I selected from published meta-analyses and reviews; I want to investigate this condition starting from those genes. Specifically, I want to map their expression values across the life span [from healthy samples to the same life stages in the disease state and, if possible, to specific progression states of the disease].
1- To what extent could I utilize this initial dataset? I’m not exploring overall differential expression or doing a general analysis; I’m interested in correlation analysis of these genes, followed by enrichment analysis to see which pathways are involved. This is an introductory analysis; any other ideas about appropriate analyses?
2- I guess in the second phase of this research I have to do a meta-analysis of RNAseq data? That would involve combining controls and patients from different experimental designs and separating the patient samples into my main age groups [1-5, 5-15, 20-49, 50+]. I'm hoping to get a good number in each group, but I haven't given thought to how many samples I should have in each one. Can you refer me to a good guide on combining RNAseq data from different datasets? Any advice about major issues that I should be aware of? Is there a way to deal with missing values across different datasets, or should I consider fixing them individually? I'm really lost here :)
3- The expression value of a 2-color microarray is an average of the two dye signals after background subtraction (it took me a while to figure out how to appropriately clean, normalize, and then convert these values into absolute numbers to get the expression matrix ready for analysis in R), while RNAseq data represent expression as read counts, which are discrete values. How do I deal with this? [Mapping light intensity values to read counts? Is it even possible? Can it be meaningful?]
4- Some of these RNAseq studies don’t contain disease progression as a phenotype/variable; some only record whether certain important symptoms are present or absent. Any advice regarding this?
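To make the question about intensities versus read counts more concrete, this is roughly what I imagine trying instead of mapping one scale onto the other: put each dataset on a log scale, then standardize each gene within its own dataset so that only relative patterns are compared. It is a minimal Python sketch under my own assumptions, not a validated method; the function names and the toy numbers are made up for illustration, and I don't know whether per-dataset z-scores are adequate for real data.

```python
import math
from statistics import mean, pstdev


def log2_cpm(sample_counts):
    """Turn one RNAseq sample's raw read counts into log2(counts-per-million + 1)."""
    library_size = sum(sample_counts)
    return [math.log2(c / library_size * 1e6 + 1) for c in sample_counts]


def zscore_per_gene(matrix):
    """Standardize each gene (row) across the samples (columns) of ONE dataset."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sd = pstdev(row)
        # A gene with no variation carries no relative information; set it to 0.
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return scaled


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


# Toy RNAseq data (made up): 3 samples, each with raw counts for 3 genes.
rnaseq_counts = [[10, 200, 50], [30, 100, 80], [20, 400, 40]]
rnaseq_log = transpose([log2_cpm(sample) for sample in rnaseq_counts])  # genes x samples
rnaseq_z = zscore_per_gene(rnaseq_log)

# Toy microarray data (made up): 3 genes x 4 samples, already normalized log2 values.
array_log2 = [[7.1, 7.4, 6.9, 7.8], [10.2, 9.8, 10.5, 10.1], [5.5, 6.1, 5.9, 5.2]]
array_z = zscore_per_gene(array_log2)
```

After this, every gene has mean 0 within each dataset, so the absolute units (intensity versus counts) are gone and only the within-dataset pattern remains. That also removes any true difference in average level between datasets, which may or may not be acceptable for my question.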
I’m a fair beginner at analyzing high-throughput data, so please don’t assume I know every term. I have never worked on multiple datasets before, so before I start learning how to handle RNAseq data I want to know whether my plan is feasible and would constitute solid graduate work.
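For the age-group part of the plan, this is the kind of tally I would run first to see how many samples each group actually gets once the public datasets are pooled. It is only a Python sketch with hypothetical study names and made-up ages. I am assuming that the shared boundary (age 5) goes to the older group, and that ages 16-19 stay unassigned because the groups as I listed them do not cover that range.

```python
from collections import Counter

# Age groups as listed in the post: 1-5, 5-15, 20-49, 50+.
# Each entry is (label, lowest age included, first age excluded or None for open-ended).
# Assumption: age 5 belongs to "5-15"; ages 16-19 fall in no group.
AGE_GROUPS = [("1-5", 1, 5), ("5-15", 5, 16), ("20-49", 20, 50), ("50+", 50, None)]


def age_group(age):
    """Return the label of the age group containing `age`, or None if no group does."""
    for label, low, high in AGE_GROUPS:
        if age >= low and (high is None or age < high):
            return label
    return None


def tally(samples):
    """Count samples per (status, age group); `samples` holds (dataset, status, age) tuples."""
    counts = Counter()
    for _dataset, status, age in samples:
        counts[(status, age_group(age))] += 1
    return counts


# Made-up samples from two hypothetical studies.
toy_samples = [
    ("study_A", "control", 3),
    ("study_A", "patient", 34),
    ("study_B", "patient", 41),
    ("study_B", "patient", 17),
    ("study_B", "control", 62),
]
group_counts = tally(toy_samples)
```

Samples that land under the `None` group are the ones my current age bins would silently drop, so the tally also shows whether the bins need adjusting.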