I work with various high-throughput datasets and was wondering about best practices for handling (and jointly analysing) phenotype data together with the actual high-throughput measurements in R. I was thinking about using Bioconductor's eSet, where you can store the phenoData and the actual assay/featureData in separate slots. But this raises a question: if I want to use both the phenodata and the high-throughput data in the same analysis/model, how do I best achieve that?
Practical example:
Phenodata with the following variables: age, sex, weight, height, shoesize
High-throughput data with variables: x1 .. x1000000
I would want to run, say, a linear regression with the lm() function, using the following formula:
weight ~ x1 + age + sex
What's the best way of doing this? Do I just merge phenoData and featureData into a single data.frame and feed that to lm() as the data? To me, this kind of defeats the purpose of storing the data in separate slots in the first place, rather than just using a single data.frame with merged phenodata and featuredata. Any ideas?
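One pattern that avoids a full merge is to keep the data in the ExpressionSet and pull out only the feature(s) a given model needs, combining them with pData() on the fly. A minimal sketch (the sample data, variable names, and the feature name "x1" are all made up for illustration):

```r
library(Biobase)

## Build a toy ExpressionSet: 3 features x 20 samples, plus phenotype data
set.seed(1)
n <- 20
pheno <- data.frame(
  age    = round(runif(n, 20, 60)),
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  weight = rnorm(n, 70, 10),
  row.names = paste0("sample", seq_len(n))
)
expr <- matrix(rnorm(3 * n), nrow = 3,
               dimnames = list(c("x1", "x2", "x3"), rownames(pheno)))
eset <- ExpressionSet(assayData = expr,
                      phenoData = AnnotatedDataFrame(pheno))

## Combine just the one feature with the phenotype data, not the whole matrix
df    <- pData(eset)
df$x1 <- exprs(eset)["x1", ]

fit <- lm(weight ~ x1 + age + sex, data = df)
summary(fit)
```

This way the eSet remains the single source of truth (samples stay aligned via the shared sample names), and the temporary data.frame passed to lm() only ever holds the handful of columns the model actually uses.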
I strongly suggest you take a look at the documentation for the limma package (if you are using microarrays) or the edgeR package (if you are using NGS data) for ways to perform this kind of analysis. There are other Bioconductor packages you could use for this, but the documentation in those two should be very useful.
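To sketch the limma idea: rather than calling lm() once per feature, you build a design matrix from the phenotype data and fit one moderated linear model per feature in a single call. Note the model is flipped relative to the question's formula: each feature is the response and weight (plus age and sex) are covariates, which is the usual limma setup. The toy data below is invented for illustration:

```r
library(limma)

## Toy data: 5 features x 20 samples, plus phenotype covariates
set.seed(1)
n <- 20
pheno <- data.frame(
  weight = rnorm(n, 70, 10),
  age    = round(runif(n, 20, 60)),
  sex    = factor(sample(c("F", "M"), n, replace = TRUE))
)
expr <- matrix(rnorm(5 * n), nrow = 5,
               dimnames = list(paste0("x", 1:5), NULL))

## Design matrix from the phenotype data; one linear model per feature
design <- model.matrix(~ weight + age + sex, data = pheno)
fit <- lmFit(expr, design)
fit <- eBayes(fit)  # moderated t-statistics via empirical Bayes

## Features most associated with weight, adjusted for age and sex
topTable(fit, coef = "weight")
```

lmFit() also accepts an ExpressionSet directly, so the phenodata/featuredata separation the question describes maps naturally onto design matrix vs. assay matrix.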