5.4 years ago
mel22
Dear all, I am working on a large population pooled from many case-control studies. For the genetic analysis, I would like to run a first analysis on one part of the population and then validate the results on the second part. How can I obtain two similar groups from the initial population?
Thank you for your help !
Without knowing anything about the structure of the data and how it's going to be processed, the only advice that can be given is to use a random split. For machine learning applications, it's common to use 67-80% of the data for training and the rest for testing. Both the training set and the test set have to be representative and the test set has to be large enough for results to be meaningful.
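A minimal sketch of such a random split in base R. The data frame `pheno`, its `id` and `status` columns, and the 70% fraction are all hypothetical and chosen only for illustration. Sampling within cases and controls separately is an added assumption here, so that both parts keep similar case/control proportions.

```r
set.seed(42)  # fix the seed so the split is reproducible

# Hypothetical phenotype table: 1000 individuals, 400 cases and 600 controls
pheno <- data.frame(
  id     = sprintf("IND%04d", 1:1000),
  status = rep(c("case", "control"), times = c(400, 600))
)

train_frac <- 0.7  # assumed fraction for the first (discovery) analysis

# Sample row indices within each status group to keep case/control proportions
rows_by_status <- split(seq_len(nrow(pheno)), pheno$status)
train_idx <- unlist(lapply(rows_by_status, function(rows) {
  rows[sample.int(length(rows), size = round(train_frac * length(rows)))]
}))

discovery  <- pheno[train_idx, ]   # first part
validation <- pheno[-train_idx, ]  # everyone not sampled above

# Check that both parts have similar case/control counts
table(discovery$status)
table(validation$status)
```

The two ID lists could then be written to files and used to subset the genotype data, for example with plink's --keep option, which expects family and individual IDs.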
Thank you Jean-Karim. It's environmental exposure data and genotyping data (DNA chip), and I would like to characterize interactions between exposure and some SNPs. So I am trying to validate my results in a second part of the population. I am using plink and R; how can I split my data in R in the best way (an accepted methodology)?
Thank you
If you're going to use R to apply supervised machine learning algorithms, I would suggest looking into the caret package. It has a createDataPartition() function for splitting data.
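As a rough illustration of the createDataPartition() suggestion, the sketch below splits a hypothetical phenotype table by case/control status. The `pheno` data frame, its columns, and p = 0.7 are assumptions made for the example, and it presumes the caret package is installed.

```r
library(caret)  # assumes the caret package is installed

set.seed(42)  # fix the seed so the split is reproducible

# Hypothetical phenotype table: 400 cases and 600 controls
pheno <- data.frame(
  id     = sprintf("IND%04d", 1:1000),
  status = factor(rep(c("case", "control"), times = c(400, 600)))
)

# For a factor outcome, sampling is done within each level.
# p is the fraction placed in the first part; list = FALSE returns
# a one-column matrix of row indices instead of a list.
in_discovery <- createDataPartition(pheno$status, p = 0.7, list = FALSE)

discovery  <- pheno[in_discovery[, 1], ]
validation <- pheno[-in_discovery[, 1], ]

# Case/control proportions should be close in both parts
prop.table(table(discovery$status))
prop.table(table(validation$status))
```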