I'm a data scientist with no expertise in biology. I would like to know whether it makes sense, from a biological point of view, to analyze SNP data from a single chromosome, rather than from all 22 chromosomes, to predict the risk of a certain disease.
Do I have to use data from all chromosomes? If so, why?
Thank you very much. Sorry if this is a very basic question, but I would really like to understand this.
Well, I'm applying machine learning algorithms to the data I was given in order to predict a complex disease (the first dataset is for lung cancer and the second for type 2 diabetes). The biologists who gave me the data delivered data from all 22 chromosomes. Over the past few days I've read that both lung cancer and type 2 diabetes are complex diseases, and that they are affected by mutations in several genes. Since several genes are involved, they could be located on any chromosome, right? In that case, shouldn't I analyze the entire set of 22 chromosomes?
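To make the question concrete, here is a minimal sketch of what single-chromosome versus all-chromosome input could look like. It assumes genotypes are coded as minor-allele counts (0, 1 or 2) per sample and stored per chromosome; the chromosome labels, SNP ids, values and the `build_feature_matrix` helper are all hypothetical and not taken from the actual datasets.

```python
# Toy genotype data, coded as minor-allele counts (0, 1 or 2) per sample.
# Chromosome labels, SNP ids and values are made up for illustration only.
genotypes_by_chrom = {
    "chr1": {"rs_a": [0, 1, 2, 0], "rs_b": [1, 1, 0, 2]},
    "chr7": {"rs_c": [2, 0, 0, 1]},
    "chr22": {"rs_d": [0, 0, 1, 1], "rs_e": [1, 2, 2, 0]},
}


def build_feature_matrix(genotypes_by_chrom, chromosomes=None):
    """Return (column_names, rows) with one row per sample.

    If `chromosomes` is None, SNPs from every chromosome are used;
    otherwise only the listed chromosomes contribute columns.
    """
    selected = chromosomes if chromosomes is not None else list(genotypes_by_chrom)
    columns = []
    per_snp_values = []
    for chrom in selected:
        for snp_id, values in genotypes_by_chrom[chrom].items():
            columns.append(f"{chrom}:{snp_id}")
            per_snp_values.append(values)
    # Transpose from SNP-major to sample-major layout.
    rows = [list(sample) for sample in zip(*per_snp_values)]
    return columns, rows


all_columns, all_rows = build_feature_matrix(genotypes_by_chrom)
chr1_columns, chr1_rows = build_feature_matrix(genotypes_by_chrom, ["chr1"])
print(len(all_columns), "SNP features from all chromosomes")
print(len(chr1_columns), "SNP features from chr1 only")
```

Either matrix could then be passed to a classifier. Restricting the input to one chromosome simply drops every SNP column located elsewhere, so any risk variant on another chromosome would be invisible to the model.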
OK, I wasn't expecting machine learning as the method for SNP data. I thought you were working with sequencing reads aligned to a reference genome and finding variants from them. Can you be more precise about the type of data you are using, then?