Be aware up front that this is a difficult line of inquiry.
My two cents: you should be concerned about how weakly your data (nucleotide sequences) are likely to correlate with what you want to predict (disease). There is a large biological gap between nucleotide sequence and a state of disease; that gap is filled with complex systems biology that is not easily inferred from the nucleotide sequences alone. From a statistics perspective, this means your training data (given that you're interested in all diseases or disease in general) is unlikely to be simply correlated with your labels. Models will have a difficult time learning the highly non-linear and complex relationship between genes and disease, regardless of which models you use and even if you had infinite data available. That brings up the second concern: data availability.
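To make the "not simply correlated" point concrete, here is a minimal toy sketch in Python. It assumes a purely hypothetical two-locus model in which disease occurs only when exactly one of the two loci carries the variant allele (an XOR-style interaction). Real biology is far messier, and every name below is made up for illustration.

```python
from itertools import product
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# ASSUMPTION: a toy model with two biallelic loci (0 = reference, 1 = variant),
# all four genotype combinations equally common.
genotypes = list(product([0, 1], repeat=2))

# ASSUMPTION: disease (1) only when exactly one locus carries the variant.
labels = [a ^ b for a, b in genotypes]

locus_a = [a for a, _ in genotypes]
locus_b = [b for _, b in genotypes]

# Correlation of each locus, taken on its own, with the disease label.
corr_a = pearson(locus_a, labels)
corr_b = pearson(locus_b, labels)

print("locus A vs disease:", corr_a)
print("locus B vs disease:", corr_b)
```

In this toy setup each locus on its own is uncorrelated with the label, even though the pair of loci determines the label completely. A model that looks at one position at a time sees nothing, and this is with only two loci and no noise.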
There aren't many comprehensive gene archives that have reliable disease labels. You will find some for specific diseases, such as single nucleotide polymorphisms related to genetic diseases, or certain nucleotide changes that correlate with an increased risk for cancer. If you choose to focus on those, you may have more success. However, for the large majority of diseases, this isn't the case and we don't know the genetic causes.
You should also be concerned about the accuracy of the data, both in terms of the nucleotide sequences and the labels (disease + or disease -). Both are subject to error, so any training set you obtain is likely to contain some false positives and false negatives.
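As a rough illustration of why label error matters, the sketch below simulates noisy disease labels in Python. The error rates (10% false negatives, 5% false positives), the sample size, and the function name `flip_labels` are all assumptions chosen for the example, not values from any real dataset.

```python
import random


def flip_labels(labels, false_neg_rate, false_pos_rate, rng):
    """Return a copy of labels with simulated labelling errors.

    A true 1 (disease +) is recorded as 0 with probability false_neg_rate;
    a true 0 (disease -) is recorded as 1 with probability false_pos_rate.
    """
    noisy = []
    for y in labels:
        if y == 1 and rng.random() < false_neg_rate:
            noisy.append(0)  # false negative in the recorded label
        elif y == 0 and rng.random() < false_pos_rate:
            noisy.append(1)  # false positive in the recorded label
        else:
            noisy.append(y)
    return noisy


rng = random.Random(0)  # fixed seed so the run is repeatable

# ASSUMPTION: 10,000 samples with roughly balanced true disease status.
true_labels = [rng.randint(0, 1) for _ in range(10_000)]

# ASSUMPTION: hypothetical error rates, picked only for illustration.
noisy_labels = flip_labels(true_labels, 0.10, 0.05, rng)

# A hypothetical "perfect" classifier always predicts the true label.
# Scored against the noisy labels, it still appears to make mistakes.
agreement = sum(t == n for t, n in zip(true_labels, noisy_labels)) / len(true_labels)
print(f"apparent accuracy of a perfect classifier: {agreement:.3f}")
```

Under these assumed rates, even a classifier that is always right would be scored well below 100% against the recorded labels, so label error limits how well any model can appear to perform and makes it harder to tell a good model from a mediocre one.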
Suggestions that will help with a successful proposal are:
- Narrow your question to a specific, well-characterized genetic disease that has been thoroughly studied and funded, such as a genetically linked type of cancer or diabetes. You're more likely to obtain good data with reliable labels for those diseases, where the labels actually correlate with the nucleotide sequence.
- Find someone who is an expert in that disease process and check in with them every so often to make sure your model makes sense from a biological and medical perspective.
- Keep it as simple as you can. This isn't a new problem and, like many before you, you'll get in over your head very fast if you are too ambitious with the project.
Other researchers that use Biostars probably have additional advice for you based on their experiences.
Thank you for the insights.