I was thinking of pursuing a Machine Learning project to familiarize myself more with the topic. I want to create a model to predict the source of an outbreak based on sequence features from DNA sequences isolated from various countries.
Let's assume there's a virus X with a conserved protein-coding region about 1000 bp long. This region is used to predict the genotype of the virus. I have thousands of DNA sequences of this coding region, isolated from virus X in various countries. However, some of these countries are severely underrepresented, with fewer than 10 isolated sequences, while others have hundreds. What are some ways to tackle this data imbalance?
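To make the setup concrete, here is a rough sketch of what I have in mind. This is toy, made-up data, not my real sequences. It turns each sequence into k-mer counts and computes inverse-frequency class weights, n_samples / (n_classes * class_count), so that rare countries count more during training. The function names, k=3, and the weighting scheme are just my assumptions for illustration, and I'm not sure this is the right approach.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a DNA sequence, in a fixed A/C/G/T order.

    k-mers containing other characters (e.g. N) are simply not counted.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[km] for km in kmers]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Toy data (invented): country "A" is overrepresented, "B" is rare.
toy_seqs = ["ACGTACGT", "ACGTTCGT", "ACGAACGT", "TTGTACGA"]
toy_countries = ["A", "A", "A", "B"]

features = [kmer_counts(s) for s in toy_seqs]      # one 64-long vector per sequence
weights = balanced_class_weights(toy_countries)    # rare class gets the larger weight
```

The idea would be to pass these per-class weights to whatever classifier I end up using, so that mistakes on underrepresented countries are penalized more heavily.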
Once again, I'm new to this topic. I just want to see if it would be possible to approach this project even with a lack of data from some countries.
Thanks for the input. I'll continue to think of more ideas that may be achievable.