Question

Creating Synthetic Sequences for a ML Model

10 days ago

biochugs • 0

I was thinking of pursuing a Machine Learning project to familiarize myself more with the topic. I want to create a model to predict the source of an outbreak based on sequence features from DNA sequences isolated from various countries.

Let's assume there's a virus X, and virus X has a conserved protein-coding region about 1000 bp long. This region is used to predict the genotype of the virus. I have thousands of DNA sequences of this coding region, isolated from virus X in various countries. However, some of these countries are severely underrepresented, with fewer than 10 isolated sequences, while other countries are represented by hundreds of sequences. What are some ways to tackle this data imbalance?

Once again, I'm new to this topic. I just want to see if it would be possible to approach this project even with a lack of data from some countries.

DNA ML • 250 views

updated 9 days ago by Mensur Dlakic ★ 27k • written 10 days ago by biochugs • 0

score 0 · Answer 1 · 2024-04-24

It is a good idea, though I don't see how it would work in practice.

I think you will run into at least one additional problem beyond the lack of data from some countries. What causes an outbreak could be such a small signal that it may not be detectable by any automated method. By that I mean that viruses mutate like crazy compared to cellular organisms, so there is always going to be some change. In that sea of mutations, one has to find the single mutation, or the few mutations, that change the receptor binding affinity from a chicken or pig to a human. It doesn't strike me as a project that is doable in the first place, and especially not by a newcomer to the field. No offense intended here; we simply have no information about your qualifications beyond what you shared.

As for the simulations you want to perform, there may be some work out there. The problem, again, is that viruses mutate at high rates. The vast majority of those mutations are non-viable, but viruses can afford to "sample" the mutational space at such high rates. In clinical samples we see only mutations that are neutral or potentially beneficial; let's group both under tolerable mutations. I don't know how one would simulate only tolerable mutations without a thorough understanding of all viral proteins and their structures, and of the ways they interact with host proteins.
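
To make the difficulty concrete, here is a minimal sketch of the crudest stand-in for "tolerable" mutations one could try: swapping codons for synonymous ones, so the encoded protein is unchanged. This is only an illustrative assumption, not an established method for this problem. The function names (`translate`, `synonymous_variant`) and the per-codon `rate` are hypothetical, the codon table is the standard genetic code, and the input is assumed to be in frame. Synonymous changes are not guaranteed to be neutral for a real virus, and sequences generated this way add no new geographic signal, so they cannot make up for countries with only a handful of isolates.

```python
import random
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate an in-frame DNA string; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def synonymous_variant(seq, rate=0.02, rng=None):
    """Return a copy of seq in which each codon is replaced, with
    probability `rate`, by a different codon for the same amino acid.
    Codons with ambiguity codes or without synonyms are left untouched."""
    rng = rng or random.Random()
    usable = len(seq) - len(seq) % 3
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if codon in CODON_TABLE and rng.random() < rate:
            alternatives = [c for c, aa in CODON_TABLE.items()
                            if aa == CODON_TABLE[codon] and c != codon]
            if alternatives:
                codon = rng.choice(alternatives)
        out.append(codon)
    # Keep any trailing bases that do not form a full codon.
    return "".join(out) + seq[usable:]


if __name__ == "__main__":
    original = "ATGGCTAGCCTGAAATAA"  # toy sequence, not real viral data
    variant = synonymous_variant(original, rate=0.5, rng=random.Random(1))
    print(original, translate(original))
    print(variant, translate(variant))
```

Both printed protein strings are identical by construction, which is exactly the limitation: the sketch avoids non-viable changes only by avoiding every amino acid change, including the rare ones that would matter for an outbreak.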