How to build a dataset for a machine learning model that predicts antibiotic resistance?
2.6 years ago
Kumar ▴ 170

Hi, I am looking to build a machine learning model for identifying antibiotic resistance in bacteria. I have 400 assembled FASTA files of Salmonella Dublin and a metadata file (sample ID, date, genes, etc.). However, I have only basic machine learning skills. Any suggestions on how to process the dataset for building a model would be appreciated.

Thank you,

learning bacteria antibiotic machine resistant • 989 views
2.6 years ago
Mensur Dlakic ★ 28k

I am not sure that machine learning is the best approach for predicting antibiotic resistance. Since most resistance genes are well known, it is simply a matter of finding a reliable homolog in existing databases.

But to answer your question: modern machine learning (ML) is all about proper data representation. So you need to find a way to represent protein sequences such that those that confer antibiotic resistance look different from those that don't. The raw sequence will not do, because protein lengths differ and most ML methods do not handle inputs of different lengths well.

So how do you represent your protein sequences as vectors of the same length, regardless of protein size? One way would be to do sequence embedding like here, which gives a constant-size vector (1024) for each sequence. There are many other sequence embedding approaches, and they should be easy to find on GitHub. Alternatively, you can pick 200 representative antibiotic resistance proteins (or whatever number is needed) and BLAST-compare each of your proteins of interest to this mini database. That will give you 200 BLAST E-values per protein, which can be used as 200 features for training and classification. I am sure there are many other approaches, and it is up to you to come up with a creative and meaningful representation, even though I think ML methods are overkill in this instance because sequence comparison methods work fine.
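As a rough illustration of the BLAST idea above, here is a minimal sketch that turns tabular BLAST output into one fixed-length vector per query protein. It assumes the standard 12-column tabular format (`-outfmt 6`), where the E-value is the 11th column. The function name `evalue_features`, the `-log10` transform, the cap of 200, and the use of 0.0 for "no hit" are all illustrative choices of mine, not part of the suggestion itself; raw E-values could be used instead.

```python
import math
from collections import defaultdict


def evalue_features(blast_lines, reference_ids, cap=200.0):
    """Build one fixed-length feature vector per query protein.

    blast_lines   : iterable of tab-separated BLAST lines (12-column tabular)
    reference_ids : ordered list of reference protein IDs (one feature each)
    cap           : upper bound for -log10(E-value); also used when E-value is 0

    Returns {query_id: [feature_1, ..., feature_N]}, N = len(reference_ids).
    References with no reported hit keep the value 0.0 (an assumption).
    """
    column = {ref: i for i, ref in enumerate(reference_ids)}
    features = defaultdict(lambda: [0.0] * len(reference_ids))

    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if subject not in column:
            continue  # hit against something outside the mini database

        # Smaller E-value means a stronger hit, so use -log10 as the score.
        if evalue <= 0.0:
            score = cap
        else:
            score = min(cap, max(0.0, -math.log10(evalue)))

        # Keep the best score if a query hits the same reference twice.
        col = column[subject]
        features[query][col] = max(features[query][col], score)

    return dict(features)


if __name__ == "__main__":
    # Made-up example lines, for illustration only.
    demo = [
        "protA\tref1\t98.0\t250\t5\t0\t1\t250\t1\t250\t1e-50\t400",
        "protA\tref2\t99.0\t250\t2\t0\t1\t250\t1\t250\t0.0\t520",
        "protB\tref1\t30.0\t40\t28\t0\t1\t40\t1\t40\t5.0\t20",
    ]
    for query, vector in evalue_features(demo, ["ref1", "ref2", "ref3"]).items():
        print(query, vector)
```

Every query ends up with a vector of the same length as the reference list, so the rows can be stacked into a feature matrix for any standard classifier.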

If you run out of ideas I suggest you search the literature for other papers that used ML for antibiotic resistance prediction, and try to emulate how they created their training and validation datasets.


Thank you for your reply. My question was a little off target and needs a correction: it is about predicting antimicrobial MICs.

I am trying to follow these two papers (links below). They use a k-mer/unitig-based approach to predict MICs with machine learning. However, I am still trying to figure out how to get unitigs from my assembled FASTA files. If I use AMRFinder or ResFinder, I get the AMR genes along with their nodes and genomic information, but the node sequences are longer than a unitig.

https://journals.asm.org/doi/pdf/10.1128/JCM.01260-18
https://www.sciencedirect.com/science/article/pii/S1319562X22001309
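For the k-mer side of such an approach, a minimal sketch of a presence/absence matrix built directly from assembled contigs might look like the following. Note that this uses plain canonical k-mers as a stand-in: true unitigs come from a compacted de Bruijn graph and need a dedicated tool, which this sketch does not attempt. The function names, the tiny `k`, and the toy sequences are hypothetical and chosen only for illustration; nothing here is taken from the linked papers.

```python
def parse_fasta(lines):
    """Return the list of contig sequences found in FASTA-formatted lines."""
    contigs, current = [], []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current:
                contigs.append("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        contigs.append("".join(current))
    return contigs


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(sequence, k):
    """Return the set of canonical k-mers (min of k-mer and its reverse complement)."""
    sequence = sequence.upper()
    kmers = set()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing N or other ambiguous bases
        reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, reverse_complement))
    return kmers


def kmer_matrix(genomes, k):
    """Build a binary presence/absence matrix.

    genomes : {sample_id: [contig_sequence, ...]}
    Returns (sample_ids, kmer_columns, rows), where rows[i][j] is 1 if
    sample i contains k-mer j, else 0.
    """
    per_sample = {
        sample: set().union(*(canonical_kmers(contig, k) for contig in contigs))
        for sample, contigs in genomes.items()
    }
    columns = sorted(set().union(*per_sample.values()))
    samples = sorted(per_sample)
    rows = [[1 if kmer in per_sample[s] else 0 for kmer in columns] for s in samples]
    return samples, columns, rows


if __name__ == "__main__":
    # Toy sequences, for illustration only; real runs would use a larger k.
    toy = {
        "sample1": parse_fasta([">NODE_1", "ACGTAC"]),
        "sample2": parse_fasta([">NODE_1", "TTTTTT"]),
    }
    samples, columns, rows = kmer_matrix(toy, k=3)
    print(columns)
    for sample, row in zip(samples, rows):
        print(sample, row)
```

Each row can then be paired with the MIC value from the metadata file as the training label. With 400 genomes and a realistic `k`, the number of columns becomes very large, so holding everything in Python sets as done here is only practical for small experiments.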


The less information you provide, the less useful the suggestions you get. It may not matter to you, but it also means the time we spend writing answers is less productive.
