Hello all,
I've been working on a tool to identify the motifs within the ITS region between 16S and 23S. I've gotten pretty decent results, in particular for D1-D1', tRNA, BoxB and BoxA. For some sequences we find all the motifs, but V3, for example, is pretty hard to find.
I'm not a biologist, but I've been working closely with some good people in the field to get a better understanding of how this should all work, and I've gotten pretty good feedback.
One of the problems I'm trying to solve next is that I don't have a way of verifying if the results are correct. For example, my tool sometimes finds more than one potential D1-D1' motif. I present those to the researcher for them to do the final legwork of choosing which one makes sense in that application, but I would love to bridge that gap in the tool.
I've tried running the results through folding software such as Mfold or RNAfold, and many times both candidates fold fine (I mean that, at a glance, my bio friends look at the structures and think they could be accurate). I've asked around, and some tell me they would normally read the available research on the species they are working with to see what makes sense, but that doesn't satisfy me.
Does anyone know if there is a better way I can check if a motif I found is good?
In case you're curious, the tool we made is called CIMS: https://github.com/nlabrad/CIMS-Cyanobacterial-ITS-motif-slicer. You can feed it a FASTA file and test it here: https://phylo.dev
Thanks a bunch!
Hello Mensur,
Thank you for taking the time to look into my question. Honestly this makes me feel a bit better.
I’ve been wondering whether training a supervised machine learning model on verified results would give better accuracy and maybe make the tool more flexible, but at this point that is something for the future.
I’m hoping that our tool can at least speed up the process of looking into these motifs for phylogeny research. I’ve seen what many people use to find the motifs in a sequence manually, and it takes about 5-10 minutes per sequence, whereas this tool can run 7000+ sequences in 2 seconds.
If you work with cyanos and have any input on what could make this tool better or more trustworthy, we are more than open to criticism.
Thank you again!
Yes, it would. If you don't mind some shameless self-promotion on my part, an example of that can be found here. You will need a large database of trusted positive and negative matches, and a way to devise some discriminative features beyond simple motif scores. I wouldn't necessarily recommend SVMs, as there are more advanced classifiers, but the general idea described in that paper should be applicable to your task.
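To make "discriminative features beyond simple motif scores" a bit more concrete, here is a minimal sketch of what a feature vector for one candidate motif might look like. It is not taken from the paper or from CIMS: the function names, the choice of features (length, GC content, number of consecutive base pairs closing the candidate from its two ends) and the `motif_score` input are all assumptions made for illustration. Vectors like this, computed for trusted positive and negative matches, are what you would feed to whichever classifier you pick.

```python
from typing import Dict

# Watson-Crick pairs plus the G-U wobble pair (an assumption about what to allow).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def stem_pairs(seq: str, max_stem: int = 8, min_loop: int = 3) -> int:
    """Count consecutive base pairs formed by pairing the two ends inward.

    Stops at the first mismatch, after max_stem pairs, or when fewer than
    min_loop unpaired nucleotides would remain between the paired positions.
    """
    rna = seq.upper().replace("T", "U")
    i, j = 0, len(rna) - 1
    count = 0
    while count < max_stem and j - i - 1 >= min_loop:
        if (rna[i], rna[j]) not in PAIRS:
            break
        count += 1
        i += 1
        j -= 1
    return count


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in the candidate (0.0 for an empty string)."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s) if s else 0.0


def candidate_features(seq: str, motif_score: float) -> Dict[str, float]:
    """Hypothetical feature vector for one candidate motif."""
    return {
        "motif_score": motif_score,  # whatever score the motif finder reports
        "length": float(len(seq)),
        "gc_content": gc_content(seq),
        "stem_pairs": float(stem_pairs(seq)),
    }


if __name__ == "__main__":
    # Made-up candidate sequence, used only to show the call.
    print(candidate_features("GACCTGAAAACAGGTC", motif_score=1.0))
```

A real feature set would probably also include something from the folding step (for example the predicted free energy) and the candidate's position relative to the other motifs, but that depends on what your verified data can support.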