Question

Is there a programmatic way of checking whether a cyanobacterial ITS region motif is accurate?

21 months ago

Nico ▴ 20

Hello all,

I've been working on a tool to identify the motifs within the ITS region between the 16S and 23S rRNA genes. The results are pretty decent, in particular for D1-D1', tRNA, BoxB and BoxA. For some sequences we find all the motifs, but others, V3 for example, are pretty hard to find.

I'm not a biologist, but I've been working closely with some good people in the field to get a better understanding of how this should all work, and I've gotten pretty good feedback.

One of the problems I'm trying to solve next is that I don't have a way of verifying if the results are correct. For example, my tool sometimes finds more than one potential D1-D1' motif. I present those to the researcher for them to do the final legwork of choosing which one makes sense in that application, but I would love to bridge that gap in the tool.

I've tried running the results through folding software such as Mfold or RNAfold, and often both candidates fold fine (at a glance, my biologist friends say the structures look like they could be accurate). I've asked around, and some people tell me they would normally read the published research on the species they are working with to see what makes sense, but that doesn't satisfy me.
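To make the "rank the candidates" idea concrete, here is a minimal sketch of a cheap pre-filter that could run before Mfold or RNAfold. It assumes a D1-D1' candidate should close into a helix from its two ends, and simply counts consecutive base pairs (Watson-Crick plus G-U wobble) from the ends inward. The function names and the toy sequences are hypothetical and not part of CIMS. This is a crude heuristic, not a folding algorithm, and it will not settle cases where both candidates pair equally well.

```python
# Canonical RNA pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def terminal_stem_pairs(seq: str) -> int:
    """Count consecutive base pairs formed from the two ends inward."""
    rna = seq.upper().replace("T", "U")  # accept DNA or RNA input
    count = 0
    i, j = 0, len(rna) - 1
    # Keep at least 3 unpaired bases between i and j for the hairpin loop.
    while j - i > 3 and (rna[i], rna[j]) in PAIRS:
        count += 1
        i += 1
        j -= 1
    return count


def rank_candidates(candidates):
    """Sort candidate motifs so the longest terminal stem comes first."""
    return sorted(candidates, key=terminal_stem_pairs, reverse=True)


# Toy candidates, made up for illustration only.
candidates = ["GACCAAAAAAAAC", "GACCUAAAAGGUC"]
for seq in rank_candidates(candidates):
    print(seq, terminal_stem_pairs(seq))
```

A score like this could be shown next to each candidate so the researcher has a number to compare, rather than replacing their judgment.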

Does anyone know if there is a better way I can check if a motif I found is good?

In case you're curious, the tool we made is called CIMS: https://github.com/nlabrad/CIMS-Cyanobacterial-ITS-motif-slicer. You can feed it a FASTA file and test it here: https://phylo.dev.

Thanks a bunch!

folding motif secondary cyanobacteria ITS • 726 views

updated 21 months ago by Mensur Dlakic ★ 28k • written 21 months ago by Nico ▴ 20

score 1 · Answer 1 · 2023-02-13

21 months ago

Mensur Dlakic ★ 28k

What you struggle with is not unique to your problem. Sequence motifs are only as good as the collection of sequences used to build them. Often one can find several borderline hits that all pass the eye test, and in your case they fold correctly as well. Most sequence-based tools aim for high specificity. Making them both highly sensitive and highly specific may not always be possible with fixed cut-offs.
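A small sketch of the cut-off trade-off, assuming you have a set of scored hits whose true status is already known. The scores and labels below are invented for illustration, and the function name is hypothetical.

```python
def sensitivity_specificity(scored_hits, cutoff):
    """scored_hits is a list of (score, is_real_motif) pairs."""
    tp = sum(1 for s, real in scored_hits if s >= cutoff and real)
    fn = sum(1 for s, real in scored_hits if s < cutoff and real)
    tn = sum(1 for s, real in scored_hits if s < cutoff and not real)
    fp = sum(1 for s, real in scored_hits if s >= cutoff and not real)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Made-up hits: real and false motifs overlap in the borderline score range.
HITS = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]

for cutoff in (0.85, 0.70, 0.50):
    sens, spec = sensitivity_specificity(HITS, cutoff)
    print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

With these toy numbers, the strict cut-off rejects every false hit but misses half of the real ones, and the lenient cut-off does the opposite. No single fixed threshold gets both right when the score distributions overlap.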

Whether that satisfies you or not, sometimes the only way to tell that a motif is real - or better yet, that it is biologically relevant - is to do the experiment.

written 21 months ago by Mensur Dlakic ★ 28k

Hello Mensur,

Thank you for taking the time to look into my question. Honestly this makes me feel a bit better.

I’ve been wondering whether building a machine learning model and training it with supervised learning on verified results would yield better results and make for a more flexible tool, but at this point that is something for the future.

I’m hoping that our tool can at least speed up the process of looking into these motifs for phylogeny research. I’ve seen what many people use to find the motifs manually, and it takes about 5-10 minutes per sequence, whereas this tool can run 7000+ sequences in 2 seconds.

If you work with cyanos and have any input on what could make this tool better or more trustworthy, we are more than open to criticism.

Thank you again!

reply written 21 months ago by Nico ▴ 20

"I’ve been wondering whether building a machine learning model and training it with supervised learning on verified results would yield better results and make for a more flexible tool, but at this point that is something for the future."

Yes, it would. If you don't mind a bit of shameless self-promotion on my part, an example of that can be found here. You will need a large database of trusted positive and negative matches, and a way to devise some discriminative features beyond simple motif scores. I wouldn't necessarily recommend SVMs, as there are more advanced classifiers, but the general idea described in that paper should be applicable to your task.
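As a rough sketch of that general idea (not the method from the paper), the snippet below trains a classifier on trusted positives and negatives described by two features. Everything here is an assumption for illustration: the feature names `motif_score` and `stem_fraction`, the training values, and the choice of a plain logistic regression written in pure Python, which simply stands in for whichever classifier you end up using.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by per-sample gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(model, x):
    """Probability that a candidate with features x is a real motif."""
    w, b = model
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical features per candidate: [motif_score, stem_fraction].
# The motif score alone does not separate the classes; the second feature does.
TRUSTED_POSITIVES = [[0.9, 0.9], [0.8, 1.0], [0.7, 0.8]]
TRUSTED_NEGATIVES = [[0.8, 0.1], [0.6, 0.2], [0.9, 0.0]]

X = TRUSTED_POSITIVES + TRUSTED_NEGATIVES
y = [1] * len(TRUSTED_POSITIVES) + [0] * len(TRUSTED_NEGATIVES)
model = train_logistic(X, y)

# Two new candidates with nearly the same motif score.
for candidate in ([0.8, 0.9], [0.75, 0.1]):
    print(candidate, round(predict_proba(model, candidate), 3))
```

The point of the toy data is that two candidates with almost identical motif scores can still be told apart once a second, independent feature is available. In practice the hard part is the trusted training set, not the classifier.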

reply written 21 months ago by Mensur Dlakic ★ 28k