Is anyone aware of any tool that is able to perform error-tolerant pattern-matching search on protein FASTA files?
This is a duplicate of my question from bioinformatics.stackexchange.
For example, I want to know, which proteins in my fasta file to match ADNG..C.G
regexp (which represents ADNGCG
pattern). However, I want to be tolerant to matching errors, meaning that I'm good with any protein that differs in any 1 letter in the motif: ADNG..C.D
, ADNG..D.G
, MDNG..C.G
etc are all good. Running all possible variants through grep is possible, but exponentially long for longer patterns (I usually have 15-20 letters in pattern, and scan ~10e6-7
sequences).
I am aware of the tool agrep
(docs). As far as I know, it's not supported anymore, and also does not distinguish letters in which it's allowed to make error. Also, it does not support long enough patterns (with more than 9 errors -- and yes, I tried to recompile it with myself from here and could not get it work). Also, it's not designed for proteins specifically.
I am also aware of ScanProsite and Protein Pattern Find services. The first one is not error-prone, the second one is seems to be a web interface for grep.
I'd recommend against cross-posting. You've mentioned the cross-post here, I'd recommend pasting a link to this post on Bioinfo SE as well.
Done it, thanks for suggestion. And sorry for cross-posting indeed -- I have very little idea about SE and biostars audience overlap. What would be the correct way to do that?
Choose one site and post there only.
Will go for biostars next time -- feels like several times more alive platform than SE.
For bioinformatics, that's the smart choice ;)
Are these DNA or protein sequences? You say protein but all the examples look to be DNA.
Unless I don't know something, sequences I mention are protein -- e.g. they have M, D aminoacids, which I assume are not possible in DNA sequences.
But generally, it shouldn't matter much.
I saw mostly AGC so I assumed DNA, possibly with degenerate characters.
As you say, it won't matter really, it just changes the program you might use.
What is the origin of the patterns of interest? Can't you create HMM profiles of these patterns?
You say ScanProsite is not "error-prone" (by "not error prone", I understand you mean do not allow mismatches in the pattern), I think your assertion is not completely correct. If you want to give a pattern (e.g. P-x(2)-G-E-S-G(2)-[AS]), indeed the specified amino acids have to be present in the pattern. However, ScanProsite also accepts generalized profiles, which will allow for mismatches between target sequence and pattern.
I am basically looking for certain structural features on several different transmembrane helixes in a family.
Well, I could, but I don't know how well are they working with long deletions (which in my case represent long loops between helixes).
Thanks for pointing that out, I indeed have not read the documentation careful enough!
Could you share the generalized pattern documentation? I now see that this one is down, although the official prosite documentation points there.
There is https://pypi.org/project/regex/ which allows fuzzy matching where you can specify how many "errors" you accept and what type of errors you accept.
Was my first shot (forgot to mention in the original post). Terribly slow :)
Also, could not figure out how to make mismatches work exactly the way I want.
try seqkit @ marin
Seems to also fit, but the code explicitly warns me that more than 4 mismatches would really slow down the process -- but I'll test it anyway, thanks.
Also, are you aware of the compexity of algorithm that they use? They do not say that explicitly, and I'm not really proficient with Go to find it myself.
@ marinegor author shenwei356 regular visitor here.