I've been working with a pseudo-consensus amino acid sequence in Geneious (pseudo in the sense that, because of sampling bias, it is not necessarily an accurate representation of the population). To create the consensus sequence, a conservation threshold must be set. If the proportion of matching residues at a given position meets or exceeds the threshold, that residue is kept in the consensus. Otherwise, the position is listed as unknown. It would better suit my purposes if I could instead use a lower, more lenient threshold and list every residue that meets it at a given position, written as a regular expression. Example:
Sequence 1 = ABCABC
Sequence 2 = BBCDAA
Sequence 3 = BBACAC
Sequence 4 = ACAABC
Output with required threshold at >=75% and no regular expressions = XBXXXC
Output with required threshold at >=50% and regular expressions allowed = [AB]B[CA]A[AB]C
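For illustration, the thresholding described above can be sketched in a few lines of Python. This is a minimal sketch, not what Geneious does internally: the function name `consensus_pattern` and its parameters are made up for this example, it assumes the sequences are already aligned to equal length, and it sorts residues alphabetically inside each character class (so it prints `[AC]` where the example above writes `[CA]`; the two are equivalent as regular expressions).

```python
from collections import Counter

def consensus_pattern(seqs, threshold=0.5, unknown="X"):
    """Build a regex-style consensus from equal-length aligned sequences.

    Hypothetical helper for illustration only.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n = len(seqs)
    parts = []
    for column in zip(*seqs):
        counts = Counter(column)
        # Keep every residue whose frequency meets the threshold.
        kept = sorted(res for res, c in counts.items() if c / n >= threshold)
        if not kept:
            parts.append(unknown)          # nothing is conserved enough
        elif len(kept) == 1:
            parts.append(kept[0])          # single conserved residue
        else:
            parts.append("[" + "".join(kept) + "]")  # character class
    return "".join(parts)

seqs = ["ABCABC", "BBCDAA", "BBACAC", "ACAABC"]
print(consensus_pattern(seqs, threshold=0.75))  # XBXXXC
print(consensus_pattern(seqs, threshold=0.5))   # [AB]B[AC]A[AB]C
```

If the result is to be used as an actual regular expression, passing `unknown="."` would make unknown positions match any residue rather than a literal X.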
My thought is that this would make alignment scoring more accommodating, much as scoring matrices score residues by similarity/dissimilarity. I'm trying to optimize alignments of sequences originating from single-stranded RNA viruses with high rates of mutation and recombination. I'm also more interested in unique sequences/residues (harder to match/align) than in the prevalence of recurring residues. This brings me to my question: Are there any alignment programs that accept regular expressions as input?
Edit: I feel it's necessary to emphasize that the odds of coincidental homologous/similar regions within "my" genome are low due to its size (15 kb in total length).
Why not make an HMM profile instead?
Thanks for the suggestion. A profile HMM is basically what I described, but after reading up on it, I worry it will be too computationally demanding. Perhaps I'll adjust the scoring matrix instead. MAFFT in particular allows for a user-defined 248x248 matrix (in alpha testing), which could use unique characters to represent paired residues at a given position within an aligned consensus.
https://mafft.cbrc.jp/alignment/software/textcomparison.html#userdefinedmatrix
What makes you think that? Maybe I missed it, but what is the size of the dataset you're working on (length and number of sequences)?
I need to perform tens of thousands of pairwise alignments in a few hours. Thus far I've been using MAFFT localpair (which computes alignments with the Smith-Waterman algorithm). To save computation time, I thought that a simple change to the scoring matrix would be sufficient. However, the required MAFFT version is in alpha and I'm not sure if it's accessible to the public. I also thought that updating MAFFT could resolve an issue with terminal gaps (which I've been discussing with you).
I was overly dismissive of the Profile HMM method. I'll attempt to find a tool that can be deployed through Python.