Question

Any alignment programs that accept regular expressions within input sequences?

0

6.9 years ago

Ghoti ▴ 90

I've been working with a pseudo-consensus amino acid sequence in Geneious ("pseudo" because sampling bias means it is not necessarily an accurate representation of the population). Creating the consensus requires a conservation threshold: if the proportion of matching residues at a given position meets or exceeds the threshold, that residue is kept in the consensus; otherwise the position is listed as unknown. It would better suit my purposes if I could instead lower the threshold and list every residue that passes it at a given position, written as a regular-expression character class. Example:

Sequence 1 = ABCABC

Sequence 2 = BBCDAA

Sequence 3 = BBACAC

Sequence 4 = ACAABC

Output with required threshold at >=75% and no regular expressions = XBXXXC

Output with required threshold at >=50% and regular expressions allowed = [AB]B[CA]A[AB]C
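For illustration, the thresholding in this example can be sketched in a few lines of Python. This is only a minimal sketch, not how Geneious computes its consensus: the function name `regex_consensus`, its parameters, and the use of `X` for unknown positions are assumptions made for the example, and the input is assumed to be already aligned and gap-free.

```python
import re
from collections import Counter

def regex_consensus(seqs, threshold=0.5, unknown="X"):
    """Keep every residue whose column frequency meets the threshold.

    Hypothetical helper: one surviving residue is written as-is, several
    become a [..] character class, and none becomes the unknown symbol.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    out = []
    for column in zip(*seqs):
        counts = Counter(column)
        # Sort survivors alphabetically so the output is deterministic.
        kept = sorted(r for r, n in counts.items() if n / len(seqs) >= threshold)
        if not kept:
            out.append(unknown)
        elif len(kept) == 1:
            out.append(kept[0])
        else:
            out.append("[" + "".join(kept) + "]")
    return "".join(out)

seqs = ["ABCABC", "BBCDAA", "BBACAC", "ACAABC"]
strict = regex_consensus(seqs, threshold=0.75)   # no position can hold two residues
lenient = regex_consensus(seqs, threshold=0.5)   # ties at 50% become classes

# The lenient consensus is a valid regular expression for ungapped matching.
matches_first = re.fullmatch(lenient, seqs[0]) is not None
```

The members of each class are sorted here, so the third position comes out as `[AC]` rather than `[CA]`; the two are equivalent as regular expressions. Note that a literal `X` only matches the letter X, so it would have to be replaced with `.` before a strict consensus could be used as a pattern.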

My thinking is that this would be more accommodating for alignment scoring, similar to how scoring matrices score residue pairs by similarity/dissimilarity. I'm trying to optimize alignments of sequences from single-stranded RNA viruses with high rates of mutation and recombination. I'm also more interested in unique sequences/residues (which are harder to match/align) than in the prevalence of recurring residues. This brings me to my question: are there any alignment programs that accept regular expressions as input?

Edit: I feel it's necessary to emphasize that the odds of coincidentally similar regions within "my" genome are low because of its size (15 kb in total length).

alignment regular expression • 1.5k views

6.9 years ago by Ghoti ▴ 90

1

Why not make an HMM profile instead?

6.9 years ago by Joe 21k

0

Thanks for the suggestion. A profile HMM is basically what I described, but after reading up on it, I worry it will be too computationally demanding. Perhaps I'll adjust the scoring matrix instead. MAFFT in particular allows a user-defined 248x248 matrix (in alpha testing), which could use unique characters to represent paired residues at a given position within an aligned consensus.

https://mafft.cbrc.jp/alignment/software/textcomparison.html#userdefinedmatrix
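To make the idea concrete, here is a rough Python sketch of how paired residues could be given their own characters and scored. Everything in it is an assumption for illustration: the toy alphabet `ABCD`, the `base_score` values, the choice to average over set members, and the helper names. It does not write MAFFT's matrix file format. For scale, 20 amino acids plus all 190 unordered pairs is 210 symbols, which would fit inside a 248-character alphabet.

```python
from itertools import combinations, product

def base_score(a, b):
    """Hypothetical match/mismatch scores standing in for a real matrix."""
    return 2.0 if a == b else -1.0

def build_symbols(residues, spare):
    """Map each single residue and each unordered pair to one character."""
    pairs = list(combinations(residues, 2))
    if len(pairs) > len(spare):
        raise ValueError("not enough spare characters for every residue pair")
    symbols = {frozenset(r): r for r in residues}
    symbols.update({frozenset(p): s for p, s in zip(pairs, spare)})
    return symbols

def set_score(set_a, set_b):
    """Average the base score over every combination of members.

    Averaging is an assumption; taking the maximum is another option.
    """
    scores = [base_score(a, b) for a, b in product(set_a, set_b)]
    return sum(scores) / len(scores)

def build_matrix(symbols):
    """Score every symbol against every other symbol."""
    return {(sa, sb): set_score(a, b)
            for a, sa in symbols.items()
            for b, sb in symbols.items()}

symbols = build_symbols("ABCD", "abcdef")   # toy alphabet, 6 pair symbols
matrix = build_matrix(symbols)
ab = symbols[frozenset("AB")]               # the character standing for [AB]
```

With these toy scores, a position encoded as `[AB]` scores 0.5 against a plain `A` (halfway between a match and a mismatch) and -1.0 against `C`.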

6.8 years ago by Ghoti ▴ 90

0

What makes you think that? Maybe I missed it, but what is the size of the dataset you're working on (sequence length and number of sequences)?

6.8 years ago by Joe 21k

0

I need to perform tens of thousands of pairwise alignments in a few hours. Thus far I've been using MAFFT localpair (which computes pairwise alignments with the Smith-Waterman algorithm). To save computation time, I thought a simple change to the scoring matrix would be sufficient. However, the required MAFFT version is in alpha, and I'm not sure it's publicly available. I also thought that updating MAFFT could resolve an issue with terminal gaps (which I've been discussing with you).

I was overly dismissive of the profile HMM method. I'll try to find a tool that can be run from Python.
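As a rough idea of what a profile looks like in Python, here is a minimal sketch of a position-specific scoring matrix, which is a much simpler relative of a profile HMM (no insert or delete states, so it only scores ungapped windows). The helper names, the uniform background, the pseudocount of 1, and the toy alphabet are all assumptions for illustration; a dedicated profile HMM package would be the better choice for real data.

```python
import math
from collections import Counter

def build_profile(seqs, alphabet, pseudocount=1.0):
    """Per-column log-odds scores against a uniform background.

    Hypothetical helper; assumes aligned, gap-free input sequences.
    """
    background = 1.0 / len(alphabet)
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in alphabet
        })
    return profile

def best_window(profile, target):
    """Slide the profile along target without gaps; return (score, start).

    Assumes every character of target is in the profile's alphabet.
    """
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(target) - width + 1):
        window = target[start:start + width]
        score = sum(col[res] for col, res in zip(profile, window))
        best = max(best, (score, start))
    return best

seqs = ["ABCABC", "BBCDAA", "BBACAC", "ACAABC"]
profile = build_profile(seqs, alphabet="ABCD")
score, start = best_window(profile, "DDABCABCDD")
```

Each target position costs one dictionary lookup per profile column, so scoring is linear in the target length times the profile width.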

6.8 years ago by Ghoti ▴ 90