Question

percent identities for all by all protein alignment

0

Entering edit mode

6.3 years ago

bioguy ▴ 50

Very simple question – I have a list of, say, 1000 protein sequences in a fasta file. I want to compute percent identities for each pair of sequences. So, in this example, it'd generated a 1000 by 1000 matrix reflected over the diagonal.

I've tried diamond and blastp, but even when I set -max-target-seqs to 0 or set the evalue/min percent identity to appropriately high/low values, they still don't report all possible alignments. There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

Any recommendations for how to get this done quickly? I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

Thanks in advance.

blast diamond sequence alignment • 5.2k views

ADD COMMENT • link updated 6.2 years ago by fishgolden ▴ 520 • written 6.3 years ago by bioguy ▴ 50

0

Entering edit mode

6.2 years ago

Joe 22k

Neither DIAMOND nor BLAST are the tool for this really.

Just iterate over your file in an exhaustive pairwise approach, and use a normal aligner like MAFFT or CLUSTAL which can give you a score.

I'd advise looking at ways to parallelize this as you have an n choose k == 499500 problem here.

ADD COMMENT • link 6.2 years ago by Joe 22k

1

Entering edit mode

Thank you – ended up going the CLUSTAL route with some basic multi-threading to speed things up. Very much appreciate your help!

ADD REPLY • link 6.2 years ago by bioguy ▴ 50

0

Entering edit mode

No problem ;)

ADD REPLY • link 6.2 years ago by Joe 22k

0

Entering edit mode

CLUSTAL and MAFFT are Multiple Sequence Alignment software that they perform stringent DP (edit: jrj.healey said CLUSTAL's alignment is a bit janky Its alignment may not be stringent one...) same as water and needle when they are used as pairwise aligner. I mean, I think CLUSTAL and MAFFT are over spec. But it is ok because the results would be the same when the same parameters were used, i guess.

ADD REPLY • link 6.2 years ago by fishgolden ▴ 520

1

Entering edit mode

Clustal begins by performing all vs all pairwise alignments anyway, but other threads on the site have shown that its pairwise alignments can be a bit janky. Using needle or water directly is likely a good approach too. In my experience, MAFFT gives better DNA alignments than does CLUSTAL anyway.

(Those were just the first 2 to come to mind, there are many other aligners to experiment with :) )

ADD REPLY • link 6.2 years ago by Joe 22k

0

Entering edit mode

but other threads on the site have shown that its pairwise alignments can be a bit janky.

Oh, really. I didn't know that. Thank you for the comment.

ADD REPLY • link 6.2 years ago by fishgolden ▴ 520

0

Entering edit mode

I'm mainly referring to this one: what is the problem with using clustal to do pairwise alignment?

But its far from a systematic evaluation, so YMMV.

ADD REPLY • link 6.2 years ago by Joe 22k

0

Entering edit mode

In Clustal Omega, the alignments are then computed using the very accurate HHalign package (Söding, 2005), which aligns two profile hidden Markov models (Eddy, 1998).

http://msb.embopress.org/content/7/1/539

It seems that Clustal Omega uses HHalign.

A distance is calculated between every pair of sequences and these are used to construct the dendrogram which guides the final multiple alignment. The scores are calculated from separate pairwise alignments. These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

ClustalW is using DP by default.

http://www.clustal.org/download/clustalw_help.txt

ADD REPLY • link 6.2 years ago by fishgolden ▴ 520

0

Entering edit mode

Ah ok, thank you. Will give the others you recommended a try as well if need be. Really appreciate all the help!

ADD REPLY • link 6.2 years ago by bioguy ▴ 50

score 2 · Accepted Answer · 2019-05-16

There is some sort of filtering still going on, it would appear. Hopefully I'm just missing some parameter?

I'd hard code it myself but I'd rather have one of these algorithms do it for me, as they'll presumably be faster at full scale.

I think this is really difficult task since the reason why those programs (at least BLAST) are fast is that they have some filtering steps.

In my opinion, one of the critical filters still remaining in your protocol with BLAST is that the "word size" & I don't know about DIAMOND very much, sorry.

Anyway, I recommend water or needle in EMBOSS package http://emboss.sourceforge.net/ . Or SSEARCH, or GGSEARCH in FASTA package https://github.com/wrpearson/fasta36 for your purpose.