Given the size of your sequences I would avoid BLAST for this, since it tends to behave badly with short peptide sequences, and would instead look at FASTA, Smith-Waterman or Needleman-Wunsch. Implementations of all of these methods are available in the FASTA suite as 'fasta', 'ssearch' and 'ggsearch'; there is also 'lalign', which implements the Waterman-Eggert modification of Smith-Waterman. Which method you want depends on the nature of your sequences. FASTA and Smith-Waterman produce local sequence alignments, and so will find the highest scoring regions of similarity between two sequences. In contrast, Needleman-Wunsch gives global (end-to-end) sequence alignments, and so finds the highest scoring pairing over the full length of both sequences. In GGSEARCH, alignments are only performed if the subject is within 80% to 120% of the length of the query sequence, since anything outside this range tends to give statistically insignificant global alignments.
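For example, a minimal sketch comparing the same pair of sequences both ways (seq_1.tfa and seq_2.tfa are placeholder files):

ssearch36 -q -p seq_1.tfa seq_2.tfa   # local alignment (Smith-Waterman)
ggsearch36 -q -p seq_1.tfa seq_2.tfa  # global alignment (Needleman-Wunsch)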
'lalign' allows you to run a complete all-against-all with something like:
lalign -q -p -O output.lalign pepSeqSet.tfa pepSeqSet.tfa
This runs serially, which given your number of sequences will likely take longer than you would like. To get some parallelism, I would split the set into chunks and use each chunk as the query set against the complete set as the target:
lalign -q -p -O pepSeqSet_1.lalign pepSeqSet_1.tfa pepSeqSet.tfa
lalign -q -p -O pepSeqSet_2.lalign pepSeqSet_2.tfa pepSeqSet.tfa
And collate the results when doing the parsing to extract the relevant information.
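As a rough sketch of how the chunking might be scripted (the chunk count of 10 and the file names are assumptions; adapt them to your environment):

# Split pepSeqSet.tfa into 10 chunks, distributing sequences round-robin.
awk -v n=10 '/^>/ { f = sprintf("pepSeqSet_%d.tfa", (c++ % n) + 1) } { print > f }' pepSeqSet.tfa
# Run each chunk against the complete set in parallel, then collate.
for i in $(seq 1 10); do
    lalign -q -p -O pepSeqSet_${i}.lalign pepSeqSet_${i}.tfa pepSeqSet.tfa &
done
wait
cat pepSeqSet_*.lalign > collated.lalign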
Since 'fasta', 'ssearch' and 'ggsearch' are designed as database search programs, they use thresholds to limit the output to statistically significant matches. Because in your case you want all the pairwise scores, the traditional approach with these tools would be to run each sequence pair explicitly, with a high E-value threshold specified to allow poor matches through, for example:
ssearch36 -q -p -E 1000 -O seq_1_vs_seq_1.ssearch seq_1.tfa seq_1.tfa
ssearch36 -q -p -E 1000 -O seq_1_vs_seq_2.ssearch seq_1.tfa seq_2.tfa
ssearch36 -q -p -E 1000 -O seq_1_vs_seq_3.ssearch seq_1.tfa seq_3.tfa
However, the FASTA tools support options that control the amount of output, letting you override the threshold behaviour and state explicitly how many matches you require per database search. For example, with a database of 100 entries I can specify that 100 matches should be reported for each query, regardless of their statistical significance:
ssearch36 -q -p -b '=100' -d '=100' -O pepSeqSet.ssearch pepSeqSet.tfa pepSeqSet.tfa
The -b and -d options control the number of matches in the summary score table and the number of alignments reported, respectively. The '=' prefix requests that this number of scores/alignments be reported regardless of the E-value threshold.
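Since the two options are independent you could, for example, report all 100 scores but only the 10 best alignments (a hypothetical variant of the command above):

ssearch36 -q -p -b '=100' -d '=10' -O pepSeqSet.ssearch pepSeqSet.tfa pepSeqSet.tfa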
From the question it sounds like the data set is still growing, so you will want to store the relevant information from the alignment results so that alignments only need to be run for the new sequences (i.e. new sequences vs. existing sequences and new sequences vs. new sequences). In the FASTA tools you can use the '-m 9C' option to obtain a scores table with some additional information, including the alignment encoded as a CIGAR string; this information can be used to recreate the original alignment if required. If using the '-m 9C' output you may want to disable output of the full alignments by using '-d 0'.
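A minimal sketch of such an incremental update (all file names here are placeholders); searching the new sequences against the combined set covers both the new-vs-existing and new-vs-new pairs:

# Combine the existing and new sequences, then search only the new
# sequences against the combined set, appending to the stored scores.
cat existingSeqSet.tfa newSeqSet.tfa > combinedSeqSet.tfa
ssearch36 -q -p -m 9C -d '0' -E 1000 newSeqSet.tfa combinedSeqSet.tfa >> storedScores.ssearch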
Looking through the documentation for the current FASTA release (and asking Bill a couple of questions), I found that there is a special value for '-b' which will report all the matches retained by the programs (currently 60,000 by default):
ssearch36 -q -p -b '$' -d '0' -m 9C pepSeqSet.tfa pepSeqSet.tfa > pepSeqSet.ssearch
This will give a -m 9C table for each query sequence containing up to 60,000 of the best matches. Given that this is a good proportion of the size of your data set, this could be a viable method for getting the best scores.
While similar tools are available in the EMBOSS suite (e.g. needle, needleall, stretcher, water and matcher), the FASTA suite implementations take advantage of accelerations available on modern processors (e.g. vectorised Smith-Waterman) that make them significantly faster. So for the volume of sequences involved, using the FASTA suite tools makes more sense.
Update: estimating run time
You are understandably concerned about how long this would take for the 100,000 sequences. So let's perform some tests to estimate how long it would take to perform a complete all-against-all for 100,000 15aa sequences.
First we need a database; for simplicity I'll generate one containing 100,000 random sequences of 15aa using the EMBOSS tool 'makeprotseq':
makeprotseq -auto -amount 100000 -length 15 -out prot_test_100000_15aa.tfa
For the purposes of estimating I don't want to run a complete analysis, so I'll use the first 100 sequences from the database as the query set (here I'm exploiting the fact that in the 'makeprotseq' output a 15aa sequence occupies exactly two lines in FASTA format):
head -200 prot_test_100000_15aa.tfa > prot_test_100_15aa.tfa
Now I can perform a sample search and time it:
time ssearch36 -q -p -m 9C -b '$' -d '0' prot_test_100_15aa.tfa prot_test_100000_15aa.tfa > test_2.ssearch
For me this 100 sequences vs. 100,000 sequences set of searches takes: 3:12.89. This is 192.89s for 100 searches, so a single search takes approx. 1.928s. Thus for the full set of 100,000 searches it would take 192,890s => 53.6hr => a little over two days.
Chunking the query set and/or the database so that the individual chunks can be processed in parallel reduces the elapsed time. For example, if you split the work into 100 chunks and ran them all simultaneously, this analysis would potentially take:
53.6 hr / 100 processes = 0.536 hr
Obviously this depends on the compute resources available to you; in this case I'm using a fairly average server system. For contrast, a newer, more powerful system completes the benchmark search set in 1:15.64 (75.64s, i.e. approx. 0.756s per search), which means the whole set would take only 75,640s => 21hr when run serially.
This produces the complete all-against-all scoring matrix. However, if you only need the alignment score, you can use the fact that seq_1 vs. seq_2 gives the same score as seq_2 vs. seq_1, so only a single comparison per pair is required. The number of alignments is then not the full square matrix:
100,000 x 100,000 = 10,000,000,000 alignments
but is instead the triangular matrix (see triangular number):
(100,000 * (100,000 + 1)) / 2 = 5,000,050,000 alignments
Adjusting the database for each and every search is very likely to be more costly than performing some additional comparisons; however, when chunking both the database and the queries, this symmetry can be taken into account to eliminate some chunk-vs-chunk searches.
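A rough sketch of how this might look when both sides are chunked (the chunk count and file names are assumptions; note that within a chunk both directions of each pair are still computed):

# Only run chunk i vs. chunk j for j >= i; the j < i results are
# recovered from the symmetric search already performed.
n=10
for i in $(seq 1 $n); do
    for j in $(seq $i $n); do
        ssearch36 -q -p -m 9C -b '$' -d '0' pepSeqSet_${i}.tfa pepSeqSet_${j}.tfa > pepSeqSet_${i}_vs_${j}.ssearch
    done
done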
For what it's worth this database is actually quite small: you are only talking about 100,000 sequences containing up to 1,500,000 residues. The UniProtKB/SwissProt database currently contains 534,695 entries comprising 189,667,883 residues, and it is much smaller than the full UniProtKB database (approx. 20.5 million sequences comprising 6.8 billion residues). Projects such as SIMAP, Pairs DB and ProteinWorldDB are based around all-against-all sequence comparisons for larger databases.
In a follow-up you clarify: "I should clarify that I only want pairs of sequences from the set that have a similarity score above a certain threshold. And I want to avoid doing an all by all comparison to find these pairs."
To determine the scores so you can apply a threshold, you will have to perform the actual comparisons. As noted in my answer, there are ways of reducing the number of comparisons required, which may help, and depending on the available compute, performing the comparisons may not take too long.
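As a final illustrative sketch of the thresholding step: once you have the -m 9C score tables, a simple filter can keep only the pairs scoring above your cutoff. The score column index below is an assumption (the -m 9 table layout varies between FASTA versions), so verify it against your own output first:

# Hypothetical filter over the tab-separated -m 9C fields: keep rows
# whose chosen column (col) meets the threshold (min). Check which
# column holds the score in your FASTA version before relying on this.
awk -F'\t' -v col=4 -v min=50 'NF > 1 && $col + 0 >= min' pepSeqSet.ssearch > pepSeqSet_filtered.tsv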