All Pairs Similarity Search Under A Substitution Matrix For A Large Number Of Short Protein Sequences
2
2
Entering edit mode
12.9 years ago
Harish Tella ▴ 20

I'm looking for a way to find similar pairs of protein sequences within a large set. (10,000-100,000 sequences, and maybe more) The sequence lengths are between 8-15 amino acids.

I would like similarity to be defined as score under a substituion matrix. It would be great to be able to handle pairs of different lengths as well.


I've come accross SlideSort. It can do all pairs similarity search for millions of sequences but only in terms of edit distance not score under a substitution matrix.


I've also come accross a few papers that give a method for creating a hamming space embedding of protein sequences under a substitution matrix:

Detecting protein sequence conservation via metric embeddings

Provably sensitive indexing strategies for biosequence similarity search.

Efficient large-scale sequence comparison by locality-sensitive hashing

The embedding is a mapping from protein sequence to bit vector such that the hamming distance between two bit vectors is related to the score of the two protein sequences that map to those two bit vectors.

However these bit vectors representations can be very large and are far apart in distance. So in order to find pairs that are similar the author creates a filter by sampling a few bits (let's say 100) from random positions in the hamming space embedding of all the sequences. Those sequences that have the same sample bits get binned together and get thoroughly compared using the substituion matrix.

The filter lets in some false positives and screens out some false negatives. But its not a problem because the false positives get screened when you do the expensive comparison using the substitution matrix. And the false negative rate can be controlled with the number of iterations and the number of sample bits.

The down side of this approach is that I can only compare sequences of equal length. And its not an exhuastive search.


Thanks a megabase

protein similarity • 6.6k views
ADD COMMENT
0
Entering edit mode

I should clarify that I only want pairs of sequences from the set that have a similarity score above a certain threshold. And I want to avoid doing an all by all comparison to find these pairs.

ADD REPLY
0
Entering edit mode

To determine the scores so you can apply a threshold, you will have to perform the actual comparisons. As noted in my answer there are ways of reducing the number of comparisons that are required, which may help, and depending on the available compute performing the comparisons may not take too long.

ADD REPLY
4
Entering edit mode
12.9 years ago
Hamish ★ 3.3k

This sounds like you want to perform an all-against-all comparison (also known as all-vs-all). Using the more generic pairwise sequence alignment tools will address the size issue and the scoring matrix issue.

As I first step I recommend identifying identical sequences and ensuring each unique sequence occurs only once when performing the alignment steps. Since the alignments you get a deterministic there is no point in repeating the same alignment multiple times. See "How to remove the same sequences in the FASTA files?" for details of various methods to perform this type of identification and identity collapse/removal.

Given the size of your sequences I would avoid BLAST for this, it tends to behave badly with sort peptide sequences, and would instead look at FASTA, Smith-Waterman or Needleman-Wunsch. Implementations of all of these methods are available in the FASTA suite as 'fasta', 'ssearch' and 'ggsearch' there is also 'lalign' which implements the Waterman-Eggert modification of Smith-Waterman. Which method you want to use depends on the nature of your sequences, FASTA and Smith-Waterman are local sequence alignments and so will find the highest scoring regions of similarity between two sequences. In contrast Needleman-Wunsch gives global (end-to-end) sequence alignments and so finds the highest scoring pairing given a full length alignment, in GGSEARCH alignments are only performed if the subject is within 80% to 120% of the length of the query sequence since more than this tends to give statistically insignificant global alignments.

lalign allows you to run a complete all-against-all with something like:

lalign -q -p -O output.lalign pepSeqSet.tfa pepSeqSet.tfa

This runs linearly, which given your number of sequences will likely take a bit longer than you would like. So to get a bit of parallel processing, I would split the set into chunks and use each chunk as the query set against the complete set as the target:

lalign -q -p -O pepSeqSet_1.lalign pepSeqSet_1.tfa pepSeqSet.tfa
lalign -q -p -O pepSeqSet_2.lalign pepSeqSet_2.tfa pepSeqSet.tfa

And collate the results when doing the parsing to extract the relevant information.

Since 'fasta', 'ssearch' and 'ggsearch' are designed as database search programs, they use thresholds to limit the output to statistically significant matches. Since in your case you want all the pairwise scores this means that the traditional approach would be to run each sequence pair explicitly when using these tools with a high E-value threshold specified to allow poor matches through, for example:

ssearch36 -q -p -E 1000 -O seq_1_vs_seq_1.ssearch seq_1.tfa seq_1.tfa
ssearch36 -q -p -E 1000 -O seq_1_vs_seq_2.ssearch seq_1.tfa seq_2.tfa
ssearch36 -q -p -E 1000 -O seq_1_vs_seq_3.ssearch seq_1.tfa seq_3.tfa

However the FASTA tools support options to control the amount of output that lets you override the threshold behavior and explicitly state how many matches you require per-database search. For example with a database of 100 entries I can specify that 100 matches should be reported for each query, regardless of their statistical significance:

ssearch36 -q -p -b '=100' -d '=100' -O pepSeqSet.ssearch pepSeqSet.tfa pepSeqSet.tfa

The -b and -d options control the number of matches in the summary score table and the number of alignments reported respectively. The use of the '=' prefix requests that this number of scores/alignments be reported regardless of the E-value threshold.

From the question it sounds like the data set is still growing, so storing the relevant information from the alignment results, so you only need to run alignments for the new sequences (i.e. new sequences vs. existing sequences and new sequences vs. new sequences) is a requirement. In the FASTA tools you can use the '-m 9C' option to obtain a scores table with some additional information, including the alignment encoded as a CIGAR string. This information can be used to recreate the original alignment if required. If using the '-m 9C' output you may want to disable output of the full alignments by using '-d 0'.

Looking through the documentation for the current FASTA release (and asking Bill a couple of questions), there is a special value for '-b' which will report all the matches retained by the programs (currently 60,000 by default):

ssearch36 -q -p -b '$' -d '0' -m 9C pepSeqSet.tfa pepSeqSet.tfa > pepSeqSet.ssearch

Will give a -m 9C table for each query sequence which contains up to 60,000 of the best matches. Given this a a good proportion of the size of your data set, this could be a viable method for getting the best scores.

While similar tools are available in the EMBOSS suite (e.g. needle, needleall, stretcher, water and matcher), the FASTA suite implementations use accelerations on modern processors that make them significantly faster. So for the volume of sequences involved using the FASTA suite tools makes more sense.

Update: estimating run time

You are understandably concerned about how long this would take for the 100,000 sequences. So let's perform some tests to estimate how long it would take to perform a complete all-against-all for 100,000 15aa sequences.

First we need a database, for simplicity I'll generate one using the EMBOSS tool makeprotseq which contains 100,000 random sequences of 15aa:

makeprotseq -auto -amount 100000 -length 15 -out prot_test_100000_15aa.tfa

For the purposes of estimating I don't want to run a complete analysis, so I'll use the first 100 sequences from the database as the query set (in this case I'm utilising the fact that I know that in the 'makeprotseq' output a 15aa sequence only occupies 2 lines in fasta sequence format):

head -200 prot_test_100000_15aa.tfa > prot_test_100_15aa.tfa

Now I can perform a sample search and time it:

time ssearch36 -q -p -m 9C -b '$' -d '0' prot_test_100_15aa.tfa prot_test_100000_15aa.tfa > test_2.ssearch

For me this 100 sequences vs. 100,000 sequences set of searches takes: 3:12.89. This is 192.89s for 100 searches, so a single search takes approx. 1.928s. Thus for the full set of 100,000 searches it would take 192,890s => 53.6hr => a little over two days.

Chunking the query set and/or the database so the individual chunks can be run in parallel allows the elapsed time to be reduced. So for example if you ran 100 chunks and ran them in an environment where they were run simultaneously this analysis would potentially take:

53.6 hr / 100 processes = 0.536 hr

Obviously this depends on the compute resources available to you, in this case I'm using a fairly average server system. For contrast a more powerful new system completes the benchmark search set in 1:15.64 which means the whole set would take only 75,640s => 21hr when run serially.

This produces the complete all-against-all scoring matrix, however if you only need the alignment score you can use the fact that since seq_1 vs. seq_2 gives the same score as seq_2 vs. seq_1 only a single comparison per-pair is required. So the number of alignments is not the square matrix:

100,000 x 100,000 = 10,000,000,000 alignments

but is instead the triangular matrix (see triangular number):

(100,000 * (100,000 + 1)) / 2 = 5,000,050,000 alignments

Adjusting the database for each and every search is very likely to be more costly than performing some additional comparisons, however when chunking both the database and the queries this can be taken into account to eliminated some chunks.

For what it's worth this database is actually quite small, you are only talking 100,000 sequences containing up to 1,500,000 residues. The UniProtKB/SwissProt database currently contains 534,695 entries containing 189,667,883 residues and it is much smaller than the full UniProtKB database (approx. 20.5 million sequences comprising 6.8 billion residues). Projects such as SIMAP, Pairs DB and ProteinWorldDB are based around all-against-all sequence comparisons for larger databases.

ADD COMMENT
0
Entering edit mode

+1 for mentioning ssearch.

ADD REPLY
0
Entering edit mode

Thanks, but I should have mentioned I want to avoid doing an all by all comparison. Doing (100,000 x 100,000) alignments would take a really long time -- days, I would think.

ADD REPLY
0
Entering edit mode
12.9 years ago
Harish Tella ▴ 20

I've found a method and implementation that already solves this problem. Using it I can quickly find similar pairs without doing an all by all comparison.

Check out these papers:

Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

PFMFind: a system for discovery of peptide homology and function

The code for PFMFind is available at: http://git.stojmirovic.org/PFMFind.git

Note from Dr. Stojmirovic:

At this stage, I am not sure if you only need the indexing scheme (FSindex, written in C) or the entire PFMFind, which contains a GUI, a database schema and a protocol for communicating between the components. The database backend is Postgresql (I think it would be relatively easy to add some code for sqlite3 these days) and setting up the entire system can be tedious. On the other hand, constructing FSindexes is quite easy - all you need is a fasta file and a desired alphabet partition.

You only need the FSIndex and it's python bindings to solve the original problem.

There is also a way to allow a single gap when searching the index. Message me for more info.

Some instructions for getting going:

git clone http://git.stojmirovic.org/PFMFind.git

As soon as you clone it, you need to switch to the modern-deps branch

cd PFMFind

git checkout modern-deps

This repository contains all my code for FSindex/PFMFind. To construct the distribution for the Python package, you must edit the Makefile in the FSindex directory to set the PYDIR variable - it should point to the include directory containing your Python headers. Then do

make swig

and then change into FSindex/src and create a distribution with

cd FSindex/src

python setup.py sdist

Alternatively, you can copy (or symlink) FSindex/src/swig/FS.so into FSindex/src/ShortFrags/Expt and place FSindex/src/ShortFrags into your Python path.

I am assuming you are using Unix and you know how to install and work with Python packages.

The executable scripts are in FSindex/src/ShortFrags/scripts - I think the docs could still be useful on how to use them.

As for the Python packages that are dependences, here is what I'm using Pmw==1.3.2 biopython==1.58 numpy==1.6.1 psycopg2==2.4.4 scipy==0.9.0

ADD COMMENT

Login before adding your answer.

Traffic: 3871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6