Hello,
I have different sets of sequences from different sources: around 20 fasta files, each corresponding to one source and each containing around 1000 sequences.
I'm interested in identifying sequences that are similar and appear in more than one fasta file. In other words, I might find that sequence A appears in all 20 fasta files, sequence B appears in only 10 fasta files, and sequence C appears in just 2 fasta files.
Are there any tools that could do this? If not, any ideas how to tackle this problem in an efficient way?
Thanks,
The sequences are not identical. Do you recommend setting a similarity threshold above which 2 sequences are treated as the same?
Perhaps calculate the Levenshtein distance between each pair of sequences to build a distance matrix (or apply another distance metric). You might then apply a threshold whose stringency depends on the spread of distances in your matrix. If your population of strings is similar, the pool of distances will have low values and you'd perhaps want a stringent threshold. If the strings are disparate, the pool of distances will have larger values, and a relaxed threshold could be applied to decide similarity. (BLAST will probably be the most efficient approach, not least because there are so many BLAST services out there to do it quickly.)
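To make the distance-plus-threshold idea concrete, here is a rough Python sketch, not a tested pipeline. It assumes the sequences have already been read into a list of (source file, sequence) pairs. The function names, the normalisation by the longer sequence's length, the single-linkage grouping, and the 0.2 cutoff are all my own illustrative choices, so tune or replace them for your data.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so values fall in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def group_by_threshold(records, max_distance):
    """Single-linkage grouping: any pair within max_distance ends up in one group.

    records is a list of (source, sequence) tuples (an assumed input layout).
    Returns a list of groups, each a list of indices into records.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if normalized_distance(records[i][1], records[j][1]) <= max_distance:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def sources_per_group(records, max_distance):
    """For each group of similar sequences, list the distinct sources it occurs in."""
    return [sorted({records[i][0] for i in group})
            for group in group_by_threshold(records, max_distance)]


# Toy data standing in for sequences parsed from three fasta files.
records = [
    ("a.fasta", "ACGTACGT"), ("b.fasta", "ACGTACGA"), ("c.fasta", "ACGAACGT"),
    ("a.fasta", "TTTTGGGG"), ("b.fasta", "TTTTGGGC"), ("c.fasta", "CCCCCCCC"),
]
for sources in sources_per_group(records, max_distance=0.2):
    print(len(sources), sources)
```

Counting the distinct sources per group gives the "appears in N fasta files" number you asked about. Bear in mind that all-against-all comparison of roughly 20,000 sequences is on the order of 200 million pairs, which is likely too slow in pure Python; that is a large part of why BLAST is probably the more practical route at that scale.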