Hi all,
I'm trying to blast two sets of proteins against each other to find similarities.
I'm using this command to do so: blastall -d set1.fasta -i set2.fa -p blastp -m 9 -e 0.01 -o results.blast
As the two sets are from the same species, I would like to filter the results to keep only hits with > 99% identity where query and subject have the same length. After filtering on % identity I sometimes get results like this one:
Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
protein_1 protein_2 100.00 76 0 0 1 76 1 76 3e-46 154
protein_1 protein_2 100.00 76 0 0 77 152 1 76 3e-46 154
protein_1 protein_2 100.00 76 0 0 153 228 1 76 3e-46 154
protein_1 protein_2 100.00 76 0 0 229 304 1 76 3e-46 154
Here four parts of protein_1 hit the same sequence of protein_2. Since I only want hits between proteins of the same length, I would like to filter out this kind of result, but I don't know how. Does anyone know a parameter that could do that, or a way to filter the result file?
Thanks,
That table doesn't contain the query and subject sequence lengths, so you can't filter on length from it alone. With BLAST+ you can include qlen and slen as extra columns in your output rows. I don't know whether you can do that with legacy blast.
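For what it's worth, here is a rough sketch of how the filtering could look once the lengths are in the table. It assumes the search was rerun with something like blastp -db set1.fasta -query set2.fa -outfmt "7 std qlen slen" -evalue 0.01 -out results.blast, so that qlen and slen end up as columns 13 and 14 after the twelve standard ones. The function name filter_hits and the example rows are made up for illustration (the real length of protein_1 isn't given in the thread; 304 is just the smallest value consistent with the hits shown).

```python
def filter_hits(lines, min_identity=99.0):
    """Yield tabular BLAST+ rows with identity > min_identity and qlen == slen.

    Assumes the column layout of -outfmt "7 std qlen slen":
    12 standard columns followed by qlen and slen.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#' comment lines written by outfmt 7.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        pident = float(fields[2])
        qlen, slen = int(fields[12]), int(fields[13])
        if pident > min_identity and qlen == slen:
            yield fields


# Hypothetical rows, for illustration only.
example = [
    "# Fields: query acc.ver, subject acc.ver, ...",
    "protein_1\tprotein_2\t100.00\t76\t0\t0\t1\t76\t1\t76\t3e-46\t154\t304\t76",
    "protein_3\tprotein_4\t99.50\t200\t1\t0\t1\t200\t1\t200\t1e-120\t400\t200\t200",
]
kept = list(filter_hits(example))
for fields in kept:
    print("\t".join(fields))
```

On a real file you would pass the open file handle instead of the list, e.g. filter_hits(open("results.blast")). If you also want the alignment itself to span both proteins end to end, you could additionally compare the alignment length column with qlen.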
Thanks, it works well with blast+.
How large are your two sets? It might be easier to make simple pairwise alignments of only those proteins that have the same length. In Biopython you can use the pairwise2 module for this task, e.g. alignment = pairwise2.align.globalxx(seq1, seq2, score_only=True). For this example, the score of the alignment should equal the length of the protein if the two proteins are 100% identical.
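To illustrate the "only compare proteins of the same length" idea, here is a minimal sketch using just the standard library. Note that it is not the pairwise2 approach: instead of a global alignment it compares the two sequences position by position without gaps, which is a simplification that only makes sense because both sequences have the same length (a gapped alignment could score the same pair higher). The helper names and the toy sequences are invented for the example. Also, as far as I know, recent Biopython releases mark pairwise2 as deprecated in favour of Bio.Align.PairwiseAligner, so check your version if you go the Biopython route.

```python
from collections import defaultdict


def read_fasta(lines):
    """Parse FASTA lines into a {name: sequence} dict (name = first word of header)."""
    records, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None and line:
            records[name] += line
    return records


def ungapped_identity(a, b):
    """Percent of identical positions between two equal-length, non-empty sequences."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def same_length_matches(set1, set2, min_identity=99.0):
    """Compare each protein in set2 only against set1 proteins of the same length."""
    by_length = defaultdict(list)
    for name, seq in set1.items():
        by_length[len(seq)].append((name, seq))
    hits = []
    for qname, qseq in set2.items():
        if not qseq:
            continue  # nothing to compare for empty records
        for sname, sseq in by_length.get(len(qseq), []):
            identity = ungapped_identity(qseq, sseq)
            if identity > min_identity:
                hits.append((qname, sname, identity))
    return hits


# Toy sequences, made up for illustration.
set1 = read_fasta([">s1", "MKTAYIAKQR", ">s2", "MKTAYIAKQRQISFVK"])
set2 = read_fasta([">q1", "MKTAYIAKQR", ">q2", "MKTAYIAKQL"])
hits = same_length_matches(set1, set2)
print(hits)
```

Bucketing by length first keeps the number of comparisons small, since proteins of different lengths are never compared at all.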