I have a human sequence and would like to find unique regions within it that are at least 11 amino acids long. Any leads on how to address this?
My current thinking is to use blast to the human protein database and find region(s) that have no similarity to any other sequence in the database, but I am wondering if there are other, more principled ways.
Ultimately, the goal is to find a unique region so one can design an antibody that would be specific to the protein.
import sys
from Bio import SeqIO
# KMER size
SIZE = 11
recs = SeqIO.parse(sys.stdin, format="fasta")
for rec in recs:
for i in range(0, len(rec.seq)-SIZE):
seq = rec.seq[i:i+SIZE]
print (seq)
then you can:
# Get some protein data
wget -nc http://ftp.ensembl.org/pub/current_fasta/accipiter_nisus/pep/Accipiter_nisus.Accipiter_nisus_ver1.0.pep.all.fa.gz
# Chop it into pieces
gunzip -c Accipiter_nisus.Accipiter_nisus_ver1.0.pep.all.fa.gz | python chop.py > pieces.txt
# How many times did each piece occur
cat pieces.txt | sort | uniq -c | sort -rn > count.txt
Thank you very much for this, it's great. The only issue I see is that the match is exact, and does not take into account similarity, I see that following this approach, for each 11-mer, I would need to get all the similar sequences, and then do as you describe, which would make this take much longer.
I also don't understand why blast wouldn't work here, if an 11-mert of the sequence is really unique, that would be reflected in the blast results wouldn't it?
Thank you very much for this, it's great. The only issue I see is that the match is exact, and does not take into account similarity, I see that following this approach, for each 11-mer, I would need to get all the similar sequences, and then do as you describe, which would make this take much longer.
I also don't understand why blast wouldn't work here, if an 11-mert of the sequence is really unique, that would be reflected in the blast results wouldn't it?