Finding unique regions in a sequence
2
0
Entering edit mode
2.2 years ago

I have a human sequence and would like to find unique regions within it that are at least 11 amino acids long. Any leads on how to address this?

My current thinking is to use blast to the human protein database and find region(s) that have no similarity to any other sequence in the database, but I am wondering if there are other, more principled ways.

Ultimately, the goal is to find a unique region so one can design an antibody that would be specific to the protein.

seq peptide sequence blast • 892 views
ADD COMMENT
1
Entering edit mode
2.2 years ago

Don't use alignments here. Those are not optimized for finding exact matches at the rate you need.

The simplest method, in my opinion is to break the sequences into 11bp pieces, one piece per line, the sort and uniq rank them like so:

pieces.txt | sort | uniq -c | sort | head
ADD COMMENT
1
Entering edit mode
2.2 years ago

As an example:

import sys
from Bio import SeqIO

# KMER size
SIZE = 11

recs = SeqIO.parse(sys.stdin, format="fasta")
for rec in recs:
    for i in range(0, len(rec.seq)-SIZE):
        seq = rec.seq[i:i+SIZE]
        print (seq)

then you can:

# Get some protein data
wget -nc http://ftp.ensembl.org/pub/current_fasta/accipiter_nisus/pep/Accipiter_nisus.Accipiter_nisus_ver1.0.pep.all.fa.gz

# Chop it into pieces
gunzip -c Accipiter_nisus.Accipiter_nisus_ver1.0.pep.all.fa.gz  | python chop.py > pieces.txt

# How many times did each piece occur
cat pieces.txt | sort | uniq -c | sort -rn > count.txt

looking at the end of the file:

tail count.txt

it prints:

  1 AAAAAAAAAGH
  1 AAAAAAAAAGD
  1 AAAAAAAAAFH
  1 AAAAAAAAAEK
  1 AAAAAAAAADL
  1 AAAAAAAAACR
  1 AAAAAAAAAAX
  1 AAAAAAAAAAN
  1 AAAAAAAAAAM
  1 AAAAAAAAAAD

the bottleneck is sorting, depending on the organism's size.

instead of sorting it as one file, split the counts.txt file into multiple files, sort each file in parallel, the merge sort the result

ADD COMMENT
0
Entering edit mode

Thank you very much for this, it's great. The only issue I see is that the match is exact, and does not take into account similarity, I see that following this approach, for each 11-mer, I would need to get all the similar sequences, and then do as you describe, which would make this take much longer.

I also don't understand why blast wouldn't work here, if an 11-mert of the sequence is really unique, that would be reflected in the blast results wouldn't it?

ADD REPLY

Login before adding your answer.

Traffic: 2593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6