Hi all,
I have some 70 protein domains that have been found using HMMER3 that have shorter residues,the domains are short of 3 or 4 residues than that of the domains in the database.I want to write a program in biopython to retrieve the missing residues
First i would like to split my part of the sequence from the original database sequence but i have problems with this.
from Bio import SeqIO
db1 = "sample_db.fasta" # contains db_records
db2 = "sample.fasta" # contains my result
dbase_dict = SeqIO.read(db1, "fasta")
my_record_dict = SeqIO.read(db2, "fasta")
for record in my_record_dict :
if record in dbase_dict:
print dbase_dict.seq.split(my_record_dict.seq)
rec_dbase = dbase_dict[record]
rec_mine = my_record_dict[record]
list_seq = rec_dbase.seq.split("rec_mine.seq")
i would like to get list_seq = [ 'ARR' 'KEFIMAELIQTEKAYVRDLRECMDTYLWEMTSGVE' 'EIP' ] and then i would use strip command to retrieve the first and the last 3 residues .But split does not work.
ADD COMMENT
• link
updated 12.9 years ago by
Eric T.
★
2.8k
•
written 12.9 years ago by
Hari
▴
280
0
Entering edit mode
UniProt is a repository of proteins, not domains (though SMART/PFAM/...) may be linked. So please clarify this part. Can't you use the full domains from the source dbs once you have identified them?
So one thing you could do to force your protein into the domain model is to count the "-" at the beginning and the end and include the correct number of residues.
in uniprot: KKEFIMAELLQTEKAYVRDLHECLETYLWEMTSGVEEIPPGILNKEHIIFGNIQEIYDFH NNIFLKELEKYEQLPEDVGHCFVTWADKFQMYVTYCKNKPDSNQLILEHAGTFFDEIQQR HGLANSISSYLIKPVQRITKYQLLLKELLTCCEEGKGELKDGLEVMLSVPKKANDA
when you compare them the first 6 residues and the last 4 residues are missed by HMMER search completely,how do i overcome this problem?
Problem: extract just the 255-aa protein kinase domains from a set of 70 sequences.
You don't really need to use the split() method; slicing the original sequence should do it once you've calculated the sequence coordinates.
If you're already parsing some of HMMer's plain-text output, you can look at the fields "hmmfrom" and "hmm to" to see how many positions are missing from the start and end of the sequence, then look at the "alifrom" and "ali to" fields to determine the positions of the original sequence that matched the HMM profile. Then, simple arithmetic to get the coordinates in the original sequence:
The sequence coordinates of the aligned region also appear in the Stockholm output, in the sequence identifier after the slash character -- so you can also count the number of terminal dashes, minus any leading or trailing insert (lowercase) characters.
Keep in mind that for some purposes, the missing sequence positions might not be meaningful -- e.g. many biologically functional kinases are missing one or more subdomains, so the sequence portions that were left out by the HMM profile could be phylogenetically or structurally unrelated to the typical kinase subdomain. Though if the missing portions are only 3-4 aa I think you're fine, it's just HMMer being stubborn.
I dunno much about HMMER but you could attempt a biopython solution along these lines (since your hmmer result sequence is "included" in the database sequence):
from Bio import SeqIO
import re
hmmer_seq_string = str(hmmer_result.seq)
dbase_seq_string = str(dbase_record.seq)
match = re.search(hmmer_seq_string, dbase_seq_string)
s = match.start()
e = match.end()
first_residues = dbase_seq_string[:s]
last_residues = dbase_seq_string[e:]
print first_residues
print last_residues
UniProt is a repository of proteins, not domains (though SMART/PFAM/...) may be linked. So please clarify this part. Can't you use the full domains from the source dbs once you have identified them?
yes but we just need the domains not the whole sequences extracting it with the ids is painful,i m trying to write some code that can do this.