Counting Motifs In Proteins
2
3
Entering edit mode
13.0 years ago
Nitin ▴ 170

hi all,

I protein sequences in text file as follows

> P1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL

>P2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR

>P3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS
NAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKVFNQPKAKAKELGIEICLMYVEIE
KGESVQEELLKGLDNKNPKIIVACIETLRKALSEFGSKIISLKPIIKVLPKLFESRDKAV
RDEAKLFAIEIYRWNRDAVKHTLQNINSVQLKELEEEWVKLPTGAPKPSRFLRSQQELEA
KLEQQQSAGGDAEGGGDDGDEVPQVDAYELLDAVEILSKLPKDFYDKIEAKKWQERKEAL
EAVEVLVKNPKLEAGDYADLVKALKKVVGKDTNVMLVALAAKCLTGLAVGLRKKFGQYAG
HVVPTILEKFKEKKPQVVQALQEAIDAIFLTTTLQNISEDVLAVMDNKNPTIKQQTSLFI
ARSFRHCTSSTLPKSLLKPFCAALLKHINDSAPEVRDAAFEALGTALKVVGEKSVNPFLA

...

in total 100 sequences In these sequences i searched a motif of interest using python script as follows

import re
infile=open("seq.fasta",'r')
out=open("results.csv",'w')
pattern=re.compile(r"(P[A-Z]{2}P)")
for line in infile:
   line = line.strip("\n")
   if line.startswith('>'):
      name=line
   else:      
      s = re.findall(pattern,line)
      print '%s:%s' %(name,s)
      out.write('%s:\t%s\n' %(name,s))

This script perfectly worked it gave me desired motif i wanted...now i wanted to count motif of interest in each seqeunce present out put of the script is as follows

>p1: 
PGCP

>p1: 
PHCP,
PKCP

and so on

but I want out put as follows

>p1:
1

>p1: 
2

Can anybody tell me how to do this using python?

Thanks in Advance

Ni

python • 5.5k views
ADD COMMENT
5
Entering edit mode
13.0 years ago

You are very close... Just notice that you can get the total number of occurrences by looking at the length of "s".

For example:

print '%s:%s (%s hits)' %(name,s, len(s))

instead of

print '%s:%s' %(name,s)
ADD COMMENT
2
Entering edit mode
13.0 years ago

You may also be interested in the SLiMSearch software described here. It allows you to search for regular expressions in protein sequences. Video lecture.

Software available here: http://www.soton.ac.uk/~re1u06/software/slimsuite/

ADD COMMENT

Login before adding your answer.

Traffic: 2880 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6