finding the frequency of a motif-matching peptide
1
0
Entering edit mode
2.0 years ago
andrea • 0

Hi everyone, I want to find the frequency of each motif-matching peptide.

> Sequence_1
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL

>Sequence_2
MAEVRKFTKRLSKPGTAAELRQSVSEAVRGSVVLEKAKLVEPLDYENVITQRKTQIYSDP
LRDLLMFPMEDISISVIGRQRRTVQSTVPEDAEKRAQSLFVKECIKTYSTDWHVVNYKYE
DFSGDFRMLPCKSLRPEKIPNHVFEIDEDCEKDEDSSSLCSQKGGVIKQGWLHKANVNST
ITVTMKVFKRRYFYLTQLPDGSYILNSYKDEKNSKESKGCIYLDACIDVVQCPKMRRHAF
ELKMLDKYSHYLAAETEQEMEEWLIMLKKIIQINTDSLVQEKKDTVEAIQEEETSSQGKA
ENIMASLERSMHPELMKYGRETEQLNKLSRGDGRQNLFSFDSEVQRLDFSGIEPDVKPFE
EKCNKRFMVNCHDLTFNILGHIGDNAKGPPTNVEPFFINLALFDVKNNCKISADFHVDLN
PPSVREMLWGTSTQLSNDGNAKGFSPESLIHGIAESQLCYIKQGIFSVTNPHPEIFLVVR

>Sequence_3
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS
NAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKVFNQPKAKAKELGIEICLMYVEIE
KGESVQEELLKGLDNKNPKIIVACIETLRKALSEFGSKIISLKPIIKVLPKLFESRDKAV
RDEAKLFAIEIYRWNRDAVKHTLQNINSVQLKELEEEWVKLPTGAPKPSRFLRSQQELEA
KLEQQQSAGGDAEGGGDDGDEVPQVDAYELLDAVEILSKLPKDFYDKIEAKKWQERKEAL
EAVEVLVKNPKLEAGDYADLVKALKKVVGKDTNVMLVALAAKCLTGLAVGLRKKFGQYAG
HVVPTILEKFKEKKPQVVQALQEAIDAIFLTTTLQNISEDVLAVMDNKNPTIKQQTSLFI
ARSFRHCTSSTLPKSLLKPFCAALLKHINDSAPEVRDAAFEALGTALKVVGEKSVNPFLA

> Sequence_4
MPPRRSIVEVKVLDVQKRRVPNKHYVYIIRVTWSSGATEAIYRRYSKFFDLQMQMLDKFP
MEGGQKDPKQRIIPFLPGKILFRRSHIRDVAVKRLIPIDEYCKALIQLPPYISQCDEVLQ
FFETRPEDLNPPKEEHIGKKKSGNDPTSVDPMVLEQYVVVADYQKQESSEISLSVGQVVD
IIEKNESGWWFVSTAEEQGWVPATCLEGQDGVQDEFSLQPEEEEKYTVIYPYTARDQDEM
NLERGAVVEVVQKNLEGWWKIRYQGKEGWAPASYLKKNSGEPLPPKLGPSSPAHSGALDL
DGVSRHQNAMGREKELLNNQRDGRFEGRLVPDGDVKQRSPKMRQRPPPRRDMTIPRGLNL 


>Sequence_5
GDDSEWLKLPVDQKCEHKLWKARLSGYEEALKIFQKIKDEKSPEWSKYLGLIKKFVTDS
NAVVQLKGLEAALVYVENAHVAGKTTGEVVSGVVSKVFNQPKAKAKELGIEICLMYVEIE
KGESVQEELLKGLDNKNPKIIVACIETLRKALSEFGSKIISLKPIIKVLPKLFESRDKAV
RDEAKLFAIEIYRWNRDAVKHTLQNINSVQLKELEEEWVKLPTGAPKPSRFLRSQQELEA
KLEQQQSAGGDAEGGGDDGDEVPQVDAYELLDAVEILSKLPKDFYDKIEAKKWQERKEAL
EAVEVLVKNPKLEAGDYADLVKALKKVVGKDTNVMLVALAAKCLTGLAVGLRKKFGQYAG
HVVPTILEKFKEKKPQVVQALQEAIDAIFLTTTLQNISEDVLAVMDNKNPTIKQQTSLFI
ARSFRHCTSSTLPKSLLKPFCAALLKHINDSAPEVRDAAFEALGTALKVVGEKSVNPFLA

The fasta file actually has more sequences, I just chose the top 5, to put on here.

The code is below:

import Bio
import regex

from Bio import SeqIO
input_file = 'Sequences.fasta'
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
     name, sequence = fasta.id, str(fasta.seq)
     matches=regex.finditer(r"(P[A-Z]{2}P..)",sequence)
     for m in matches:
        print(name, m.start(), m.end(), m.group())

This is giving me my desired output. which is below:

Sequence1 73 79 PFLPGK
Sequence1 281 287 PLPPKL
Sequence1 288 294 PSSPAH
Sequence4 73 79 PFLPGK
Sequence4 281 287 PLPPKL
Sequence4 288 294 PSSPAH

Now I want to:

(1) Calculate the number of sequences in that file. I know in linux environment I would have to grep -c ">" Sequences.fasta, but how do I do it in python? Basically I want to count how many sequences are there in total.

(2) Once I know how many sequences are in file in total, I want to count how many of those sequences have the matching peptides above for example how many sequences have PFLPGK at positions 73 79. In the example above PFLPGK its in sequence 1 & 4, so the frequency is 2, meaning that out of the 5 sequences above only two have PFLPGK. So how do I calculate that using python?

I know this is a lot of questions, please bare with me, I am still in the learning process for this. Many thanks for your assistance.

frequency multifasta motif biopython python • 923 views
ADD COMMENT
3
Entering edit mode
2.0 years ago
iraun 6.2k
import Bio
import re
import collections
from Bio import SeqIO
input_file = 'seq.fa'
num_sequences = 0
match_list = []
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
for fasta in fasta_sequences:
     num_sequences+=1
     name, sequence = fasta.id, str(fasta.seq)
     matches=re.finditer(r"(P[A-Z]{2}P..)",sequence)
     for m in matches:
        match_id = f'{m.start()}_{m.end()}_{m.group()}'
        match_list.append(match_id)
        print(name, m.start(), m.end(), m.group())
print("Total number of sequences:", num_sequences)
print("Peptide frequency:", collections.Counter(match_list))
ADD COMMENT

Login before adding your answer.

Traffic: 1659 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6