Question

Find the location of desired AA sequences in a protein FASTA file

3.2 years ago

shivam-gupta • 0

I am working with protein FASTA files. I want to locate the desired AA sequences in every clone in the protein FASTA file using Python.

    from Bio import SeqIO

    # Extract the protein sequences from the FASTA file
    records = SeqIO.parse("protein.fasta", "fasta")
    for record in records:
        output = record.seq

    print(output)  # just to show what the output looks like
    # I used ** to highlight the desired area
    -->VVSREL**QALEA**IRQKDEEDABCKARFRGIFSH
    -->VVSRPQREEARJKLMIRQKDEED**KARFRG**IFSH
    -->VVSREL**QALEA**RIRDKARFRGIFSH

    # Read the AA sequences from the text file
    f = open('amino_acids.txt', 'r')
    for i in f:  # to show what this file looks like
        print(i)
    -->'QALEA', 'KARFRG', 'QALEAR', 'KAKAKA', 'PAKAR'

    # Try to match my AA sequences against the protein sequences
    for i in f:
        for j in output:
            if i in j:
                print('found')
            else:
                print('not found')
    # output
    --> error
    --> error
    --> error

How can I locate the desired AA sequences in the protein FASTA file?

Any help will be appreciated.

python biopython • 1.4k views

updated 2.0 years ago by Ram 44k • written 3.2 years ago by shivam-gupta • 0

I have a related script that may help you adapt yours. It's called find_sequence_element_occurrences_in_sequence.py, and the information about it is here, including a link to a demo Jupyter notebook and instructions for running it in sessions served by MyBinder.org. It's not as refined as my more recent scripts; however, it may give you some ideas. Also note that the description talks about how it was originally written for nucleic acids, so it also searches the other strand, which is moot for proteins, and it suggests how to fix that.

Down the road, you may want fuzzier search abilities, with pattern matching or regular expressions for examining sequences. The README page of that sequencework/FindSequence subrepo has some resources and information about that. It links to a whole demo I made on using PatMatch, a program for finding patterns in peptide and nucleotide sequences. You may also want to incorporate approaches that leave less for you to maintain; I've got a list of related resources there as well.
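As a small illustration of the regular-expression route, here is a minimal sketch using Python's standard `re` module. The pattern is a hypothetical example (it is not taken from PatMatch or from the repository mentioned above): it matches `QALEA` followed by either `I` or `R`. The sequences are the three from the question, and the `clone1` to `clone3` labels are made up for the example.

```python
import re

# Hypothetical pattern: QALEA followed by either I or R
pattern = re.compile(r"QALEA[IR]")

# Sequences copied from the question; the clone names are placeholders
sequences = {
    "clone1": "VVSRELQALEAIRQKDEEDABCKARFRGIFSH",
    "clone2": "VVSRPQREEARJKLMIRQKDEEDKARFRGIFSH",
    "clone3": "VVSRELQALEARIRDKARFRGIFSH",
}

# For each sequence, collect (0-based start, matched text) for every match
regex_hits = {
    name: [(m.start(), m.group()) for m in pattern.finditer(seq)]
    for name, seq in sequences.items()
}

for name, found in regex_hits.items():
    print(name, found)
```

Note that `finditer` reports non-overlapping matches only, which is usually fine for short motifs like these.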

3.2 years ago by Wayne ★ 2.1k

Since the original question is about Python code, I will move this to a comment. While useful, it does not directly answer the original question.

3.2 years ago by GenoMax 149k

score 0 · Answer 1 · 2021-12-02

You can use the find method of the Seq object to search for matches. It returns the 0-based start position of the first match, or -1 if there is no match.

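from collections import defaultdict

from Bio import SeqIO

# Read the query peptides once. This assumes the file looks like the example
# in the question: quoted, comma-separated entries such as 'QALEA', 'KARFRG'.
with open("amino_acids.txt") as handle:
    amino_acids = [
        aa.strip().strip("'\"")
        for line in handle
        for aa in line.split(",")
        if aa.strip()
    ]

matches = defaultdict(list)
missing = defaultdict(list)
# Parse the protein sequences from the FASTA file
for record in SeqIO.parse("protein.fasta", "fasta"):
    for aa in amino_acids:
        # Seq.find returns the 0-based start index, or -1 if there is no match
        m = record.seq.find(aa)
        if m >= 0:
            print(record.id, "matched", aa, "at position", m)
            matches[aa].append(record.id)
        else:
            print(record.id, "not matched", aa)
            missing[aa].append(record.id)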

I haven't tested the code, but this should work. The output is two dictionaries, matches and missing, each with the amino acid queries as keys and lists of record IDs (the FASTA headers) as values.
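If you need every position where a motif occurs rather than only the first one, a minimal sketch along these lines may help. It works on plain Python strings so it runs without Biopython; with Biopython you could pass `str(record.seq)` for each record instead. The helper name `find_all_positions` and the `clone1` to `clone3` labels are hypothetical; the sequences and motifs are the ones shown in the question.

```python
def find_all_positions(sequence, motif):
    """Return the 0-based start of every occurrence, overlapping ones included."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one character later so overlapping occurrences are not skipped
        start = sequence.find(motif, start + 1)
    return positions


# Sequences and motifs copied from the question; the clone names are placeholders
sequences = {
    "clone1": "VVSRELQALEAIRQKDEEDABCKARFRGIFSH",
    "clone2": "VVSRPQREEARJKLMIRQKDEEDKARFRGIFSH",
    "clone3": "VVSRELQALEARIRDKARFRGIFSH",
}
motifs = ["QALEA", "KARFRG", "QALEAR", "KAKAKA", "PAKAR"]

# Map (sequence name, motif) to the list of start positions (empty if absent)
hits = {
    (name, motif): find_all_positions(seq, motif)
    for name, seq in sequences.items()
    for motif in motifs
}

for (name, motif), positions in hits.items():
    if positions:
        print(name, motif, positions)
```

Positions are 0-based, as with `str.find`; add 1 if you want the residue numbering commonly used for proteins.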