Question

Size of Proteins (Acetohalobium arabaticum - species)

23 months ago

Maria Eduarda • 0

I need to find the size (length) of each protein sequence, but I couldn't do it, even with the code below. This first piece of code was meant to count how many proteins there are in total and to find the size of the sequences.

The attached image is just to show what I want the code to search for. I don't know what is missing in the code.

# Print the whole file (header lines and sequences)
with open("genoma9.faa") as arq:
    conteudo = arq.read()
    print(conteudo)

# Count the header lines; each one starts with ">" and marks one protein
n = 0
with open("genoma9.faa") as fh:
    for line in fh:
        if line.startswith(">"):
            n += 1
            print(line)

print("Total of Proteins: " + str(n))

enter image description here

I am trying to get the size of the characters in the middle, i.e. the sequence lines between the >WP header lines:

Example:

>WP_013277001.1 DNA polymerase III subunit beta [Acetohalobium arabaticum]
MQIKIDRKNFYDGIQTVRKAISSKSTLPILSGILIETQEKKLKLVGTDLELGIECRVDANIIKDGAIVLPANHLANIVRE
LPNKELELELKKDNKIEISCGLSQFKIHGSPADEYPLLPEVGSGIEYTLSQEKFQAMINRIKFATSDDESRPFLTGGLLS

protein python • 1.3k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 23 months ago by Maria Eduarda • 0

You said: "FAA File Sequence. I'm going to post the code now."

Please, do so now.

ADD REPLY • link 23 months ago by Pierre Lindenbaum 164k

Answer from the other post:

with open('genoma9.faa', 'r') as openFile, open('updatedFile.txt', 'w') as writeFile:
    for txtLine in openFile:
        # keep only the sequence lines, skipping the '>WP' header lines
        if not txtLine.startswith('>WP'):
            print(txtLine)
            writeFile.write(txtLine)

ADD REPLY • link 23 months ago by Maria Eduarda • 0

Have you tried running this piece of code? It looks like it has an indentation error?

ADD REPLY • link 23 months ago by barslmn ★ 2.3k

This post is the same as your previous one, "Print the size of a protein". Stop asking new questions and update your original post instead.

ADD REPLY • link 23 months ago by Pierre Lindenbaum 164k

I reposted because I deleted the other one since I didn't post the code in the old post.

ADD REPLY • link 23 months ago by Maria Eduarda • 0

The edit button is for edits, no need to delete.

ADD REPLY • link 23 months ago by ATpoint 85k

score 1 · Accepted Answer · 2022-12-30

I think you are reinventing the wheel. There is no need to write separate code for handling biological sequences when it all exists in Biopython and can be accessed in a few lines of code. What I show below could be optimized, so it is only for illustration. I suggest you save it into a file named fasta_len_and_number.py or something like that.

import sys
from Bio import SeqIO

# open the FASTA file given as the first argument after the script name
with open(sys.argv[1], 'r') as fasta_file:
    counter = 0  # initialize sequence counter
    for rec in SeqIO.parse(fasta_file, 'fasta'):
        counter += 1  # increase sequence counter
        name = rec.id  # sequence ID (header text up to the first space)
        seq_len = len(rec)  # sequence length, same as len(rec.seq)
        print(seq_len, name)  # print the length + ID

print('\n A total of %d sequences' % counter)

Running this line:

python fasta_len_and_number.py genoma9.faa

should hopefully produce the desired output. When I tested it on one of my files, the last 10 lines looked like this:

130 2HY5_A
114 2Q68_A
141 4CDL_A
47 6O3S_A
12 4AKT_C
37 4Z80_C
145 1Y23_A
215 6S5A_L

 A total of 112217 sequences
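
If installing Biopython is not an option, the same idea can be sketched in plain Python. This is only a minimal sketch, assuming a well-formed FASTA file in which every record starts with a ">" header line and the sequence may wrap over several lines. The function name fasta_lengths and the toy records are made up for illustration; they are not from any library or from the real genoma9.faa file.

```python
def fasta_lengths(lines):
    """Yield (record_id, sequence_length) for each FASTA record in `lines`."""
    record_id = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, length  # finish the previous record
            fields = line[1:].split()
            record_id = fields[0] if fields else ""  # ID = first word of the header
            length = 0
        elif record_id is not None:
            length += len(line)  # a sequence can span several lines
    if record_id is not None:
        yield record_id, length  # last record in the input


# Toy input (made-up records, for illustration only)
example = [">seq1 toy protein", "MKT", "AYI", ">seq2 another toy", "MQ"]

total = 0
for record_id, length in fasta_lengths(example):
    total += 1
    print(length, record_id)  # prints "6 seq1", then "2 seq2"
print("A total of %d sequences" % total)  # prints "A total of 2 sequences"
```

To use it on a real file, pass an open file handle instead of the list, for example inside `with open("genoma9.faa") as handle:`. The function only adds up line lengths, so it never holds the whole file in memory.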