Question

Parsing a multi-FASTA file in Python

6.5 years ago

erick_rc93 ▴ 30

I'm trying to parse a multi-FASTA file in Python with the following script:

import re

def loadFasta(filename):
    if (filename.endswith(".gz")):
        fp = gzip.open(filename, 'rb')
    else:
        fp = open(filename, 'rb')
    # split at headers
    data = fp.read().split(">")
    fp.close()
    # ignore whatever appears before the 1st header
    data.pop(0)     
    headers = []
    sequences = []
    for sequence in data:
        lines = sequence.split('\n')
        headers.append(lines.pop(0))
        # add an extra "+" to make string "1-referenced"
        sequences.append('+' + ''.join(lines))
    return (headers, sequences)

header, seq = loadFasta("/path/to/fasta/all_chromosomes.fasta")

for i in xrange(len(header)):
    print (header[i])
    print (len(seq[i])-1, "bases", seq[i][:30], "...", seq[i][-30:])
    print

genome = seq[0]

But when I run the script above, I get the following error message:

Traceback (most recent call last):
  File "parser_fasta.py", line 23, in <module>
    header, seq = loadFasta("/path/to/fasta/all_chromosomes.fasta")
  File "parser_fasta.py", line 10, in loadFasta
    data = fp.read().split(">")
TypeError: a bytes-like object is required, not 'str'

sequence • 2.4k views

Any reason you didn't try https://biopython.org/wiki/SeqIO?

6.5 years ago by Kevin ▴ 640

score 1 · Answer 1 · 2019-02-28

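You open your file in binary mode with 'rb', so fp.read() returns bytes, and bytes.split() does not accept the str separator ">". That is what the TypeError is telling you. Binary mode isn't necessary here: use 'r' or omit the mode argument, because text mode is the default for open(). Note that gzip.open() defaults to binary mode, so for the .gz branch you need to pass 'rt' explicitly.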
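To make the cause concrete, here is a minimal sketch that reproduces the error without any file. The byte string is a made-up stand-in for what fp.read() returns in 'rb' mode; it is not taken from your data.

```python
# Hypothetical stand-in for the result of fp.read() on a file opened with 'rb'
data = b">chr1\nACGT\n>chr2\nGGCC\n"

try:
    data.split(">")  # str separator on a bytes object raises TypeError
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# A bytes separator works, but every piece is still bytes, not str
parts = data.split(b">")
print(parts)
```

Splitting on b">" avoids the exception, but the headers and sequences then stay bytes, which is usually not what you want for sequence text.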

Alternatively, keep binary mode and decode the bytes before splitting: data = fp.read().decode().split(">")
When working with files, it is good practice to use the with statement, because it takes care of closing your file correctly, even if an error occurs.
As Kevin said above, I would highly recommend using an existing module for handling FASTA files, such as Biopython.
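Putting the first two points together, a corrected loader could look roughly like the sketch below. It is only one way to do it: load_fasta is my renaming of your loadFasta, the demo file and its contents are invented for the self-check, and like your original it reads the whole file into memory.

```python
import gzip
import os
import tempfile


def load_fasta(filename):
    """Return (headers, sequences) from a plain or gzip-compressed FASTA file."""
    # 'rt' makes both open() and gzip.open() return str instead of bytes
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rt") as fp:  # closed automatically on exit
        records = fp.read().split(">")
    # ignore whatever appears before the first header
    records.pop(0)
    headers, sequences = [], []
    for record in records:
        lines = record.split("\n")
        headers.append(lines.pop(0))
        # leading "+" keeps the sequence "1-referenced", as in the question
        sequences.append("+" + "".join(lines))
    return headers, sequences


# Self-check on a hypothetical two-record file written to a temp directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.fasta")
    with open(path, "w") as out:
        out.write(">chr1 test\nACGT\nTTAA\n>chr2\nGGCC\n")
    headers, sequences = load_fasta(path)

for header, seq in zip(headers, sequences):
    print(header, len(seq) - 1, "bases", seq[:30])
```

In your printing loop you would also need range instead of xrange, and print() instead of a bare print, if you are on Python 3 (which the error message suggests).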

fin swimmer