Parsing a multi-FASTA file in Python
5.7 years ago
erick_rc93 ▴ 30

I'm trying to parse a multi-FASTA file in Python with the following script:

import re

def loadFasta(filename):
    if (filename.endswith(".gz")):
        fp = gzip.open(filename, 'rb')
    else:
        fp = open(filename, 'rb')
    # split at headers
    data = fp.read().split(">")
    fp.close()
    # ignore whatever appears before the 1st header
    data.pop(0)     
    headers = []
    sequences = []
    for sequence in data:
        lines = sequence.split('\n')
        headers.append(lines.pop(0))
        # add an extra "+" to make string "1-referenced"
        sequences.append('+' + ''.join(lines))
    return (headers, sequences)

header, seq = loadFasta("/path/to/fasta/all_chromosomes.fasta")

for i in xrange(len(header)):
    print (header[i])
    print (len(seq[i])-1, "bases", seq[i][:30], "...", seq[i][-30:])
    print

genome = seq[0]

But when I run the script above, I get the following error:

Traceback (most recent call last):
  File "parser_fasta.py", line 23, in <module>
    header, seq = loadFasta("/path/to/fasta/all_chromosomes.fasta")
  File "parser_fasta.py", line 10, in loadFasta
    data = fp.read().split(">")
TypeError: a bytes-like object is required, not 'str'

Any reason you didn't try Biopython's SeqIO (https://biopython.org/wiki/SeqIO)?

5.7 years ago

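You open your file in binary mode with 'rb', so in Python 3 fp.read() returns bytes, and calling split(">") with a str argument raises the TypeError. Binary mode isn't necessary here. For open(), just use 'r' or omit the parameter, because text mode is the default. For gzip.open(), pass 'rt' explicitly, because there the default is binary.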
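To see the difference between the two modes, here is a minimal sketch that reproduces the error on a small made-up FASTA file. The temporary file and its two records (chr1, chr2) are assumptions for illustration only, not the original data.

```python
import os
import tempfile

# Write a tiny two-record FASTA file to a temporary location (made-up data).
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">chr1\nACGT\n>chr2\nGGCC\n")
    path = tmp.name

# Binary mode returns bytes, so splitting on the str ">" raises TypeError.
with open(path, "rb") as fp:
    raw = fp.read()
try:
    raw.split(">")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Text mode (the default for open) returns str, so the same split works.
with open(path) as fp:
    records = fp.read().split(">")

os.remove(path)
print(error_message)
print(records)
```

The first element of records is an empty string, which is why the original script discards it with data.pop(0).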

  • What also works is to decode the bytes first: data = fp.read().decode().split(">")
  • When working with files, it is good practice to use the with statement, because it takes care of closing your file correctly.
  • As Kevin said before, I would highly recommend using an existing module for handling FASTA files, such as Biopython.
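Putting these points together, a possible rewrite of the loader could look like the sketch below. It keeps the logic of the original function, including the extra "+" prefix, but reads in text mode and uses a with statement. The name load_fasta and the small gzip test file are assumptions for illustration. Like the original, it reads the whole file into memory, so for large genomes an existing parser is probably the better choice.

```python
import gzip
import os
import tempfile


def load_fasta(filename):
    """Return (headers, sequences) from a plain or gzip-compressed FASTA file."""
    # gzip.open defaults to binary mode, so text mode is requested with "rt";
    # the built-in open accepts "rt" as well.
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(filename, "rt") as fp:  # the file is closed automatically
        data = fp.read().split(">")
    data.pop(0)  # ignore whatever appears before the first header
    headers, sequences = [], []
    for record in data:
        lines = record.split("\n")
        headers.append(lines.pop(0))
        # the extra "+" makes the sequence string 1-referenced, as in the original
        sequences.append("+" + "".join(lines))
    return headers, sequences


# Usage on a small made-up gzip-compressed file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.fasta.gz")
with gzip.open(path, "wt") as out:
    out.write(">chr1 test\nACGT\nTTAA\n>chr2\nGGCC\n")

headers, sequences = load_fasta(path)
for header, sequence in zip(headers, sequences):
    print(header, len(sequence) - 1, "bases")
```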

fin swimmer
