Question

slicing reads of a multi-fasta file and writing them to a new file

0

Entering edit mode

6.9 years ago

Florian BERNARD ▴ 20

Hi everyone,

I am sure this is not a very complicated problem but for some reason I cannot find a straight answer ( and/or I miss to see the logic in some of the answers I saw on other's people post). As the title says, I am trying to find a way to open a multi-fasta file, extract the first n bases of each read and "write" all of that to a new fasta file. Due to my inexperience I assumed it would be as easy as:

import Bio
reads = SeqIO.parse(file,"fasta")
end = []
n = 50
for record in reads:
    record[:n].append(end)
SeqIO.write(end, "end.fasta", "fasta")

But obviously there is several things wrong with my code and I don't really know how to fix it. It seems I would need to construct a SeqRecord object with the new sliced read and then pass that to SeqIO.write but I'm not sure how to manage to have all the fasta reads written in the same file at once and not having each record write over the last one.. any help/indication would be greatly appreciated ! Thanks in advance.

python biopython fasta • 3.9k views

ADD COMMENT • link updated 6.9 years ago by WouterDeCoster 47k • written 6.9 years ago by Florian BERNARD ▴ 20

score 2 · Answer 1 · 2018-02-18

2

Entering edit mode

6.9 years ago

WouterDeCoster 47k

You seem to have a few things mixed up.

First, you need to import SeqIO, and not just Bio. Your code would throw an error at line 2 since SeqIO is not defined.

from Bio import SeqIO

Second, your description and code don't match. You write that you want the first n bases of each read, but take the last n bases ( and use a "end" list.). EDIT: woops I was wrong here.

Third, you are using the .append() method wrongly. Append is a method of a list, so of end.
Instead, you want to do:

end.append(record[:n])

Finally, I don't think there is a good reason here to store everything in a list. That's probably not a problem with your current dataset, but not very memory efficient. I'd suggest writing records out immediately.

So my suggestion would be:

import sys
from Bio import SeqIO
n = 50
for record in SeqIO.parse(sys.argv[1],"fasta")
    print(record[:n].format("fasta"))

Save this script as slice_last_50.py or any other name you like and use as:

python slice_last_50.py myinputfile.fasta > myoutputfile.fasta

ADD COMMENT • link 6.9 years ago by WouterDeCoster 47k

0

Entering edit mode

Thanks for your reply.

You seem to have a few things mixed up.

First, you need to import SeqIO, and not just Bio. Your code would throw an error at line 2 since SeqIO is not defined.
from Bio import SeqIO
  

I don't know if it is a feature of jupyter notebook or not but just writing import Bio works as good as what you wrote. Though I would agree just importing SeqIO would be faster than importing the whole biopython package ?

Second, your description and code don't match. You write that you want the first n bases of each read, but take the last n bases ( and use a "end" list.).

I made several tests by printing record[:n] and it was effectively printing the first n bases of the sequence of the record. My description was a bit misleading though, I'm sorry.. I wanted to make my question simpler to understand but what I actually want is to take the first n bases and write them to a file and do the same for the last n bases. So in my first code I had two list, one called end5 (for the first n bases) and one called end3 (for the last n bases).

Third, you are using the .append() method wrongly. Append is a method of a list, so of end. Instead, you want to do:
end.append(record[:n])
  

That's probably the main reason what it didn't seem to work..

Finally, I don't think there is a good reason here to store everything in a list. That's probably not a problem with your current dataset, but not very memory efficient. I'd suggest writing records out immediately.

So my suggestion would be:
import sys
from Bio import SeqIO
n = 50
for record in SeqIO.parse(sys.argv[1],"fasta")
    print(record[:n].format("fasta"))
  

I've got a file of 1.5million sequences to slice so writting everything to a list was definitely not a good idea but I had no idea how to do without it, so thanks a lot ! I'm still learning python by myself so I make rookie mistakes ..

Save this script as slice_last_50.py or any other name you like and use as:
python slice_last_50.py myinputfile.fasta > myoutputfile.fasta
  

I eventually figured it would be possible to do that but I must admit I am surprised there is no way to write a multifasta file at once.

Thanks again for your help, I will try that and will come back to accept your answer when I have it working !

ADD REPLY • link 6.9 years ago by Florian BERNARD ▴ 20

0

Entering edit mode

About importing Bio: Remarkable, at least in my python interpreter I get the following:

>>> import Bio
>>> SeqIO.parse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'SeqIO' is not defined

Anyway, that's not the main problem here.

You are absolutely right with my second comment, I messed up there.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

1

Entering edit mode

At least now I understand better why I see from x import y in most of the code posted online when import x works totally fine for me. It's good to keep it in mind though.

As for my problem here, your code helped me find a better suited solution which I post here in case anyone might need it:

import Bio
with open(out_file, "w") as handle: 
for record in SeqIO.parse(input_file,"fasta"):
    SeqIO.write(record[:n], handle, "fasta")

Thanks again for the time you dedicated me, I really appreciate it !

ADD REPLY • link 6.9 years ago by Florian BERNARD ▴ 20