Question

Searching through a multi-fasta file for records containing certain words in their description in BioPython 1.65, Python 3.4

1

Entering edit mode

10.5 years ago

tyleraelliott ▴ 50

I have a fasta file which contains thousands of sequences, with headers as such:

>scaffold_1|c(135298..135582)|DNA|DNA-0-1_NV

Each pipe-deliminated section of the header can vary from sequence to sequence, and some sequences might have identical headers except for the first or second sections.

I need to be able to search through this large file and pick out and print to another file specific sequences based upon their header. There needs to be degeneracy in this search however. I have seen examples where a library text file was used but only exact matches between the fasta file and library file would work.

For instance, let's say I want all sequences which have any variation on 'piggyBac' in their header (so PiggyBac, piggybac,DNA-piggyBac, etc.).

I'm just at a loss as to how to do this exactly. Is there some way to index this file and then search the keys for variations on 'piggyBac'? If anyone has suggestions or can point me to code that does something similar it would really be helpful.

I appreciate it

BioPython • 7.1k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by tyleraelliott ▴ 50

0

Entering edit mode

I should mention I'm using Python3 and the latest release of BioPython

ADD REPLY • link 10.5 years ago by tyleraelliott ▴ 50

0

Entering edit mode

It would be better to say which version of Python 3 (e.g. 3.3? 3.4?) and which version of Biopython (e.g. 1.65?), partly people may be reading this question in a years time, but also in case it helps give you are more accurate answer.

ADD REPLY • link 10.5 years ago by Peter 6.0k

0

Entering edit mode

Edited to show versions for both

ADD REPLY • link 10.5 years ago by tyleraelliott ▴ 50

2

Entering edit mode

10.5 years ago

Whetting ★ 1.6k

I would do something like

from Bio import SeqIO
with open("new_file.fas","w") as out:
    for record in SeqIO.parse("fasta.fas","fasta"):
        if "PIGGYBAC" in record.id.upper():
            print(">"+record.id, file=out)
           print(record.seq, file=out)

This should normalize the caps and look for your term in the header...

Cheers

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

That did the trick! All I need to do now is to be able to write those sequence to a new fasta file

ADD REPLY • link 10.5 years ago by tyleraelliott ▴ 50

0

Entering edit mode

see edit for file printing

ADD REPLY • link 10.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

Note that the OP is using Python3, so print >>out, ">"+record.idshould be print(">"+record.id, file=out), or even better yet use Bio.SeqIO to write a FASTA record using the Seq object.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Matt Shirley 10k

0

Entering edit mode

I did not know that python 3 used different print, but thanks for the note. Edited

ADD REPLY • link 10.5 years ago by Whetting ★ 1.6k

0

Entering edit mode

10.5 years ago

tyleraelliott ▴ 50

I was going to post last night that the previous coding still gave me an error but I had reached my limit. This new coding works! Although I had to make one change to it:

if "PIGGYBAC" in record.id.upper())

modified to

if "PIGGYBAC" in r.id.upper.())

Now it works just fine. Thanks to Dr. Cock, Matt Shirley and Whetting for all your help.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by tyleraelliott ▴ 50

0

Entering edit mode

I'm glad we could help. Note BioStars (and StackExchange) work on a question and answer format. Your reply would be better as a comment, and you can tick an answer to accept it (which is linked to the user scoring system which some people find motivating).

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Peter 6.0k

Ram · Accepted Answer · 2015-01-22

The most elegant solution would use an iterator approach. This is based on the "Filtering a sequence file" example in the tutorial only rather than filtering based on the identifiers, here we are checking the description:

from Bio import SeqIO
input_file = "big_file.fasta"
output_file = "piggybac.fasta"
# The next line is a Python generator expression - memory efficient!
records = (r for r in SeqIO.parse(input_file, "fasta") if "PIGGYBAC" in r.id.upper())
count = SeqIO.write(records, output_file, "fasta")
print("Saved %i records from %s to %s" % (count, input_file, output_file))

Updated with variable name correction pointed out by tyleraelliott