Question

Parsing Fasta Based On Header

1

Entering edit mode

13.1 years ago

Matt ▴ 110

Hi All,

I have a large fasta file that I am trying to sort into multiple smaller files based on their ID's. The File starts like this:

>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC

I want all of the ID's beginning with 1's to go on one file, ID's starting with 2's in another, and so on....

I have been trying to use SeqIO

for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
    for i in range(1,3):     
        if record.id %i:  #this needs to be changed "if record.id STARTS WITH %i"
            print record.id
            output_handle = open("%i.fasta", "w") #naming in this manner does not seem to be allowed
            SeqIO.write(output_handle, "fasta")
        output_handle.close()

But this seems to be not working in many obvious ways... Can anybody help me out with some advice on how to proceed?

biopython python fasta • 6.4k views

ADD COMMENT • link updated 13.1 years ago by John Kern ▴ 60 • written 13.1 years ago by Matt ▴ 110

4

Entering edit mode

The percent operator in Python is a numerical operator, while the record ID is a string. Try record.id.startswith("1") for a Pythonic approach. Also, Biopyhton's SeqIO.write(...) function takes three required arguments, not just two. And there was also something wrong with your indentation. I'll try to give a more detailed answer later.

ADD REPLY • link 13.1 years ago by Peter 6.0k

0

Entering edit mode

I see now you've also asked on the Biopython mailing list, and got an answer there.

BioStar needs a clearer policy on "double requests" like this...

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 13.1 years ago by Peter 6.0k

score 10 · Answer 1 · 2011-11-01

10

Entering edit mode

13.1 years ago

Pierre Lindenbaum 164k

Simply, using awk:

/^$/    { next; }
/^>/    { filename="other.fa";}
/^>1/   { filename="1.fa"; }
/^>2/   { filename="2.fa"; }
    {
    print >> filename;
    }

and:

awk -f script.awk < file.fa

ADD COMMENT • link 13.1 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

@Matt no need for a loop, awk will call the code listed in the script for each line of the file, so run it on the whole file

ADD REPLY • link 13.1 years ago by Istvan Albert 101k

0

Entering edit mode

how do I incorporate this into a loop.. i have several thousand records to process in this manner.

ADD REPLY • link 13.1 years ago by Matt ▴ 110

0

Entering edit mode

what is a 'record?'. this awk script will work if your fasta file contains more than one sequence.

ADD REPLY • link 13.1 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Given the way we use the term in the Biopython documentation, a FASTA record is a ">" line and the sequence that goes with it. So yes, 1000s of FASTA sequences.

ADD REPLY • link 13.1 years ago by Peter 6.0k

score 3 · Answer 2 · 2011-11-01

OK, let's go with what is probably the simplest answer using Biopython's SeqIO module - scan the file three times (depending on what you meant by "and so on"):

from Bio import SeqIO
input_file = "QHM-clean.fasta"

wanted = (r for r in SeqIO.parse(input_file, "fasta") \
          if r.id.startswith("1"))
count = SeqIO.write(wanted, "starts_1.fasta", "fasta")
print "Saved %i records starting with 1" % count

wanted = (r for r in SeqIO.parse(input_file, "fasta") \
          if r.id.startswith("2"))
count = SeqIO.write(wanted, "starts_2.fasta", "fasta")
print "Saved %i records starting with 2" % count

wanted = (r for r in SeqIO.parse(input_file, "fasta") \
          if not (r.id.startswith("1") or r.id.startswith("2")))
count = SeqIO.write(wanted, "starts_other.fasta", "fasta")
print "Saved %i records starting with neither 1 nor 2" % count

That uses Python generator expressions so only one record is in memory at a time, so it should work even with large files. This could be improved to make just one pass through the input file, and write the three output files at the same time - but that is more complicated.

score 2 · Answer 3 · 2011-11-01

2

Entering edit mode

13.1 years ago

Martin A Hansen 3.0k

You can do this with Biopieces www.biopieces.org) like this:

read_fasta -i in.fna | split_vals -k SEQ_NAME -d '|' | write_fasta_files -k SEQ_NAME_0 -d . -x

ADD COMMENT • link 13.1 years ago by Martin A Hansen 3.0k

score 1 · Answer 4 · 2011-11-02

Hello,

Python provides a nice little regular expression package. Consider

from Bio import SeqIO
import re

def split():
# copy of data from post
inport = open("test.fasta","r")
# output for ones
outport1 = open("one.fasta", "w")
# output for twos
outport2 = open("two.fasta","w")
for record in SeqIO.parse(inport,"fasta"):
    if re.match("^1", record.id):
        SeqIO.write([record], outport1, "fasta")
    elif re.match("^2",record.id):
        SeqIO.write([record], outport2, "fasta")
inport.close()
outport1.close()
outport2.close()