Parsing Fasta Based On Header
4
1
Entering edit mode
13.2 years ago
Matt ▴ 110

Hi All,

I have a large fasta file that I am trying to sort into multiple smaller files based on their ID's. The File starts like this:

>1MUSgi|116063569|ref|NM_010065.2|
AGGGG-TGGTTGACCATCAACAACATCGGCATCATG-AAGGGAGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>2MUSgi|118130562|ref|NM_019880.3|
CGGCCCGCGGCTCAGCCGTCGGCGCGCAGGATGGACGGCG-A
>2MUSgi|118130562|ref|NM_019880.3|
AGTTTAGCCAGGCCCTGGCCATCCGGAGCTACACCAAGTTTGTGATGGGGATTGCAGTGAGCATGCTGACCTACCCCTTCCTGCTCGTTGGAGATCTCATGGCAGTGAACAACCCTGGAGTAACCT
>1HOMOgi|59853098|ref|NM_004408.2|
GCATCCGCAAGGGCTGGCTGACTATCAATAATATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTTGTGCTGACTGCTGAG-
>1
GGTGATCCGCAGGGGCTGGCTGACCATCAACAACATTGGCATCATGAAAGGGGGCTCCAAGGAGTACTGGTTCGTGCTCACTGCCGAGTCACTGTCCTGGTACAAGGACGAAGAGGAGAAAGAGAG
>2
CGCGCCAGCACCGGCCCGCGGCGCAGCCCTCGGCCCGCAGGATGGACGGCGCGTCCGGGGGCCTGGGCTCTGGGGATAGTGCC

I want all of the ID's beginning with 1's to go on one file, ID's starting with 2's in another, and so on....

I have been trying to use SeqIO

for record in SeqIO.parse(open("QHM-clean.fasta", "rU"), "fasta") :
    for i in range(1,3):     
        if record.id %i:  #this needs to be changed "if record.id STARTS WITH %i"
            print record.id
            output_handle = open("%i.fasta", "w") #naming in this manner does not seem to be allowed
            SeqIO.write(output_handle, "fasta")
        output_handle.close()

But this seems to be not working in many obvious ways... Can anybody help me out with some advice on how to proceed?

biopython python fasta • 6.5k views
ADD COMMENT
4
Entering edit mode

The percent operator in Python is a numerical operator, while the record ID is a string. Try record.id.startswith("1") for a Pythonic approach. Also, Biopyhton's SeqIO.write(...) function takes three required arguments, not just two. And there was also something wrong with your indentation. I'll try to give a more detailed answer later.

ADD REPLY
0
Entering edit mode

I see now you've also asked on the Biopython mailing list, and got an answer there.

BioStar needs a clearer policy on "double requests" like this...

ADD REPLY
10
Entering edit mode
13.2 years ago

Simply, using awk:

/^$/    { next; }
/^>/    { filename="other.fa";}
/^>1/   { filename="1.fa"; }
/^>2/   { filename="2.fa"; }
    {
    print >> filename;
    }

and:

awk -f script.awk < file.fa
ADD COMMENT
2
Entering edit mode

@Matt no need for a loop, awk will call the code listed in the script for each line of the file, so run it on the whole file

ADD REPLY
0
Entering edit mode

how do I incorporate this into a loop.. i have several thousand records to process in this manner.

ADD REPLY
0
Entering edit mode

what is a 'record?'. this awk script will work if your fasta file contains more than one sequence.

ADD REPLY
0
Entering edit mode

Given the way we use the term in the Biopython documentation, a FASTA record is a ">" line and the sequence that goes with it. So yes, 1000s of FASTA sequences.

ADD REPLY
3
Entering edit mode
13.2 years ago
Peter 6.0k

OK, let's go with what is probably the simplest answer using Biopython's SeqIO module - scan the file three times (depending on what you meant by "and so on"):

from Bio import SeqIO
input_file = "QHM-clean.fasta"

wanted = (r for r in SeqIO.parse(input_file, "fasta") \
          if r.id.startswith("1"))
count = SeqIO.write(wanted, "starts_1.fasta", "fasta")
print "Saved %i records starting with 1" % count

wanted = (r for r in SeqIO.parse(input_file, "fasta") \
          if r.id.startswith("2"))
count = SeqIO.write(wanted, "starts_2.fasta", "fasta")
print "Saved %i records starting with 2" % count

wanted = (r for r in SeqIO.parse(input_file, "fasta") \
          if not (r.id.startswith("1") or r.id.startswith("2")))
count = SeqIO.write(wanted, "starts_other.fasta", "fasta")
print "Saved %i records starting with neither 1 nor 2" % count

That uses Python generator expressions so only one record is in memory at a time, so it should work even with large files. This could be improved to make just one pass through the input file, and write the three output files at the same time - but that is more complicated.

ADD COMMENT
2
Entering edit mode
13.2 years ago

You can do this with Biopieces www.biopieces.org) like this:

read_fasta -i in.fna | split_vals -k SEQ_NAME -d '|' | write_fasta_files -k SEQ_NAME_0 -d . -x
ADD COMMENT
1
Entering edit mode
13.2 years ago
John Kern ▴ 60

Hello,

Python provides a nice little regular expression package. Consider

from Bio import SeqIO
import re

def split():
# copy of data from post
inport = open("test.fasta","r")
# output for ones
outport1 = open("one.fasta", "w")
# output for twos
outport2 = open("two.fasta","w")
for record in SeqIO.parse(inport,"fasta"):
    if re.match("^1", record.id):
        SeqIO.write([record], outport1, "fasta")
    elif re.match("^2",record.id):
        SeqIO.write([record], outport2, "fasta")
inport.close()
outport1.close()
outport2.close()
ADD COMMENT

Login before adding your answer.

Traffic: 3461 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6