Question

Concatenate Multiple .Fasta Files

1

12.3 years ago

charles.bridges ▴ 70

Hi,

I'm trying to concatenate hundreds of .fasta files into a single, large fasta file containing all of the sequences. I haven't found a specific method to accomplish this in the forums. I did come across this code from http://zientzilaria.heroku.com/blog/2007/10/29/merging-single-or-multiple-sequence-fasta-files, which I have adapted a bit.

fasta.py contains the following code:

class fasta:
    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence

def read_fasta(file):
    items = []
    index = 0
    for line in file:
        if line.startswith(">"):
           if index >= 1:
               items.append(aninstance)
           index+=1
           name = line[:-1]
           seq = ''
           aninstance = fasta(name, seq)
        else:
           seq += line[:-1]
           aninstance = fasta(name, seq)

    items.append(aninstance)
    return items

And here is the adapted script to concatenate .fasta files:

import sys
import glob
import fasta

#obtain directory containing single fasta files for query
filepattern = input('Filename pattern to match: ')

#obtain output directory
outfile = input('Filename of output file: ')

#create new output file
output = open(outfile, 'w')

#initialize lists
names = []
seqs = []

#glob.glob returns a list of files that match the pattern
for file in glob.glob(filepattern):

    print ("file: " + file)

    #we read the contents and an instance of the class is returned
    contents = fasta.read_fasta(open(file).readlines())

    #a file can contain more than one sequence so we read them in a loop
    for item in contents:
        names.append(item.name)
        seqs.append(item.sequence)

#we print the output
for i in range(len(names)):
    output.write(names[i] + '\n' + seqs[i] + '\n\n')

output.close()
print("done")

The script finds the fasta files, but the newly created output file contains no sequences. The error I receive comes from fasta.py, which is beyond my ability to debug:

Traceback (most recent call last):
  File "C:\Python32\myfiles\test\3\Fasta_Concatenate.py", line 28, in <module>
    contents = fasta.read_fasta(open(file).readlines())
  File "C:\Python32\lib\fasta.py", line 18, in read_fasta
    seq += line[:-1]
UnboundLocalError: local variable 'seq' referenced before assignment

Any suggestions? Thanks!

fasta python • 21k views

ADD COMMENT • link updated 12.3 years ago by Sukhi Singh 11k • written 12.3 years ago by charles.bridges ▴ 70

7

Why not simply use 'cat *fasta > largenewfasta'?

ADD REPLY • link 12.3 years ago by JC 13k

0

It looks like he's on Windows...

ADD REPLY • link 12.3 years ago by cl.parallel ▴ 140

5

Wow, MS Windows. I haven't touched it in years except for presentations. In that case, you can use the DOS command: copy /a *.fasta newfasta.txt

ADD REPLY • link 12.3 years ago by JC 13k

0

Thanks for the help, easier than I expected ;)

ADD REPLY • link 12.3 years ago by charles.bridges ▴ 70

0

cat doesn't seem to work for ~5000 genomes (protein fasta files). I get the following error: /bin/cat: Argument list too long. Any suggestions?

ADD REPLY • link 7.9 years ago by Janani Ravi • 0
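
"Argument list too long" typically appears when the shell expands a wildcard into more arguments than the operating system accepts for a single command. One possible workaround, sketched here in Python, is to expand the pattern inside the script so that no long argument list is ever passed to a program. The function name concatenate_files and the example paths are hypothetical, and the sketch assumes each file fits in memory.

```python
import glob
import os


def concatenate_files(pattern, out_path):
    """Append every file matching `pattern` to `out_path`; return how many files were copied."""
    out_abs = os.path.abspath(out_path)
    # The wildcard is expanded here, inside Python, not by the shell.
    # The output file itself is skipped in case it matches the pattern.
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                data = src.read()
            out.write(data)
            # Make sure the next file's '>' header starts on a new line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(paths)
```

A call might look like concatenate_files("genomes/*.fasta", "all_proteins.fasta"), where both paths are placeholders for your own layout.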

0

It looks like you need to look at line 18 of fasta.py. That line assumes seq has already been initialised to '', but seq is only set when a header line is read, so it is 'referenced before assignment' whenever a file does not start with a '>' line (e.g. it begins with a blank line). Learning to read tracebacks is tricky but very important and useful.
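
As a rough illustration of that fix, here is one way read_fasta could be rewritten so that the name and sequence are initialised before the loop. It keeps the fasta class and function name from the question; the choice to skip blank lines and to ignore any text before the first header is an assumption, not part of the original script.

```python
class fasta:
    """One FASTA record: the header line and its sequence."""

    def __init__(self, name, sequence):
        self.name = name
        self.sequence = sequence


def read_fasta(lines):
    """Parse an iterable of lines into a list of fasta records."""
    items = []
    name = None  # stays None until the first '>' header is seen
    seq_parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines carry no sequence data
        if line.startswith(">"):
            if name is not None:
                items.append(fasta(name, "".join(seq_parts)))
            name = line
            seq_parts = []
        elif name is not None:
            seq_parts.append(line)
        # Assumption: text before the first header is ignored, not reported.
    if name is not None:
        items.append(fasta(name, "".join(seq_parts)))
    return items
```

With this version a leading blank line no longer raises UnboundLocalError, and an empty file simply yields an empty list.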

This is the very literal answer, but you should check out Biopython. This is the kind of common task that doesn't need to be repeatedly implemented. See the SeqIO module: http://biopython.org/wiki/SeqIO

ADD REPLY • link 12.3 years ago by cl.parallel ▴ 140

0

I actually use Biopython, and have thought of using the SeqIO module to read through the sequence and add it to a file but am unsure how to do so.

ADD REPLY • link 12.3 years ago by charles.bridges ▴ 70

1

You really should just concatenate the files as the other commenters have recommended. To answer your question, though: if you want to do anything else with the files, such as verifying that their format is correct, you can use SeqIO:

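import glob

from Bio import SeqIO

out_name = 'large_fasta.fasta'
# example file list: every .fasta file in the current directory except the output
file_list = [f for f in sorted(glob.glob('*.fasta')) if f != out_name]

with open(out_name, 'w') as w_file:
    for filen in file_list:
        # the 'rU' mode was removed in Python 3.11; plain 'r' already uses universal newlines
        with open(filen, 'r') as o_file:
            seq_records = SeqIO.parse(o_file, 'fasta')
            SeqIO.write(seq_records, w_file, 'fasta')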

ADD REPLY • link 12.3 years ago by cl.parallel ▴ 140

0

Thank you for your help!

ADD REPLY • link 12.3 years ago by charles.bridges ▴ 70

score 8 · Answer 1 · 2012-07-30

8

12.3 years ago

Sukhi Singh 11k

Yeah, why not just use cat from the Linux utility toolbox? Two more specialised toolkits that might be useful are catfasta2phyml and phyutility. The Sequence Manipulation Suite (SMS) is useful when you want to merge fasta entries into a single entry:

>9 base sequence
accgactrm

>20 base sequence
gggggaaaaatttttccccc

>result
accgactrmgggggaaaaatttttccccc

Cheers
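
For anyone who prefers to do the same merge in a script, the following is a minimal Python sketch of the idea shown in the example above. It is not how SMS itself works; the function name merge_entries and the default ">result" header are chosen here only for illustration.

```python
def merge_entries(fasta_text, header=">result"):
    """Join the sequences of all entries in `fasta_text` under a single header."""
    stripped = (line.strip() for line in fasta_text.splitlines())
    # Keep only sequence lines: drop blank lines and '>' header lines.
    sequence = "".join(s for s in stripped if s and not s.startswith(">"))
    return header + "\n" + sequence + "\n"


example = """>9 base sequence
accgactrm

>20 base sequence
gggggaaaaatttttccccc
"""

print(merge_entries(example), end="")
```

Run on the two entries above, this prints the ">result" header followed by accgactrmgggggaaaaatttttccccc.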

ADD COMMENT • link 12.3 years ago by Sukhi Singh 11k

0

Sorry, I was mistaken.

ADD REPLY • link 7.2 years ago by bio90029 ▴ 10