I have an alignment in standard FASTA format and would like to convert it to the PHYLIP format. That, by itself, is not complicated. The problem is that the alignment I'm trying to work with is the full current release of the SILVA database. Every implementation of this conversion I've tried so far attempts to load all the sequences into memory, which always spills into swap and eventually gets the process killed on my 8 GB workstation.
Do you know of a quick solution?
Thanks in advance!
In case you're wondering, I'm trying to convert it to PHYLIP as input for RAxML.
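For context, a PHYLIP alignment starts with a line giving the number of sequences and the alignment length, which is part of what makes a one-pass streaming conversion awkward. A tiny made-up example (relaxed names, not the strict 10-character ones):
 3 10
Seq_A GATTACA--T
Seq_B GATTACAAAT
Seq_C GA-TACA--T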
Ah well, I wrote a Python script that does it in constant memory.
#!/usr/bin/env python
"""
Convert FASTA alignments to relaxed PHYLIP ones in constant memory.
Written by Lucas Sinclair.
Kopimi.

You can use this script from the shell like this::

    $ fasta_to_phylip seqs.fasta seqs.phylip
"""

# The libraries we need #
import sys, os, random, string, re

###############################################################################
class Sequence(object):
    """The Sequence object has a string *header* and
    various representations."""

    def __init__(self, header, seq):
        # Keep only the identifier (first word after '>') as the header #
        self.header = re.findall(r'^>(\S+)', header)[0]
        self.seq = seq

    def __len__(self):
        return len(self.seq)

    @property
    def phylip(self):
        # Relaxed PHYLIP: identifier, a space, then the sequence with '.' gaps turned into '-' #
        return self.header + " " + self.seq.replace('.', '-') + "\n"

    @property
    def fasta(self):
        return ">" + self.header + "\n" + self.seq + "\n"

def fasta_parse(path):
    """Reads the file at *path* and yields
    Sequence objects in a lazy fashion."""
    header = ''
    seq = ''
    with open(path) as f:
        for line in f:
            line = line.strip('\n')
            if line.startswith('>'):
                if header: yield Sequence(header, seq)
                header = line
                seq = ''
                continue
            seq += line
        if header: yield Sequence(header, seq)

###############################################################################
# Get the shell arguments #
fa_path = sys.argv[1]
ph_path = sys.argv[2]
# Check that the path is valid #
if not os.path.exists(fa_path): raise Exception("No file at %s." % fa_path)
# Use our lazy parser #
seqs = fasta_parse(fa_path)
# Write the converted records to a temporary file #
tm_path = ph_path + '.' + ''.join(random.choice(string.ascii_letters) for i in range(10))
# Count the sequences #
count = 0
with open(tm_path, 'w') as f:
    for seq in seqs:
        f.write(seq.phylip)
        count += 1
# Add number of entries and length at the top #
# ('seq' is still bound to the last Sequence from the loop above) #
with open(tm_path, 'r') as old, open(ph_path, 'w') as new:
    new.write(" " + str(count) + " " + str(len(seq)) + "\n")
    new.writelines(old)
# Clean up #
os.remove(tm_path)
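The temporary file is only there because the PHYLIP header needs the sequence count and alignment length before any records, so everything is streamed once and the header is prepended afterwards. If Biopython happens to be installed, roughly the same two-pass idea can be sketched around its lazy FASTA parser instead (untested on the full SILVA release; file names are placeholders):
# Two-pass variant built on Biopython's lazy FASTA parser (Bio.SeqIO).
# First pass counts the records and notes the alignment length,
# second pass writes the relaxed PHYLIP lines, so memory stays constant.
import sys
from Bio import SeqIO

fa_path, ph_path = sys.argv[1], sys.argv[2]

count, length = 0, 0
for record in SeqIO.parse(fa_path, "fasta"):
    count += 1
    length = len(record.seq)

with open(ph_path, "w") as out:
    out.write(" %i %i\n" % (count, length))
    for record in SeqIO.parse(fa_path, "fasta"):
        out.write("%s %s\n" % (record.id, str(record.seq).replace(".", "-")))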
As you mention, this process is a really easy data transformation. I think FASTA to NEXUS would be a little easier if you are writing your own parser, and RAxML will take a NEXUS file also. You say you are not having a problem with the actual parsing, so I won't address how you are doing that here.
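For reference, a minimal NEXUS alignment block looks roughly like this (a toy example typed from memory, so double-check the FORMAT line against whatever tool reads it):
#NEXUS
BEGIN DATA;
    DIMENSIONS NTAX=3 NCHAR=10;
    FORMAT DATATYPE=DNA GAP=- MISSING=?;
    MATRIX
        Seq_A GATTACA--T
        Seq_B GATTACAAAT
        Seq_C GA-TACA--T
    ;
END;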
The problem you are having is a size problem (it's not impossible to parse the entire SILVA database, which is anywhere from half a GB to 2 GB depending on which version you use), but depending on how you are parsing it, you may be able to use your computer's memory more economically. If you can't parse with less memory, then I would suggest a cluster over your desktop/laptop for large jobs such as this.
I have a question: why do you want to construct a phylogeny from the entire 2 GB contents of the SILVA database? The LSU and SSU genes aren't evenly informative across all domains. Not really sure where you are going with this, but it seems like a lot of work for naught. Even if you are using a large phylogeny to place environmental amplicons, it's much less cumbersome to use a selective database (which can still be very large). I would recommend TopiaryExplorer for visualizing the phylogenetic placement of the OTU data you would be inferring from your BLAST hits or alignment against the SILVA database.
Thanks for your answer. I didn't know that RAxML also takes NEXUS files! I could get lots more RAM, but this is a problem that doesn't require much memory: it's O(1) in memory if you implement it right. I'm not attempting to construct a phylogeny; I wanted to try the "-f e" option to compute branch lengths and then use the resulting model with "pplacer" to place sequences. I looked at TopiaryExplorer, but I was looking for a solution that doesn't require a GUI so I can automate the processing.
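Concretely, the branch-length call I have in mind is something along these lines (going by the RAxML manual; the file names are just placeholders):
$ raxmlHPC -f e -s seqs.phylip -t reference.tree -m GTRGAMMA -n eval_branches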
Thanks for posting this!