You can do this with a combination of Biopython for writing the Fasta files and pysam for reading the BAM files:
import os
import sys
import pysam
from Bio import SeqIO, Seq, SeqRecord
def main(in_file):
out_file = "%s.fa" % os.path.splitext(in_file)[0]
with open(out_file, "w") as out_handle:
# Write records from the BAM file one at a time to the output file.
# Works lazily as BAM sequences are read so will handle large files.
SeqIO.write(bam_to_rec(in_file), out_handle, "fasta")
def bam_to_rec(in_file):
"""Generator to convert BAM files into Biopython SeqRecords.
"""
bam_file = pysam.Samfile(in_file, "rb")
for read in bam_file:
seq = Seq.Seq(read.seq)
if read.is_reverse:
seq = seq.reverse_complement()
rec = SeqRecord.SeqRecord(seq, read.qname, "", "")
yield rec
if __name__ == "__main__":
main(*sys.argv[1:])
Run it with:
% python bam_to_fasta.py your_file.bam
and the output will be in your_file.fa
This works well, thanks Brad. However I found a couple of reasons to consider Picard SamToFastq suggested by Steve Lianoglou instead.
SamToFastq will properly split paired-end reads into two fastq files.
SamToFastq seems to be much faster (about 1/5th the time in my informal testing).
Lance, that makes good sense. Picard's SamToFastq is great; I even wrote a Galaxy wrapper for it in the community tool shed: http://community.g2.bx.psu.edu/tool/browse_tools?sort=name&f-state=approved&id=91f644c6d57806f8&webapp=community&operation=view_tool&f-Category.name=SAM
Alternative approach using SAMtools (source: SeqAnswers)
samtools view filename.bam | \
awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta
The problem with this is that the original strand of the sequence is lost, as SAM always reports sequences on the REF strand. To recreate the original sequence that was aligned, one needs to interrogate the FLAG field and ask if the read was actually aligned to the "-" strand. If so, it needs to be RC'ed. See line 6 of Brad's "bam_to_rec" function above.
If you're willing to consider using something that's not Python-maden, Picard comes with the SamToFastq utility you can use from the command line that works on SAM and BAM files.
I'm learning so let's see if I understand:
*BAM is an animated graphics file format. http://iesdp.gibberlings3.net/file_formats/ie_formats/bam_v1.htm
*Fasta is a bioinformatics specific format. http://en.wikipedia.org/wiki/FASTA_format
Doesn't it make more sense to go from Fasta to BAM than the other way around? If you had a GIF file it would be pretty tough to reverse that file back to a language that described what was going on in the image. Unless there's quite a bit of metadata built into the BAM file.
I'm learning about file formats
BAM - is compressed SAM.
SAM – Sequence Alignment/Map format, in which the results of the 1000 Genomes Project will be released.
FASTA – The FASTA file format, for sequence data. Sometimes also given as FNA or FAA (Fasta Nucleic Acid or Fasta Amino Acid).
Both seem to be oriented towards sequencing data. Why would FASTA get lumped in with chemical data but not SAM?
And in more recent versions of samtools, you could just use samtools fasta
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Hi Sazzad. Can you tell us a little more about your problem and why you are choosing to solve it using biopython?
It will be easy if you clear what you did so far!