Convert Bam File To Fasta File
5
4
Entering edit mode
13.7 years ago
Sazzad ▴ 40

How can I convert bam file to fasta file using Biopython command? please share if some one have any idea.

Thanks in advance

sazzad

conversion bam fasta biopython • 33k views
ADD COMMENT
0
Entering edit mode

Hi Sazzad. Can you tell us a little more about your problem and why you are choosing to solve it using biopython?

ADD REPLY
0
Entering edit mode

It will be easy if you clear what you did so far!

ADD REPLY
14
Entering edit mode
13.7 years ago

You can do this with a combination of Biopython for writing the Fasta files and pysam for reading the BAM files:

import os
import sys

import pysam
from Bio import SeqIO, Seq, SeqRecord

def main(in_file):
    out_file = "%s.fa" % os.path.splitext(in_file)[0]
    with open(out_file, "w") as out_handle:
        # Write records from the BAM file one at a time to the output file.
        # Works lazily as BAM sequences are read so will handle large files.
        SeqIO.write(bam_to_rec(in_file), out_handle, "fasta")

def bam_to_rec(in_file):
    """Generator to convert BAM files into Biopython SeqRecords.
    """
    bam_file = pysam.Samfile(in_file, "rb")
    for read in bam_file:
        seq = Seq.Seq(read.seq)
        if read.is_reverse:
            seq = seq.reverse_complement()
        rec = SeqRecord.SeqRecord(seq, read.qname, "", "")
        yield rec

if __name__ == "__main__":
    main(*sys.argv[1:])

Run it with:

% python bam_to_fasta.py your_file.bam

and the output will be in your_file.fa

ADD COMMENT
0
Entering edit mode

This works well, thanks Brad. However I found a couple of reasons to consider Picard SamToFastq suggested by Steve Lianoglou instead.

SamToFastq will properly split paired-end reads into two fastq files.

SamToFastq seems to be much faster (about 1/5th the time in my informal testing).

ADD REPLY
0
Entering edit mode

One more thing to note: SamToFastq actually outputs FastQ with quality scores, while this solution outputs Fasta files (no qualities). I actually modified Brad's solution to output FastQ for my testing, since that is what I needed.

ADD REPLY
0
Entering edit mode

Lance, that makes good sense. Picard's SamToFastq is great; I even wrote a Galaxy wrapper for it in the community tool shed: http://community.g2.bx.psu.edu/tool/browse_tools?sort=name&f-state=approved&id=91f644c6d57806f8&webapp=community&operation=view_tool&f-Category.name=SAM

ADD REPLY
0
Entering edit mode

How to add sequencing quality and mapped or unmapped reads as parameters?

Thanks

ADD REPLY
11
Entering edit mode
13.7 years ago

Alternative approach using SAMtools (source: SeqAnswers)

samtools view filename.bam | \
awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta
ADD COMMENT
3
Entering edit mode

The problem with this is that the original strand of the sequence is lost, as SAM always reports sequences on the REF strand. To recreate the original sequence that was aligned, one needs to interrogate the FLAG field and ask if the read was actually aligned to the "-" strand. If so, it needs to be RC'ed. See line 6 of Brad's "bam_to_rec" function above.

ADD REPLY
0
Entering edit mode

As always, the usefulness of the approach depends on your intended use. But thanks for pointing that out, I didn't think of it.

ADD REPLY
0
Entering edit mode

I created an account to up vote this!

thanks for the help.

ADD REPLY
0
Entering edit mode

In certain cases, re-generating fastq is possible with this small modification: samtools view filename.bam | awk '{OFS="\t"; print "@"$1"\n"$10"\n+\n"$11}' >| filename.fastq I have used this successfully with bam files generated by Tophat (on unstranded data).

ADD REPLY
4
Entering edit mode
13.7 years ago

If you're willing to consider using something that's not Python-maden, Picard comes with the SamToFastq utility you can use from the command line that works on SAM and BAM files.

ADD COMMENT
0
Entering edit mode
13.7 years ago
Ben Lange ▴ 210

BAM is not in the list but OpenBabel seems like a pretty good converter application. Maybe someone can come up with a plugin if BAM is common enough.

ADD COMMENT
0
Entering edit mode

but it is mostly designed to handle the chemical file formats.

ADD REPLY
0
Entering edit mode

I'm learning so let's see if I understand:

*BAM is an animated graphics file format. http://iesdp.gibberlings3.net/file_formats/ie_formats/bam_v1.htm

*Fasta is a bioinformatics specific format. http://en.wikipedia.org/wiki/FASTA_format

Doesn't it make more sense to go from Fasta to BAM than the other way around? If you had a GIF file it would be pretty tough to reverse that file back to a language that described what was going on in the image. Unless there's quite a bit of metadata built into the BAM file.

ADD REPLY
0
Entering edit mode

I'm learning about file formats

BAM - is compressed SAM.
SAM – Sequence Alignment/Map format, in which the results of the 1000 Genomes Project will be released.
FASTA – The FASTA file format, for sequence data. Sometimes also given as FNA or FAA (Fasta Nucleic Acid or Fasta Amino Acid).

Both seem to be oriented towards sequencing data. Why would FASTA get lumped in with chemical data but not SAM?

ADD REPLY
0
Entering edit mode
6.8 years ago

And in more recent versions of samtools, you could just use samtools fasta

ADD COMMENT

Login before adding your answer.

Traffic: 1653 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6