Question

B/Sam To Fasta In R

1

Entering edit mode

12.2 years ago

laemtao ▴ 40

I'm looking for a way in Bioconductor to convert a bam file to a filtered fasta file.

Currently, I've come across two methods to read bam files in R:

CDS1 <- readGappedAlignments("aln.bam")
CDS2 <- scanBam("aln.bam")

I like using the first method as I can seamlessly utilize the *ranges packages. Is there a way to retrieve sequence data from a readGappedAlignments object? If not, what is the preferred method to construct a data.frame with: "rname", "strand", "start", "stop", "width", "sequence" from a bam file inside of R?

Thanks!

r bioconductor bam sam fasta • 4.5k views

ADD COMMENT • link updated 10.9 years ago by Biostar 20 • written 12.2 years ago by laemtao ▴ 40

score 3 · Answer 1 · 2012-09-12

You can configure your readGappedAlignments call with a custom param object that returns the things you like. So if you are asking if you want to get the sequences of the reads aligned you can do so like this:

library(GenomicRanges)
library(Rsamtools)
bam <- system.file("extdata", "ex1.bam", package="Rsamtools")
ga <- readGappedAlignments(bam, param=ScanBamParam(what="seq"))
head(ga)
GappedAlignments with 6 alignments and 1 metadata column:
      seqnames strand       cigar    qwidth     start       end     width      ngap |                                  seq
         <Rle>  <Rle> <character> <integer> <integer> <integer> <integer> <integer> |                       <DNAStringSet>
  [1]     seq1      +         36M        36         1        36        36         0 | CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
  [2]     seq1      +         35M        35         3        37        35         0 |  CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT
  [3]     seq1      +         35M        35         5        39        35         0 |  AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC
  [4]     seq1      +         36M        36         6        41        36         0 | GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT
  [5]     seq1      +         35M        35         9        43        35         0 |  GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG
  [6]     seq1      +         35M        35        13        47        35         0 |  ATTGTAAATGTGTGGTTTAACTCGTCCCTGGCCCA
  ---
  seqlengths:
   seq1 seq2
   1575 1584

Converting this to the data.frame you are after should be pretty straight forward.

If, however, you are rather interested in getting the genomic sequences that are fenceposted by the ranges returned in your GappedAlignments, you can use the getSeq function from BSgenome package to get the genomic sequences you are after -- I suspect you aren't asking for this, so I'll leave this as an exercise to the reader.