If you've got the RAM to load your entire BAM file into memory, perhaps you might try to see if data.table is up to the task.
Here's an example that you might use to group reads by qname
using the data that comes with the Rsamtools package:
libray(Rsamtools)
library(data.table)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
res0 <- scanBam(fl)[[1]]
dt <- data.table(qname=res0$qname, rname=res0$rname, strand=res0$strand,
start=res0$pos, end=res0$pos + res0$qwidth - 1L,
key="qname")
Note the value for key
. If you are hurting for memory, tailor your call to scanBam
to only return the columns from the BAM file that you want.
Now, to see how many places each read maps to, you can create a summary table like so:
mmap <- dt[, list(N=.N), by="qname"]
mmap
qname N
[1,] B7_589:1:101:825:28 2
[2,] B7_589:1:110:543:934 2
[3,] B7_589:1:122:337:968 2
[4,] B7_589:1:122:77:789 2
[5,] B7_589:1:168:69:249 2
[6,] B7_589:1:29:529:379 2
[7,] B7_589:2:30:644:942 2
[8,] B7_589:2:73:730:487 2
[9,] B7_589:2:9:49:661 2
[10,] B7_589:3:71:478:175 2
First 10 rows of 1699 printed.
.N
is a variable that tells you how many rows are in the group you are currently working in. The N
column tells you how many times the qname
appears in the BAM file you just loaded.
There is a small learning curve to learn how to use data.table
, so make sure to read the FAQ and other docs. You would basically use it in similar situations you would use plyr::ddply
or base::tapply
(among others), and in situations like you have here (lots of rows in a data.frame), you'll see that it will be super performant (in time and memory usage) as compared to equivalent approaches.
Why not just:
R is great, but it is not the right one for this task.
Thanks! It works and much much faster!
Doing it by hand with apply() and as.matrix would be my first hint. And maybe you can speed it up using the aaply() function (parallel apply) from the "plyr" package.