Fast Implementation Of Table(Table()) For Huge Amount Of Elements In R?
3
0
Entering edit mode
12.6 years ago
Ge ▴ 80

I want to get the BAM multi-matching information. For instance, with Rsamtools package,

param = ScanBamParam(what="qname")
bamFlag(param) = scanBamFlag(isUnmappedQuery=FALSE)
bamReads = scanBam(bamFile, param=param)[[1]]$qname

Now I get the bamReads (the IDs of transcripts) in this BAM file.

Then I want to get the information of multi-matching by doing table(table(bamReads)). However, this step is really slow as bamReads can contain millions of IDs from BAM file.

Do you have any fast implement of doing this table(table()) step? Any C interface or other tools?

Thanks!

r • 3.3k views
ADD COMMENT
3
Entering edit mode

Why not just:

samtools view yours.bam | cut -f1 | sort | uniq -c > output.txt

R is great, but it is not the right one for this task.

ADD REPLY
0
Entering edit mode

Thanks! It works and much much faster!

ADD REPLY
0
Entering edit mode

Doing it by hand with apply() and as.matrix would be my first hint. And maybe you can speed it up using the aaply() function (parallel apply) from the "plyr" package.

ADD REPLY
2
Entering edit mode
12.6 years ago

If you've got the RAM to load your entire BAM file into memory, perhaps you might try to see if data.table is up to the task.

Here's an example that you might use to group reads by qname using the data that comes with the Rsamtools package:

libray(Rsamtools)
library(data.table) ## make sure you have 1.8.0
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
res0 <- scanBam(fl)[[1]] # always list-of-lists
dt <- data.table(qname=res0$qname, rname=res0$rname, strand=res0$strand, 
                 start=res0$pos, end=res0$pos + res0$qwidth - 1L,
                 key="qname")

Note the value for key. If you are hurting for memory, tailor your call to scanBam to only return the columns from the BAM file that you want.

Now, to see how many places each read maps to, you can create a summary table like so:

mmap <- dt[, list(N=.N), by="qname"]
mmap
                     qname N
 [1,]  B7_589:1:101:825:28 2
 [2,] B7_589:1:110:543:934 2
 [3,] B7_589:1:122:337:968 2
 [4,]  B7_589:1:122:77:789 2
 [5,]  B7_589:1:168:69:249 2
 [6,]  B7_589:1:29:529:379 2
 [7,]  B7_589:2:30:644:942 2
 [8,]  B7_589:2:73:730:487 2
 [9,]    B7_589:2:9:49:661 2
[10,]  B7_589:3:71:478:175 2
First 10 rows of 1699 printed.

.N is a variable that tells you how many rows are in the group you are currently working in. The N column tells you how many times the qname appears in the BAM file you just loaded.

There is a small learning curve to learn how to use data.table, so make sure to read the FAQ and other docs. You would basically use it in similar situations you would use plyr::ddply or base::tapply (among others), and in situations like you have here (lots of rows in a data.frame), you'll see that it will be super performant (in time and memory usage) as compared to equivalent approaches.

ADD COMMENT
1
Entering edit mode
12.6 years ago

Personally, I wouldn't do this in R but using Python. R is great for many things, but basically not for computation involving huge loops.

Probably not optimal, but try this (on a file containing all your bamReads:

f=open("file.dat")
d={}
for i in f.readlines():
    tmp=i.strip()
    if d.has_key(tmp):
       d[tmp]+=1
    else:
       d[tmp]=1

n={}
for i in d.keys():
    if n.has_key(d[i]):
       n[d[i]]+=1
    else:
       n[d[i]]=1

f.close()
ADD COMMENT
0
Entering edit mode

It's true that loops will, for the most part, really kill you in R, but there are still ways to do this without using loops as long as you have enough RAM to load your data in to memory.

ADD REPLY
0
Entering edit mode

can you provide some python code to count his qName instances?

ADD REPLY
0
Entering edit mode

I've edited my asnwer to provide some Python code for this.

ADD REPLY
1
Entering edit mode
12.6 years ago

This comes down to the problem of tabulating a character vector (as opposed to a factor). R is going to make this a factor for itself even if you don't.

> system.time(bamReadTabulation<-table(bamReads$qname))
   user  system elapsed 
 93.101   0.022  93.123

is equivalent to:

> system.time(bamReadFactor<-factor(bamReads$qname))
   user  system elapsed 
 91.605   0.027  91.628 
> system.time(bamReadTabulation<-table(bamReadFactor))
   user  system elapsed 
  3.820   0.009   3.827

I don't know of anyway of asking scanBam to treat qname as a factor or Rle. Even readBamGappedReads with use.names=TRUE returns a character vector.

I guess it's not a big priority for most people since multiple hits are the kind of thing that gets swept under the rug.

ADD COMMENT

Login before adding your answer.

Traffic: 3776 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6