Question

How to parallelize a csaw pipeline?

4.6 years ago

jordi.planells ▴ 480

Hi all! I need your help. I am trying to implement a csaw pipeline for some ATAC-seq data. I have tried running the windowCounts function through mclapply. It runs smoothly, but I am facing problems downstream. The main issue is that the output of mclapply is a list of RangedSummarizedExperiment objects, one element per .bam file. The problem comes when I try to use the asDGEList function, which does not accept a list as input (see error below).

abundances = aveLogCPM(asDGEList(data50)) >= -1

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘asDGEList’ for signature ‘"list"’

data50[[1]]

class: RangedSummarizedExperiment 
dim: 1410524 1 
metadata(6): spacing width ... param final.ext
assays(1): counts
rownames: NULL
rowData names(0):
colnames: NULL
colData names(4): bam.files totals ext rlen

So here comes my question: Is there a way of using csaw parallelized without having lists as outputs?

Another solution would be to reduce all the RangedSummarizedExperiment elements into one. The issue here is that the elements have different numbers of rows, so I can't do it (or at least I don't know how).

Thank you beforehand!

Have a great day,

Jordi

R csaw parallelization • 1.2k views

score 2 · Accepted Answer · 2020-04-22

Did you follow the csaw manual? It contains parallelization options. Don't use mclapply here. Page 14 of the manual: https://bioconductor.org/packages/3.10/workflows/vignettes/csawUsersGuide/inst/doc/csaw.pdf

Users can parallelize read counting and several other functions by setting the BPPARAM argument. This will load and process reads from multiple BAM files simultaneously. The number of workers and type of parallelization can be specified using BiocParallelParam objects. By default, parallelization is turned off (i.e., set to a SerialParam object) because it provides little benefit for small files or on systems with I/O bottleneck.

The idea is to have a single object where windows are rows and samples are columns. There is no need for a custom approach; the authors have put together a comprehensive manual.
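As a rough illustration of the above, here is a minimal sketch that passes all BAM files to a single windowCounts call and sets BPPARAM. The wrapper name count_windows_parallel, the file names, the worker count and width = 50 are assumptions for illustration only, not values from the question. MulticoreParam is just one possible BiocParallel backend and relies on forking, so it is not available on Windows. The uneven row counts in the mclapply output probably arise because windowCounts drops low-count windows through its filter argument, and that filter was applied to each file separately. Counting all files in one call should avoid this.

```r
library(csaw)          # windowCounts(), readParam(), asDGEList()
library(BiocParallel)  # MulticoreParam()
library(edgeR)         # aveLogCPM()

# Count reads into windows for all BAM files in ONE call, so the result is a
# single RangedSummarizedExperiment (rows = windows, columns = samples).
# 'workers' and 'width' defaults are illustrative assumptions.
count_windows_parallel <- function(bam.files, workers = 2, width = 50) {
    windowCounts(
        bam.files,
        width = width,
        param = readParam(),
        BPPARAM = MulticoreParam(workers = workers)
    )
}

# Hypothetical usage (file names are placeholders):
# bam.files <- c("sample1.bam", "sample2.bam", "sample3.bam")
# data50 <- count_windows_parallel(bam.files, workers = 3)
# keep <- aveLogCPM(asDGEList(data50)) >= -1
# filtered <- data50[keep, ]
```

Because data50 is then a single RangedSummarizedExperiment rather than a list, asDGEList should accept it directly, and the abundance filter from the question can be applied by subsetting rows.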