predeus · 18 months ago
Hi all,
I am using bbduk.sh and was wondering if there's an efficient way to process multiple sets of reads with it. E.g., if read 1 is split across 4 separate files and read 2 across another 4, typical mappers like bowtie2 or STAR support a comma-separated syntax for listing them.
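For reference, the comma-separated lists those mappers accept can be built from an array instead of typed by hand. A minimal sketch (file, index, and sample names below are placeholders):

```shell
# Join an array of lane files into the comma-separated list
# that bowtie2/STAR expect. Names are made up for illustration.
r1=(sample_L00{1..4}_R1.fastq.gz)
r2=(sample_L00{1..4}_R2.fastq.gz)
r1_list=$(IFS=,; echo "${r1[*]}")   # join on the first char of IFS
r2_list=$(IFS=,; echo "${r2[*]}")
# Illustrative invocations (index names are placeholders):
#   bowtie2 -x idx -1 "$r1_list" -2 "$r2_list" -S out.sam
#   STAR --genomeDir idx --readFilesIn "$r1_list" "$r2_list"
echo "$r1_list"
```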
Concatenating the files first seems like a waste of I/O, which is already under heavy stress when we process many samples.
Have you tried process substitution or a named pipe? That said, the most efficient way to process the data may still be to start 4 jobs in parallel.
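To illustrate the named-pipe idea: the writer streams the concatenation through a FIFO, so no merged copy ever touches the disk. The sketch below is self-contained (tiny fake lane files, `wc` as a stand-in reader); the bbduk.sh line is illustrative only and untested here:

```shell
# Self-contained named-pipe sketch: two fake lane files, one FIFO.
tmp=$(mktemp -d) && cd "$tmp"
printf '@r1\nACGT\n+\nIIII\n' > lane1_R1.fq
printf '@r2\nTGCA\n+\nIIII\n' > lane2_R1.fq
mkfifo r1.pipe
cat lane1_R1.fq lane2_R1.fq > r1.pipe &  # writer: streams, no merged file on disk
# A real run would read the pipe like a regular file, e.g. (hypothetical):
#   bbduk.sh in=r1.pipe out=trimmed_R1.fq ref=adapters.fa
wc -l < r1.pipe                          # stand-in reader: total lines from both lanes
wait
```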
Thank you for the suggestions! Process substitution fails and I'm not sure why - something in its I/O block chokes on stdin, I think? The errors don't make much sense. I haven't tried a named pipe yet.
BBDuk is already extremely efficient, so even processing the files sequentially is actually OK - what takes longer is concatenating the sequences afterwards (and when the files are large, which is often, this causes hundreds of GB of unnecessary, redundant I/O load). For now I have the "extensive" solution, but I'll post here if I find something efficient and sleek.
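The run-N-jobs-in-parallel route mentioned above can be sketched like this; `trim_pair` is a hypothetical stand-in for the real per-pair bbduk.sh call, so the skeleton stays runnable:

```shell
# One background job per lane pair, then wait for all of them.
trim_pair() {
  # a real version would run something like (hypothetical flags):
  #   bbduk.sh in="${1}_R1.fq.gz" in2="${1}_R2.fq.gz" ref=adapters.fa ...
  printf 'done %s\n' "$1"
}
for lane in L001 L002 L003 L004; do
  trim_pair "$lane" &   # launch jobs concurrently
done
wait                    # block until every lane finishes
```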
Merging the BAMs at a point further down the workflow would likely be the most efficient way, since samtools can do it multi-threaded.
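A minimal sketch of that downstream merge; file names and the thread count are placeholders, and the `samtools merge` call itself is shown but not run here (`-@` is samtools' flag for additional threads):

```shell
# Merge per-lane BAMs in one multi-threaded pass downstream.
threads=8
inputs=(clean_L00{1..4}.bam)   # placeholder per-lane outputs
# Actual call (requires samtools; not executed in this sketch):
#   samtools merge -@ "$threads" merged.bam "${inputs[@]}"
printf 'would merge %d BAMs with %d threads\n' "${#inputs[@]}" "$threads"
```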