Our lab is currently facing a great difficulty: we suddenly obtained 700+ exome sequencing samples, and our server is starting to fail under the huge amount of I/O involved in running GATK. The GATK pipeline, together with the Picard tools, creates a large number of intermediate files, which adds an enormous burden to our server. I have recently discovered bcbio-nextgen and am still testing its performance. However, I also wonder whether there is a simpler way to solve this problem (e.g. piping as many steps as possible). For example, this useful tutorial shows how to perform the analysis up to duplicate marking, but it uses rmdup from samtools instead of the recommended Picard MarkDuplicates. I have checked the Picard documentation, and it states:
Some Picard programs, e.g. MarkDuplicates, cannot read its input from
stdin, because it makes multiple passes over the input file.
So are there any recommended alternatives to Picard's MarkDuplicates that support piping?
Currently I am trying to do this (converting the SAM from bwa to BAM before sorting; since samtools index cannot read from stdin, the index is built afterwards from the sorted BAM on disk):
bwa mem -M <Ref> <Read1.fq> <Read2.fq> | samtools view -bSh - | samtools sort - <Sample>.sorted
samtools index <Sample>.sorted.bam
Two alternatives to MarkDuplicates that support streaming are:
samtools rmdup -- This is generally considered slightly inferior to the algorithm used in MarkDuplicates because it does not handle some edge cases (see this discussion: Samtools + Picard MarkDuplicates). It does support streaming, so the performance tradeoff is often worthwhile for large datasets. bcbio-nextgen currently supports this as a standard option.
biobambam's markduplicates2 -- This implements the MarkDuplicates algorithm with improved efficiency; see the paper for more details. It still requires a two-pass algorithm, but it makes use of temporary space, so you can redirect that I/O to a local high-speed disk instead of a networked disk. The man page has more details on the algorithm.
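To make the temporary-space redirection concrete, here is a sketch. It assumes the binary is named `bammarkduplicates2`, as shipped in current biobambam releases; the sample names and the `/local/scratch` path are placeholders, and the command is guarded so it is a no-op where biobambam is not installed.

```shell
# Sketch: biobambam duplicate marking with its temporary files pointed
# at fast local scratch space rather than the networked filesystem.
# Sample names and the scratch path below are placeholders.
if command -v bammarkduplicates2 >/dev/null 2>&1; then
    bammarkduplicates2 \
        I=Sample.sorted.bam \
        O=Sample.dedup.bam \
        M=Sample.dedup.metrics \
        tmpfile=/local/scratch/Sample.mdup.tmp
else
    echo "bammarkduplicates2 not found; install biobambam first" >&2
fi
```

The point is the `tmpfile=` redirection: the two passes still happen, but the intermediate I/O never touches the shared filesystem.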
The goal is to support biobambam in bcbio-nextgen, and I hope this will be available in the short term. More generally, I'd love to hear about any scaling issues you find. The goal of bcbio-nextgen is to be a community resource and pipeline that handles exactly these kinds of scaling issues, so feedback would be incredibly useful.
I am really looking forward to the updates in bcbio-nextgen. Currently I am timing the runtime of bcbio-nextgen against the traditional pipeline (piping as many steps as possible and using GATK). Hopefully I can come back with the results soon.
Q: Can Picard programs read from stdin and write to stdout?
A: Most Picard programs can do this. To read from stdin, specify
/dev/stdin as the input file. To write to stdout, specify /dev/stdout
as the output file, and also add the option QUIET=true to suppress
other messages that otherwise would be written to stdout. Note that
Picard determines whether to write in SAM or BAM format by examining
the file extension of the output file. Since /dev/stdout ends in
neither .sam nor .bam, Picard defaults to writing in BAM format. Some
Picard programs, e.g. MarkDuplicates, cannot read its input from
stdin, because it makes multiple passes over the input file. When
writing a BAM to stdout so that it can be read by another program,
passing the argument COMPRESSION_LEVEL=0 to the program writing the BAM to stdout can reduce unnecessary computation.
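A sketch of the streaming pattern this FAQ describes, using SortSam as the example tool (the jar name reflects Picard's per-tool-jar distribution; file names are placeholders, and the pipe is guarded so it is a no-op where the tools are missing):

```shell
# Sketch: a Picard tool reading from /dev/stdin and writing uncompressed
# BAM to /dev/stdout, per the FAQ above. File and jar paths are placeholders.
if command -v samtools >/dev/null 2>&1 && [ -f SortSam.jar ]; then
    samtools view -h Sample.bam \
      | java -jar SortSam.jar \
          INPUT=/dev/stdin \
          OUTPUT=/dev/stdout \
          SORT_ORDER=coordinate \
          QUIET=true \
          COMPRESSION_LEVEL=0 \
      | samtools view -b - > Sample.sorted.bam
else
    echo "samtools or SortSam.jar not found; skipping" >&2
fi
```

QUIET=true keeps log chatter out of the BAM stream, and COMPRESSION_LEVEL=0 avoids compressing data that the next stage immediately decompresses.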
Hi Pierre, thank you for your answer. I have checked the Picard manual; however, I would really like to know whether there are any other alternatives to MarkDuplicates that allow piping.
No, not if you want to use MarkDuplicates. If piping matters more to you than the choice of algorithm, maybe you could consider using samtools' rmdup, although I would suggest sticking to MarkDuplicates and living with that intermediate temporary file.
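For reference, this is roughly how rmdup drops straight into the alignment pipe with samtools 0.1.x syntax (reference, read files and sample name are placeholders; guarded so it only runs where the tools are installed):

```shell
# Sketch: duplicate removal streamed inline with samtools 0.1.x rmdup,
# so no intermediate BAM is written to disk before deduplication.
# Reference, read files and the sample name are placeholders.
if command -v bwa >/dev/null 2>&1 && command -v samtools >/dev/null 2>&1; then
    bwa mem -M Ref.fa Read1.fq Read2.fq \
      | samtools view -bSh - \
      | samtools sort -o - Sample.sort.tmp \
      | samtools rmdup - Sample.dedup.bam
else
    echo "bwa or samtools not found; skipping" >&2
fi
```

Here `samtools sort -o` writes the sorted BAM to stdout (using Sample.sort.tmp as the temporary-file prefix), and rmdup reads it from stdin via `-`.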