Hello all,
I would like to know if there is any good option to speed up MarkDuplicatesSpark. I work with the human genome, with around 900 million reads (151 bp) per sample.
I work on a cluster (with SLURM).
The command I used is (with 60 GB of memory and 14 CPUs):
gatk --java-options "-Xmx${SLURM_MEM_PER_NODE}M" MarkDuplicatesSpark \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${BAM_OUTPUT_DIR}/${BAM_OUTPUT} \
-M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
--tmp-dir ${tmp_dir} \
--create-output-bam-index false \
-- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.log
Before running MarkDuplicatesSpark I did: fastp to trim the FASTQs, bwa-mem2 for alignment, and samtools view to convert the output to BAM.
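For reference, here is a minimal sketch of that preprocessing (file names such as sample_R1.fastq.gz and ref.fasta are placeholders, not my actual paths); since bwa-mem2 is piped straight into samtools view with no sort step, the alignments stay in the order bwa-mem2 emits them, i.e. grouped by query name:

# Sketch only: trim, align, convert to BAM without any sort step,
# so read pairs stay grouped by query name.
fastp \
    -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
    -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
    --json sample.fastp.json --html sample.fastp.html

bwa-mem2 mem -t ${SLURM_CPUS_PER_TASK} \
    -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA' \
    ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
    | samtools view -@ 4 -b -o sample.unsorted.bam -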
I assumed that my BAM is grouped by query name, since I didn't do any sort step, but how can I be sure?
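One way to check (assuming samtools is available) is to look at the @HD line of the BAM header; the SO tag (and the GO tag, if present) records what the file claims about its ordering:

# Print only the @HD line of the header
samtools view -H ${BAM_INPUT_DIR}/${BAM_INPUT} | grep '^@HD'
# Typical values:
#   SO:unsorted    -> no sort order recorded (bwa/bwa-mem2 output usually looks like this)
#   SO:queryname   -> explicitly queryname-sorted
#   SO:coordinate  -> coordinate-sorted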
It took more than a day to finish (one file finished after 1 day and 5 hours; the others are still running).
Please let me know if I can do anything to speed this up.
Thanks in advance
Quentin
Don't you think that sorting by coordinate would help?
Actually, it's the opposite:
"The tool is optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances."
Maybe the problem is that the BAM header says the file is unsorted. But I think I can assume that the BAM is query-grouped, since it comes straight from bwa-mem2 with no sort step; I will see if fixing that improves things.
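If that turns out to be the issue, one possible fix (a sketch only, valid only if the reads really are query-grouped, and assuming GNU sed and samtools) is to rewrite the @HD line so the header declares the query grouping, which should let MarkDuplicatesSpark skip its internal queryname sort according to the documentation quoted above:

# Sketch: add GO:query to the @HD line without touching the alignments.
# Only safe if the BAM is truly query-grouped (straight out of bwa-mem2, never sorted).
# The \t in the sed replacement assumes GNU sed.
samtools view -H input.bam \
    | sed '/^@HD/ s/SO:unsorted/SO:unsorted\tGO:query/' \
    > new_header.sam
samtools reheader new_header.sam input.bam > input.querygrouped.bam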