MarkDuplicatesSpark: how to speed it up?
3.3 years ago
quentin54520 ▴ 120

Hello all,

I would like to know whether there is a good way to speed up MarkDuplicatesSpark. I am working with human genome data, around 900 million reads (151 bp).

I work on a cluster (with Slurm).

The command I used (with 60 GB of memory and 14 CPUs) is:

gatk --java-options "-Xmx${SLURM_MEM_PER_NODE}M" MarkDuplicatesSpark \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${BAM_OUTPUT_DIR}/${BAM_OUTPUT}    \
-M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
--tmp-dir ${tmp_dir} \
--create-output-bam-index false \
-- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.log

Before running MarkDuplicatesSpark I ran: fastp to trim the FASTQ files, then bwa-mem2, then samtools view.

I assume my BAM is grouped by query name, since I did not run any sort step, but how can I be sure?
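One way to see what the file itself declares is to read the `@HD` line of the BAM header, where the SAM specification stores the `SO` (sort order) and `GO` (grouping) tags. The sketch below is only that: `sort_order_from_header` is a hypothetical helper name, and `BAM` is an assumed variable holding the path to the alignment file. Note that the header only reports what the producing tool declared; it does not prove the actual record order.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of samtools or GATK): reads a SAM header on
# stdin and prints the SO (sort order) and GO (grouping) tags of the @HD line.
sort_order_from_header() {
  awk -F'\t' '
    $1 == "@HD" {
      so = "(absent)"; go = "(absent)"
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SO:/) so = substr($i, 4)
        if ($i ~ /^GO:/) go = substr($i, 4)
      }
      print "SO=" so " GO=" go
      found = 1
    }
    END { if (!found) print "no @HD line" }
  '
}

# Usage, assuming samtools is on PATH and BAM (an assumed variable) points to the file.
# "samtools view -H" prints only the header.
if command -v samtools >/dev/null 2>&1 && [ -f "${BAM:-}" ]; then
  samtools view -H "$BAM" | sort_order_from_header
fi
```

A file written straight from the aligner through `samtools view` will typically report `SO=unsorted` (or have no `SO` tag), even when mates are still adjacent in the file.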

It took more than a day to finish (one file finished after 1 day and 5 hours; the others are still running).

Please let me know if I can do anything to speed it up.

Thanks in advance

Quentin

spark gatk genome • 1.5k views

Don't you think that sorting by coordinate would help?


Actually, it's the opposite:

"The tool is optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances."


Maybe the problem is that the BAM header says the file is unsorted. But I think I can assume the BAM file is query-grouped; I will see whether that improves things.
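Rather than assuming, the grouping can be spot-checked on the records themselves. The following is a rough sketch, not a definitive test: `is_query_grouped` is a hypothetical helper name, `BAM` is an assumed variable, and the 2,000,000-record sample size is an arbitrary choice to keep memory use small. It checks that each read name (first SAM column) appears in one contiguous run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads SAM records (no header) on stdin and reports
# whether every QNAME (column 1) forms a single contiguous block.
is_query_grouped() {
  awk -F'\t' '
    $1 != prev {
      # A name that reappears after a different name means reads are not grouped.
      if ($1 in seen) { bad = 1; exit }
      seen[$1] = 1
      prev = $1
    }
    END { print (bad ? "not query-grouped" : "query-grouped") }
  '
}

# Usage, assuming samtools is on PATH and BAM (an assumed variable) points to the file.
# Only the first 2,000,000 records are sampled, so this is a spot check, not a proof.
if command -v samtools >/dev/null 2>&1 && [ -f "${BAM:-}" ]; then
  samtools view "$BAM" | head -n 2000000 | is_query_grouped
fi
```

If the sample comes back as query-grouped, the "unsorted" header is probably just a missing declaration rather than a sign that the reads were shuffled.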
