Hello all,
I would like to know if there is any good option to speed up MarkDuplicatesSpark. I work with the human genome, with around 900 million reads (151 bp) per sample.
I work on a cluster (with SLURM).
The command I used is (with 60 GB of memory and 14 CPUs):
gatk --java-options "-Xmx${SLURM_MEM_PER_NODE}M" MarkDuplicatesSpark \
-I ${BAM_INPUT_DIR}/${BAM_INPUT} \
-O ${BAM_OUTPUT_DIR}/${BAM_OUTPUT} \
-M ${Markduplicate_metrics_DIR}/${BAM_INPUT}.metrics.txt \
--tmp-dir ${tmp_dir} \
--create-output-bam-index false \
-- --spark-master local[${SLURM_CPUS_PER_TASK}] 2> ${LOGS_DIR}/${BAM_INPUT}.log
Before running MarkDuplicatesSpark I did: fastp to trim the FASTQs, bwa-mem2 for alignment, and samtools view to convert the output to BAM.
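For reference, here is a minimal sketch of that preprocessing (file names such as sample_R1.fastq.gz and ref.fasta are placeholders, not my actual paths); since bwa-mem2 is piped straight into samtools view with no sort step, the alignments stay in the order bwa-mem2 emits them, i.e. grouped by query name:

# Sketch only: trim, align, convert to BAM without any sort step,
# so read pairs stay grouped by query name.
fastp \
    -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
    -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz \
    --json sample.fastp.json --html sample.fastp.html

bwa-mem2 mem -t ${SLURM_CPUS_PER_TASK} \
    -R '@RG\tID:sample\tSM:sample\tPL:ILLUMINA' \
    ref.fasta trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
    | samtools view -@ 4 -b -o sample.unsorted.bam -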
I assumed that my BAM is grouped by query name, since I didn't do any sort step, but how can I be sure?
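One way to check (assuming samtools is available) is to look at the @HD line of the BAM header; the SO tag (and the GO tag, if present) records what the file claims about its ordering:

# Print only the @HD line of the header
samtools view -H ${BAM_INPUT_DIR}/${BAM_INPUT} | grep '^@HD'
# Typical values:
#   SO:unsorted    -> no sort order recorded (bwa/bwa-mem2 output usually looks like this)
#   SO:queryname   -> explicitly queryname-sorted
#   SO:coordinate  -> coordinate-sorted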
It took more than a day to finish (one file finished after 1 day and 5 hours; the others are still running).
Please let me know if I can do anything to speed this up.
Thanks in advance
Quentin
Don't you think that sorting by coordinate would help?
Actually, it's the opposite:
"The tool is optimized to run on queryname-grouped alignments (that is, all reads with the same queryname are together in the input file). If provided coordinate-sorted alignments, the tool will spend additional time first queryname sorting the reads internally. This can result in the tool being up to 2x slower processing under some circumstances."
Maybe the problem is that the BAM header says the file is unsorted. But I think I can assume that the BAM is query-grouped, since it comes straight from bwa-mem2 with no sort step; I will see if fixing that improves things.
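If that turns out to be the issue, one possible fix (a sketch only, valid only if the reads really are query-grouped, and assuming GNU sed and samtools) is to rewrite the @HD line so the header declares the query grouping, which should let MarkDuplicatesSpark skip its internal queryname sort according to the documentation quoted above:

# Sketch: add GO:query to the @HD line without touching the alignments.
# Only safe if the BAM is truly query-grouped (straight out of bwa-mem2, never sorted).
# The \t in the sed replacement assumes GNU sed.
samtools view -H input.bam \
    | sed '/^@HD/ s/SO:unsorted/SO:unsorted\tGO:query/' \
    > new_header.sam
samtools reheader new_header.sam input.bam > input.querygrouped.bam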