Hello,
I'm working with bulk RNA sequencing data, specifically BAM files of unmapped reads, around 450 MB each, with 35 bp reads. I'm using GATK4's PathSeqPipelineSpark to analyze the unmapped reads with the following command:
gatk --java-options "-Xmx200g -Djava.io.tmpdir=./gatk-temp/ -XX:+UseG1GC -XX:MaxGCPauseMillis=200" PathSeqPipelineSpark \
    --input ./unmapped.bam \
    --min-clipped-read-length 35 \
    --microbe-dict ./pathseq_microbe.dict \
    --microbe-bwa-image ./pathseq_microbe.fa.img \
    --taxonomy-file ./pathseq_taxonomy.db \
    --output ./unmapped.pathseq.bam \
    --scores-output ./unmapped.pathseq.txt \
    --score-metrics ./unmapped.scores.txt \
    --filter-metrics ./unmapped.filter.metrics.txt \
    --is-host-aligned true \
    --filter-duplicates false \
    --divide-by-genome-length true \
    --read-filter WellformedReadFilter \
    --tmp-dir ./gatk-temp \
    --conf spark.master=local[20] \
    --conf spark.local.dir=./spark-temp
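(In case read counts are more useful than file sizes: assuming samtools is available, I can summarize the inputs with

samtools flagstat ./unmapped.bam             # total and unmapped read counts
samtools stats ./unmapped.bam | grep '^RL'   # read-length histogram (all 35 bp here)

and post those numbers if that helps.)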
However, the runtime is extremely long: the pipeline has been running for over 7 days on a single 450 MB BAM and still hasn't finished. Is this expected for input of this size with 35 bp reads? It feels unusually slow, and I would greatly appreciate any advice.
Could there be an issue with my configuration or parameters that is causing this delay? Is there anything I could adjust to improve performance, given that my reads are only 35 bp long?
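In case a concrete change is easier to comment on, here is the variant I was considering trying next, if I've read the Spark tool options correctly: the dedicated --spark-master flag instead of setting spark.master through --conf, plus --bam-partition-size to split the BAM into more input partitions. The value below is only a guess, aimed at getting roughly 28 partitions from a 450 MB file across 20 cores:

# Untested variant of the command above; only the Spark-related lines at the end change.
# --bam-partition-size is in bytes; 16000000 (~16 MB) is a guess on my part.
gatk --java-options "-Xmx200g -Djava.io.tmpdir=./gatk-temp/ -XX:+UseG1GC -XX:MaxGCPauseMillis=200" PathSeqPipelineSpark \
    --input ./unmapped.bam \
    --min-clipped-read-length 35 \
    --microbe-dict ./pathseq_microbe.dict \
    --microbe-bwa-image ./pathseq_microbe.fa.img \
    --taxonomy-file ./pathseq_taxonomy.db \
    --output ./unmapped.pathseq.bam \
    --scores-output ./unmapped.pathseq.txt \
    --score-metrics ./unmapped.scores.txt \
    --filter-metrics ./unmapped.filter.metrics.txt \
    --is-host-aligned true \
    --filter-duplicates false \
    --divide-by-genome-length true \
    --read-filter WellformedReadFilter \
    --tmp-dir ./gatk-temp \
    --spark-master local[20] \
    --bam-partition-size 16000000 \
    --conf spark.local.dir=./spark-temp

Would that be a sensible direction, or is the bottleneck more likely elsewhere (e.g. the BWA alignment stage against the microbe image)?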
Thanks in advance for your help!
Best regards, tongzhen