Hello,
I'm working with bulk RNA sequencing data, specifically BAM files of unmapped reads, around 450 MB each, with 35 bp reads. I'm using GATK4's PathSeqPipelineSpark to analyze the unmapped reads with the following command:
gatk --java-options "-Xmx200g -Djava.io.tmpdir=./gatk-temp/ -XX:+UseG1GC -XX:MaxGCPauseMillis=200" PathSeqPipelineSpark \
    --input ./unmapped.bam \
    --min-clipped-read-length 35 \
    --microbe-dict ./pathseq_microbe.dict \
    --microbe-bwa-image ./pathseq_microbe.fa.img \
    --taxonomy-file ./pathseq_taxonomy.db \
    --output ./unmapped.pathseq.bam \
    --scores-output ./unmapped.pathseq.txt \
    --score-metrics ./unmapped.scores.txt \
    --filter-metrics ./unmapped.filter.metrics.txt \
    --is-host-aligned true \
    --filter-duplicates false \
    --divide-by-genome-length true \
    --read-filter WellformedReadFilter \
    --tmp-dir ./gatk-temp \
    --conf spark.master=local[20] \
    --conf spark.local.dir=./spark-temp
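(In case read counts are more useful than file sizes: assuming samtools is available, I can summarize the inputs with

samtools flagstat ./unmapped.bam             # total and unmapped read counts
samtools stats ./unmapped.bam | grep '^RL'   # read-length histogram (all 35 bp here)

and post those numbers if that helps.)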
However, the runtime is extremely long: the pipeline has been running for over 7 days on a single 450 MB BAM and still hasn't finished. Is this expected for input of this size with 35 bp reads? It feels unusually slow, and I would greatly appreciate any advice.
Could there be an issue with my configuration or parameters that is causing this delay? Is there anything I could adjust to improve performance, given that my reads are only 35 bp long?
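In case a concrete change is easier to comment on, here is the variant I was considering trying next, if I've read the Spark tool options correctly: the dedicated --spark-master flag instead of setting spark.master through --conf, plus --bam-partition-size to split the BAM into more input partitions. The value below is only a guess, aimed at getting roughly 28 partitions from a 450 MB file across 20 cores:

# Untested variant of the command above; only the Spark-related lines at the end change.
# --bam-partition-size is in bytes; 16000000 (~16 MB) is a guess on my part.
gatk --java-options "-Xmx200g -Djava.io.tmpdir=./gatk-temp/ -XX:+UseG1GC -XX:MaxGCPauseMillis=200" PathSeqPipelineSpark \
    --input ./unmapped.bam \
    --min-clipped-read-length 35 \
    --microbe-dict ./pathseq_microbe.dict \
    --microbe-bwa-image ./pathseq_microbe.fa.img \
    --taxonomy-file ./pathseq_taxonomy.db \
    --output ./unmapped.pathseq.bam \
    --scores-output ./unmapped.pathseq.txt \
    --score-metrics ./unmapped.scores.txt \
    --filter-metrics ./unmapped.filter.metrics.txt \
    --is-host-aligned true \
    --filter-duplicates false \
    --divide-by-genome-length true \
    --read-filter WellformedReadFilter \
    --tmp-dir ./gatk-temp \
    --spark-master local[20] \
    --bam-partition-size 16000000 \
    --conf spark.local.dir=./spark-temp

Would that be a sensible direction, or is the bottleneck more likely elsewhere (e.g. the BWA alignment stage against the microbe image)?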
Thanks in advance for your help!
Best regards, tongzhen