Performance Issues with GATK4 PathSeqPipelineSpark on 35bp Bulk-RNA Unmapped Reads
4 months ago

Hello,

I'm working with bulk RNA-seq data, specifically BAM files of unmapped reads, each around 450 MB. The reads are 35 bp long. I'm running GATK4's PathSeqPipelineSpark on these unmapped reads with the following command:

 gatk --java-options "-Xmx200g -Djava.io.tmpdir=./gatk-temp/ -XX:+UseG1GC -XX:MaxGCPauseMillis=200" PathSeqPipelineSpark \
        --input ./unmapped.bam \
        --min-clipped-read-length 35 \
        --microbe-dict ./pathseq_microbe.dict \
        --microbe-bwa-image ./pathseq_microbe.fa.img \
        --taxonomy-file ./pathseq_taxonomy.db \
        --output ./unmapped.pathseq.bam \
        --scores-output ./unmapped.pathseq.txt \
        --score-metrics ./unmapped.scores.txt \
        --filter-metrics ./unmapped.filter.metrics.txt \
        --is-host-aligned true \
        --filter-duplicates false \
        --divide-by-genome-length true \
        --conf spark.local.dir=./spark-temp \
        --tmp-dir ./gatk-temp \
        --read-filter WellformedReadFilter \
        --conf spark.master=local[20]

However, the runtime is extremely long: after more than 7 days, the run on a single 450 MB BAM file still hasn't finished. Is this expected for this input size and read length (35 bp)? It seems unusually slow to me, and I would greatly appreciate any advice.
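
For anyone trying to reproduce this, it may help to first confirm how many reads the BAM holds and that they really are all 35 bp. The following is only a minimal sketch: the helper name `read_length_hist` is hypothetical, and the usage line assumes `samtools` is installed. It tallies read lengths from SAM records on stdin and prints one `length<TAB>count` line per distinct length.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK or samtools): tally read lengths
# from SAM records on stdin. Column 10 of a SAM record is the read
# sequence (SEQ); header lines start with '@' and are skipped, as are
# records whose SEQ is stored as '*'.
read_length_hist() {
    awk -F'\t' '!/^@/ && $10 != "*" { n[length($10)]++ }
         END { for (len in n) printf "%d\t%d\n", len, n[len] }' | sort -n
}

# Assumed usage (requires samtools; input path taken from the command above):
#   samtools view ./unmapped.bam | read_length_hist
```

Summing the second column gives the total number of reads with a stored sequence, which is a more useful measure of the workload than the 450 MB file size.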

Could something in my configuration or parameters be causing this delay? Is there anything I could adjust to improve performance, given that my reads are only 35 bp long?

Thanks in advance for your help!

Best regards, tongzhen

pathseq • 380 views
4 months ago

Wow, where did you get 35 bp reads from? I haven't seen reads of that length for well over 10 years. I doubt you'll get anything reliable from such short reads, especially when mapping against multiple bacterial taxa. A 35 bp read carries so little taxonomic information that you'll get massive mismapping between gene families, related genomes, rRNA and so on.
