I am trying to call variants on my DNA-seq data. However, I keep getting stuck at the same step, where I sort my BAM files using Picard. When I use "du -sh *" to check file sizes, the sorted.bam file that gets produced has nothing in it.
This is the code I am using on my bam files to create my sorted bam file.
#!/bin/bash
#SBATCH -J sc1_10bamSORT
#SBATCH -A gts-rro
#SBATCH -N 1 --ntasks-per-node=24
#SBATCH --mem-per-cpu=8G
#SBATCH -t 36:00:00
#SBATCH -o Report-%j.out
cd $SLURM_SUBMIT_DIR
ml picard/3.0.0
java -jar /usr/local/pace-apps/manual/packages/picard/3.0.0/build/libs/picard.jar SortSam -I /storage/RGBam/FR04_SC1_10.bam -O /storage/Sorted/FR04_SC1_10.sorted.bam --SORT_ORDER coordinate
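Since the SLURM header only sets -o, Picard's stderr (its progress messages and any exception) ends up in Report-%j.out. One way to make a failure easier to spot is to capture a per-sample log; a sketch of the same call with stderr redirected (the log path is my own choice, and the command is skipped if java or the jar is unavailable):

```shell
# Same SortSam call as above, with stderr captured to a per-sample log.
# PICARD path is copied from the script; LOG is a hypothetical location.
PICARD=/usr/local/pace-apps/manual/packages/picard/3.0.0/build/libs/picard.jar
LOG=/storage/Sorted/FR04_SC1_10.sort.log
if command -v java >/dev/null 2>&1 && [ -r "$PICARD" ]; then
  java -jar "$PICARD" SortSam \
    -I /storage/RGBam/FR04_SC1_10.bam \
    -O /storage/Sorted/FR04_SC1_10.sorted.bam \
    --SORT_ORDER coordinate \
    2> "$LOG"
fi
```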
In my pipeline:
1. I concatenate my fastq.gz files.
2. I trim them using Trimmomatic.
3. I create my index from the reference genome using Bwa-Mem2.
4. I map my reads using Bwa-Mem2 to produce my .sam files.
5. I use SAMtools to convert my .sam files to .bam files and add read groups.
6. LASTLY, my problem step: I sort my bam files using Picard to generate the sorted.bam files needed to mark duplicates using Picard.
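Steps 4-6 for one sample look roughly like this. This is a sketch, not my exact commands: the reference and fastq names are placeholders, and step 6 shows the samtools equivalent of the Picard sort. It is wrapped in a function so it can be reviewed before running.

```shell
# Hypothetical sketch of steps 4-6 for a single sample.
run_sample() {
  local ref=$1      # reference fasta, already indexed with: bwa-mem2 index "$ref"
  local sample=$2   # e.g. FR04_SC1_10
  # 4. map the trimmed, paired reads (read group added at mapping time)
  bwa-mem2 mem -t 24 -R "@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA" \
    "$ref" "${sample}_R1.paired.fastq.gz" "${sample}_R2.paired.fastq.gz" \
    > "${sample}.sam"
  # 5. convert SAM to BAM
  samtools view -b -o "${sample}.bam" "${sample}.sam"
  # 6. coordinate-sort the BAM (samtools equivalent of Picard SortSam)
  samtools sort -@ 8 -o "${sample}.sorted.bam" "${sample}.bam"
}
```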
The first time I did this, 15 files had this problem. This time, 10 files do. They are the same files, but somehow 5 got resolved this round. How can I assure myself that the "resolved" files consist of high-quality data? And what could be causing problems when I sort my bam files?
Hi,
First, have you checked whether the unsorted bam files have data in them? There might be an error in a previous step.
There could be many reasons why the file is empty. Do you have any log files or error messages from your commands?
Since you are using a workload manager (Slurm), another thing to verify is that you are requesting enough time and resources for your analysis.
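For example, a few quick sanity checks on one of the problem files (the path is taken from your script; the checks are skipped if samtools is not on PATH):

```shell
# Basic integrity and content checks on an unsorted BAM.
BAM=/storage/RGBam/FR04_SC1_10.bam
if command -v samtools >/dev/null 2>&1; then
  # quickcheck verifies the BAM is intact (valid header, EOF block present)
  samtools quickcheck "$BAM" && echo "quickcheck: OK" || echo "quickcheck: FAILED"
  # flagstat reports total reads, how many are paired, and how many mapped
  samtools flagstat "$BAM"
  # the header should contain @SQ lines and the @RG line you added
  samtools view -H "$BAM" | head
fi
```

If flagstat reports reads but Picard still produces an empty output, the Picard log (and its temp directory, settable with --TMP_DIR, which can fill up on a cluster) would be the next place to look.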
Yes, the unsorted bam files have data in them, and they are all of comparable size. I went back a few steps and the same is true of my sam and fastq files, so everything seems consistent there.
In previous runs I also tried increasing my time and resources, but I ran into the same problem.
When I checked the logs, they said these files had 0 paired reads, which is untrue, because my paired.fastq files have data in them and are of comparable size to the other files.
What resolved this problem was using samtools to sort instead of Picard.
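For reference, the samtools sort that worked looked roughly like this (input/output paths are the ones from my script; the thread count and per-thread memory are values I picked for a 24-core node, so adjust as needed). Wrapped in a function so the paths can be swapped per sample:

```shell
# Coordinate-sort a BAM with samtools instead of Picard, then index it.
sort_bam() {
  local in=$1 out=$2
  # -@ 8: sorting threads; -m 2G: memory per thread (adjust to your node)
  samtools sort -@ 8 -m 2G -o "$out" "$in" \
    && samtools index "$out"
}
# usage:
# sort_bam /storage/RGBam/FR04_SC1_10.bam /storage/Sorted/FR04_SC1_10.sorted.bam
```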