Question

Empty genes.bam files in RSEM-STAR workflow

1

Entering edit mode

2.5 years ago

evocanres ▴ 40

This question concerns an RNAseq data aligment and transcript quantification step that generates empty bam files but still generates counts file. I would love to know 1. If the count data is reliable if bam files are empty; and 2. How I can fix the issue. I am sharing my workflow, results and my search for solutions so far. Any suggestions would be appreciated.

I am working on a RNAseq dataset from JGI genome portal These are paired-end sequenced data, as noted from the description, but has 1 fastq.gz files per sample. So I followed the following steps

Download data using Globus to hpc.
Fastqc
Deinterleave using https://telatin.github.io/seqfu2/tools/deinterleave.html
run rsem-calculate-expression with the following script in a slurm job array.

rsem-calculate-expression --star \
                        --star-path /modules/apps/star/2.7.9a/bin/Linux_x86_64 \
                        -p 60 \
                        --paired-end \
                        --estimate-rspd \
                        --append-names \
                        --output-genome-bam \
                        "${prefix}_R1.fq" "${prefix}_R2.fq" \
                        $genomDIR/GRcm39 \
                        $rsemDIR/"${prefix}"

Results

This is the job.err message

Loading star version 2.7.9a
Loading rsem version 1.3.3
rsem-tbam2gbam: BamConverter.h:131: void BamConverter::process(): Assertion `cqname != qname' failed.
slurm-calex_mm10_5589353.err (END)

Checking the bam files and the genome.bam file is empty

10.genome.bam 0
11.genome.bam 0
12.genome.bam 1.0M
13.genome.bam 0
14.genome.bam 1.0M
15.genome.bam 0
8.genome.bam 0
9.genome.bam 1.0M
91.genome.bam 0
92.genome.bam 1.0M
93.genome.bam 1.0M
94.genome.bam 0
17.genome.bam 2.0M
19.genome.bam 0
20.genome.bam 0
21.genome.bam 1.0M
22.genome.bam 0
23.genome.bam 0
24.genome.bam 0
25.genome.bam 5.0M
26.genome.bam 0
27.genome.bam 0
28.genome.bam 1.0M

The transcript bams are okay.

10.transcript.bam 13G
11.transcript.bam 5.7G
12.transcript.bam 11G
13.transcript.bam 7.4G
14.transcript.bam 5.4G
15.transcript.bam 12G
8.transcript.bam 8.2G
9.transcript.bam 9.9G
91.transcript.bam 12G
92.transcript.bam 7.1G
93.transcript.bam 6.6G
94.transcript.bam 11G
17.transcript.bam 6.5G
19.transcript.bam 11G
20.transcript.bam 9.0G
21.transcript.bam 9.4G
22.transcript.bam 9.8G
23.transcript.bam 7.1G
24.transcript.bam 11G
25.transcript.bam 11G
26.transcript.bam 12G
27.transcript.bam 11G
28.transcript.bam 7.9G

Attempts for solutions:

One user in this thread mentions "I previously used the --outSAMunmapped Within option. By removing this option, I was able to get rsem-tbam2gbam to work. " This discussion on --outSAMunmapped Within should not affect current analysis but reanalysis if the bam files are used later.
I tried looking at the BamConverter.h code but can not figure out for myself what the error message meant.

star rsem • 1.4k views

ADD COMMENT • link updated 21 months ago by Istvan Albert 103k • written 2.5 years ago by evocanres ▴ 40

0

Entering edit mode

If you don't need the genome bam file, try skipping --output-genome-bam

This option generate a BAM file, with alignments mapped to genomic coordinates and annotated with their posterior probabilities.

ADD REPLY • link 22 months ago by unetmailo • 0

0

Entering edit mode

I would like to have the BAM files.

ADD REPLY • link 21 months ago by evocanres ▴ 40

0

Entering edit mode

I am just curious, in the 9 months since you posted there was no solution?

The error comes from this line of the source:

https://github.com/deweylab/RSEM/blob/8bc1e2115493c0cdf3c6bee80ef7a21a91b2acce/BamConverter.h#L131

I would first check that the input FASTQ files are properly paired.

If your input data is valid I would say that is probably best reported as a bug as

it seems that even the software authors added the assertion there as a sanity check, checking for an event that should never happen.

ADD REPLY • link 21 months ago by Istvan Albert 103k