Entering edit mode
21 months ago
evocanres
▴
30
This question concerns an RNAseq data aligment and transcript quantification step that generates empty bam files but still generates counts file. I would love to know 1. If the count data is reliable if bam files are empty; and 2. How I can fix the issue. I am sharing my workflow, results and my search for solutions so far. Any suggestions would be appreciated.
I am working on a RNAseq dataset from JGI genome portal These are paired-end sequenced data, as noted from the description, but has 1 fastq.gz files per sample. So I followed the following steps
- Download data using Globus to hpc.
- Fastqc
- Deinterleave using https://telatin.github.io/seqfu2/tools/deinterleave.html
- run rsem-calculate-expression with the following script in a slurm job array.
rsem-calculate-expression --star \
--star-path /modules/apps/star/2.7.9a/bin/Linux_x86_64 \
-p 60 \
--paired-end \
--estimate-rspd \
--append-names \
--output-genome-bam \
"${prefix}_R1.fq" "${prefix}_R2.fq" \
$genomDIR/GRcm39 \
$rsemDIR/"${prefix}"
Results
- This is the
job.err
message
Loading star version 2.7.9a
Loading rsem version 1.3.3
rsem-tbam2gbam: BamConverter.h:131: void BamConverter::process(): Assertion `cqname != qname' failed.
slurm-calex_mm10_5589353.err (END)
- Checking the bam files and the genome.bam file is empty
10.genome.bam 0
11.genome.bam 0
12.genome.bam 1.0M
13.genome.bam 0
14.genome.bam 1.0M
15.genome.bam 0
8.genome.bam 0
9.genome.bam 1.0M
91.genome.bam 0
92.genome.bam 1.0M
93.genome.bam 1.0M
94.genome.bam 0
17.genome.bam 2.0M
19.genome.bam 0
20.genome.bam 0
21.genome.bam 1.0M
22.genome.bam 0
23.genome.bam 0
24.genome.bam 0
25.genome.bam 5.0M
26.genome.bam 0
27.genome.bam 0
28.genome.bam 1.0M
- The transcript bams are okay.
10.transcript.bam 13G
11.transcript.bam 5.7G
12.transcript.bam 11G
13.transcript.bam 7.4G
14.transcript.bam 5.4G
15.transcript.bam 12G
8.transcript.bam 8.2G
9.transcript.bam 9.9G
91.transcript.bam 12G
92.transcript.bam 7.1G
93.transcript.bam 6.6G
94.transcript.bam 11G
17.transcript.bam 6.5G
19.transcript.bam 11G
20.transcript.bam 9.0G
21.transcript.bam 9.4G
22.transcript.bam 9.8G
23.transcript.bam 7.1G
24.transcript.bam 11G
25.transcript.bam 11G
26.transcript.bam 12G
27.transcript.bam 11G
28.transcript.bam 7.9G
Attempts for solutions:
- One user in this thread mentions "I previously used the
--outSAMunmapped Within
option. By removing this option, I was able to get rsem-tbam2gbam to work. " This discussion on--outSAMunmapped Within
should not affect current analysis but reanalysis if the bam files are used later. - I tried looking at the BamConverter.h code but can not figure out for myself what the error message meant.
If you don't need the genome bam file, try skipping --output-genome-bam
This option generate a BAM file, with alignments mapped to genomic coordinates and annotated with their posterior probabilities.
I would like to have the BAM files.
I am just curious, in the 9 months since you posted there was no solution?
The error comes from this line of the source:
https://github.com/deweylab/RSEM/blob/8bc1e2115493c0cdf3c6bee80ef7a21a91b2acce/BamConverter.h#L131
I would first check that the input FASTQ files are properly paired.
If your input data is valid I would say that is probably best reported as a bug as
it seems that even the software authors added the assertion there as a sanity check, checking for an event that should never happen.