I have encountered some bugs in transcriptome analysis using STAR and RSEM.
1
0
Entering edit mode
7 weeks ago
Pallondyle • 0

I'm trying to analyze a set of RNA-seq data, and a subset of my data hasn't generated the results. The code I use:

for i in {KO1_1,KO1_2,KO1_3,KO2_1,KO2_2,KO2_3,KO_Ctrl_1,KO_Ctrl_2,KO_Ctrl_3,OE1_1,OE1_2,OE1_3,OE_NC_1,OE_NC_2,OE_NC_3};
do
## STAR alignment
STAR --twopassMode Basic \
    --quantMode TranscriptomeSAM GeneCounts \
    --runThreadN 6 \
    --genomeDir index/STAR_index_hg38+V41 \
    --alignIntronMin 20 \
    --alignIntronMax 50000 \
    --outSAMtype BAM SortedByCoordinate \
    --sjdbOverhang 149 \
    --outSAMattrRGline ID:sample SM:sample PL:ILLUMINA \
    --outFilterMismatchNmax 2 \
    --outSJfilterReads Unique \
    --outSAMmultNmax 1 \
    --outFileNamePrefix ${i} \
    --outSAMmapqUnique 60 \
    --readFilesCommand gunzip -c \
    --readFilesIn rawdata/Bianlab_RNA_seq/${i}_1.fq.gz rawdata/Bianlab_RNA_seq/${i}_2.fq.gz

## RSEM count
rsem-calculate-expression --paired-end --no-bam-output \
--alignments -p 15 \
--bam ${i}Aligned.toTranscriptome.out.bam \
index/RSEM_index_hg38+V41/RSEM_index_hg38+V41 \
intermediate_data/${i}_

done

I tried to rerun the groups that did not generate results, and encountered following error:

rsem-calculate-expression --paired-end --no-bam-output \
> --alignments -p 15 \
> --bam OE1_2Aligned.toTranscriptome.out.bam \
> index/RSEM_index_hg38+V41/RSEM_index_hg38+V41 \
> intermediate_data/OE1_2_
rsem-parse-alignments index/RSEM_index_hg38+V41/RSEM_index_hg38+V41 intermediate_data/OE1_2_.temp/OE1_2_ intermediate_data/OE1_2_.stat/OE1_2_ OE1_2Aligned.toTranscriptome.out.bam 3 -tag XM
[samopen] no @SQ lines in the header.
The SAM/BAM file declares less than one reference sequence!
"rsem-parse-alignments index/RSEM_index_hg38+V41/RSEM_index_hg38+V41 intermediate_data/OE1_2_.temp/OE1_2_ intermediate_data/OE1_2_.stat/OE1_2_ OE1_2Aligned.toTranscriptome.out.bam 3 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

I tried to check the bam file, however, the result is as:

samtools view OE1_2Aligned.toTranscriptome.out.bam
[main_samview] fail to read the header from "OE1_2Aligned.toTranscriptome.out.bam".

I am currently confused about which step's parameters have been called incorrectly, preventing me from obtaining the results.

STAR RSEM • 984 views
ADD COMMENT
0
Entering edit mode

what is the output of

file *.out.bam

?

ADD REPLY
0
Entering edit mode

oops

file *.out.bam
KO1_2Aligned.sortedByCoord.out.bam:       gzip compressed data, extra field
KO1_2Aligned.toTranscriptome.out.bam:     gzip compressed data, extra field
KO1_3Aligned.sortedByCoord.out.bam:       gzip compressed data, extra field
KO1_3Aligned.toTranscriptome.out.bam:     gzip compressed data, extra field
KO2_1Aligned.sortedByCoord.out.bam:       gzip compressed data, extra field
KO2_1Aligned.toTranscriptome.out.bam:     gzip compressed data, extra field
KO2_2Aligned.sortedByCoord.out.bam:       gzip compressed data, extra field
KO2_2Aligned.toTranscriptome.out.bam:     gzip compressed data, extra field
KO2_3Aligned.sortedByCoord.out.bam:       gzip compressed data, extra field
KO2_3Aligned.toTranscriptome.out.bam:     gzip compressed data, extra field
KO_Ctrl_1Aligned.sortedByCoord.out.bam:   gzip compressed data, extra field
KO_Ctrl_1Aligned.toTranscriptome.out.bam: gzip compressed data, extra field
KO_Ctrl_2Aligned.sortedByCoord.out.bam:   gzip compressed data, extra field
KO_Ctrl_2Aligned.toTranscriptome.out.bam: gzip compressed data, extra field
KO_Ctrl_3Aligned.sortedByCoord.out.bam:   gzip compressed data, extra field
KO_Ctrl_3Aligned.toTranscriptome.out.bam: gzip compressed data, extra field
OE1_1Aligned.sortedByCoord.out.bam:       gzip compressed data, extra field
OE1_1Aligned.toTranscriptome.out.bam:     gzip compressed data, extra field
OE1_2Aligned.sortedByCoord.out.bam:       empty
OE1_2Aligned.toTranscriptome.out.bam:     empty
OE1_3Aligned.sortedByCoord.out.bam:       empty
OE1_3Aligned.toTranscriptome.out.bam:     empty
OE_NC_1Aligned.sortedByCoord.out.bam:     empty
OE_NC_1Aligned.toTranscriptome.out.bam:   empty
OE_NC_2Aligned.sortedByCoord.out.bam:     gzip compressed data, extra field
OE_NC_2Aligned.toTranscriptome.out.bam:   gzip compressed data, extra field
OE_NC_3Aligned.sortedByCoord.out.bam:     gzip compressed data, extra field
OE_NC_3Aligned.toTranscriptome.out.bam:   gzip compressed data, extra field
ADD REPLY
0
Entering edit mode

However, OE1_2 and OE1_3 are not the only groups experiencing problems. I tried another non-empty file, and:

rsem-calculate-expression --paired-end --no-bam-output \
> --alignments -p 15 \
> --bam OE_NC_2Aligned.toTranscriptome.out.bam \
> index/RSEM_index_hg38+V41/RSEM_index_hg38+V41 \
> intermediate_data/OE_NC_2_
rsem-parse-alignments index/RSEM_index_hg38+V41/RSEM_index_hg38+V41 intermediate_data/OE_NC_2_.temp/OE_NC_2_ intermediate_data/OE_NC_2_.stat/OE_NC_2_ OE_NC_2Aligned.toTranscriptome.out.bam 3 -tag XM
Done!

rsem-build-read-index 32 1 0 intermediate_data/OE_NC_2_.temp/OE_NC_2__alignable_1.fq intermediate_data/OE_NC_2_.temp/OE_NC_2__alignable_2.fq
Cannot open intermediate_data/OE_NC_2_.temp/OE_NC_2__alignable_1.fq! It may not exist.
"rsem-build-read-index 32 1 0 intermediate_data/OE_NC_2_.temp/OE_NC_2__alignable_1.fq intermediate_data/OE_NC_2_.temp/OE_NC_2__alignable_2.fq" failed! Plase check if you provide correct parameters/options for the pipeline!
ADD REPLY
0
Entering edit mode

The most torturous thing is that some groups did indeed generate results.

ls intermediate_data
KO1_1_.stat              KO1_3_.isoforms.results  KO2_2_.isoforms.results   KO_Ctrl_1_.isoforms.results  KO_Ctrl_3_.isoforms.results  OE1_2_.temp    OE_NC_2_.temp
KO1_1_.temp              KO1_3_.stat              KO2_2_.stat               KO_Ctrl_1_.stat              KO_Ctrl_3_.stat              OE1_3_.stat    OE_NC_3_.stat
KO1_2_.genes.results     KO2_1_.genes.results     KO2_3_.genes.results      KO_Ctrl_2_.genes.results     OE1_1_.genes.results         OE1_3_.temp    OE_NC_3_.temp
KO1_2_.isoforms.results  KO2_1_.isoforms.results  KO2_3_.isoforms.results   KO_Ctrl_2_.isoforms.results  OE1_1_.isoforms.results      OE_NC_1_.stat
KO1_2_.stat              KO2_1_.stat              KO2_3_.stat               KO_Ctrl_2_.stat              OE1_1_.stat                  OE_NC_1_.temp
KO1_3_.genes.results     KO2_2_.genes.results     KO_Ctrl_1_.genes.results  KO_Ctrl_3_.genes.results     OE1_2_.stat                  OE_NC_2_.stat

I'm so confused...

ADD REPLY
0
Entering edit mode

Cannot open intermediate_data/OE_NC_2_.temp/OE_NC_2__alignable_1.fq! It may not exist.

Looks like you are missing some files.

BTW this is not a bugsince those are generally problems associated with programs themselves. This seems to be an issue with your execution and/or data.

ADD REPLY
0
Entering edit mode

Their .fq files all look similar, should I rerun these programs?

ADD REPLY
0
Entering edit mode

Since we don't have access to your data we can't comment. Check the integrity of your input data files. Validate the files if you are not sure about their contents.

ADD REPLY
0
Entering edit mode

It seems that the files that generated the results also have significant issues. Except for one sample, all other samples only have over 200 transcripts with non-zero counts.

> table(df$intermediate_data.KO1_2_.genes.results==0)

FALSE  TRUE 
  295 20934
ADD REPLY
0
Entering edit mode

One of them looks okay, but I don't quite believe it.

> table(df$`intermediate_data.OE1_1_.genes.results`==0)

FALSE  TRUE 
21200    29
ADD REPLY
0
Entering edit mode

Pallondyle : Use ADD REPLY/ADD COMMENT to add additional information to existing threads. ADD YOUR ANSWER is meant for adding new answers to the original question.

ADD REPLY
0
Entering edit mode

Can you check using featureCounts to make sure things look reasonable with alignments? You could also use salmon to do transcript quantitation instead of rsem.

ADD REPLY
0
Entering edit mode

Thx. I found out that I used the wrong reference genome. I have a set of data from mice and a set from humans. I processed the human data using the mouse reference genome and the mouse data using the human reference genome.

ADD REPLY
0
Entering edit mode
7 weeks ago
Michael 55k

Something went wrong during star alignments of samples OE1_2, OE1_3, and OE_NC_1. The obvious place to look for the reason is in the Star logs.

Try cat OE1_2*Log*. This should point you to the root cause of the problem.

ADD COMMENT
0
Entering edit mode

Thx. It turns out that the file was damaged during transmission.

ADD REPLY
0
Entering edit mode

Ok, that's quite common but can cause a lot of fuzz (if you notice it, even more if it goes unnoticed). Try to check the md5 sums generated by the sequencing facility and do a round of fastp->multiqc before each analysis run.

ADD REPLY

Login before adding your answer.

Traffic: 2337 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6