Question

Read counting in RSEM and Salmon (alignment mode)

0

Entering edit mode

2.3 years ago

predeus ★ 2.1k

Hi all,

I was wondering if someone can clarify an issue for me. When counting reads for RNA-seq, you can use EM-based algorithms to correctly assign multimapping reads. Thus, if you have 5 single-end reads that map to 1 transcript, and 5 reads that map to 2, your BAM file will have 15 lines overall, but your quantification table should have read counts that sum to 10.

I'm trying to run a bit of a funky set-up when I quantify gene expression from a bunch of bacteria using RSEM or Salmon. However, I used each species as a "gene" and each chromosome/plasmid as a "transcript". What is surprising to me though is that a read that maps to e.g. 5 different places (16S from 5 similar operons) in the same bacterium is counted as 5 reads, not one. This behaviour is the same in both RSEM and Salmon (alignment-mode - that is, starting from a BAM).

My question is, how do these tools track unique reads? I thought it would be via NH or HI tags, but it doesn't seem to be the case - in case of a normal "transcriptomic" human BAM file Salmon works perfectly correctly without any tags at all.

If anybody has any suggestions, I would be very grateful.

quantification RSEM RNA-seq Salmon • 2.6k views

ADD COMMENT • link 2.3 years ago by predeus ★ 2.1k

0

Entering edit mode

Salmon uses EM to resolve multimappers by reassigning abundance estimates for each transcript isoform based on the abundance of reads that can be resolved uniquely to an isoform.

For example, imagine a gene with 3 exons and 2 isoforms, where isoform A has all 3 exons but isoform B only has two. You perform short-read RNA-seq and most reads are multimapping between the two transcript isoforms but you have no coverage of the exon unique to isoform A. You can safely assume that isoform B was the expressed isoform and thus shift the multimappers towards the abundance estimate of isoform B. This principle extends to reads multimapping to different genes also, so for example multimappers between paralogs can be resolved at the abundance level if there is some sequence variation between the paralogs.

I worry about making an entire plasmid or chromosome a transcript. Salmon reports back abundance estimates that are corrected for transcript size, so I'm not sure what effect it would have on abundance estimates with such an incorrect size. I'm also not sure whether the sparsity in coverage (due to large regions not transcribed) will effect EM.

The author of Salmon is a regular here, so perhaps they could chime in and give a much more informed take on this than me.

ADD REPLY • link 2.3 years ago by rpolicastro 13k

1

Entering edit mode

HI,

Salmon reports back abundance estimates that are corrected for transcript size

I'm pretty sure that should only concern TPM/FPKM, and I was interested in "raw" read counts. You are right in that TPM/FPKM would look wild - they do.

Rob was very kind to have answered in the Github issues, and the problem was with BAM sorting: it needs to be sorted by read name. This happens by default for transcriptomic BAM generated by e.g. STAR, but since my BAM is genomic, it was breaking everything.

ADD REPLY • link 2.3 years ago by predeus ★ 2.1k

score 2 · Accepted Answer · 2023-04-21

2

Entering edit mode

2.3 years ago

predeus ★ 2.1k

UPD: Problem solved, thanks to the kind (and lightning-fast) comment from Rob Patro here!

It was the BAM sorting - the file needs to be read name-sorted in order for RSEM/Salmon to work correctly.

ADD COMMENT • link 2.3 years ago by predeus ★ 2.1k