Question

STAR vs. Kallisto

3 months ago

gogeni5529 ▴ 80

I was wondering how to decide when to use STAR to map my reads to the genome and when it is better to use kallisto to align the reads. From reading about the two tools, I think I understand the difference between them.

I have a data set in which we knocked out a specific gene. After sequencing, I expected to see a huge difference between the WT and the KO, but this was visible only in the kallisto-quantified data. STAR followed by featureCounts found almost no reads mapped to this gene. Running featureCounts with the multiOverlap parameter gives the counts in the multiOl row below:

rowname   Sample_1  Sample_2  Sample_3  Sample_13  Sample_14  Sample_15
STAR_1           2         2         2          6          0          1
Kallisto      1956      2164      2429          1          0          1
multiOl       6171      6603      4355        353        548        469

Looking at the reads in the BAM files, there is a clear difference in the expression of my gene of interest between the two samples (the image below shows samples 1 and 13). The GTF file shows there are two genes in this region. From what I can see, the left part of the reads can be mapped to the first gene (red square), but on the right-hand side the reads are clearly mapped to my gene of interest (green square).

The point of my question is to understand why STAR doesn't show the same behavior as kallisto and, maybe even more important, whether it is possible to configure STAR to behave the same way as kallisto.

I appreciate your help

[Image: gene_of_interest]

STAR kallisto alignment RNA-Seq mapping

updated 3 months ago by dsull ★ 7.6k • written 3 months ago by gogeni5529 ▴ 80

score 4 · Answer 1 · 2025-04-25

I use both STAR and kallisto extensively (and I help develop the latter).

kallisto is nice when the question you care about is "does this read come from transcript A or transcript B (or, if ambiguous, how likely is one vs. the other)?" So when I care about gene/transcript-level counts (as well as speed!), I tend to use kallisto.

STAR can produce alignments even outside your transcriptome of interest (so the BAM file it produces is more comprehensive), and it can identify novel splice junctions.

Anyway, in answer to your question, it has to do with how ambiguity is handled. By default in STAR, ambiguous reads are not counted. A read is ambiguous when one gene overlaps another and the read happens to fall in the overlapping region, or when the read aligns equally well to two places in the genome. featureCounts partly solves this problem with that multiOverlap mode by saying "ok, the read is aligned to both gene A and gene B, let's just give a count to both genes for that read".
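To make the two counting policies concrete, here is a minimal Python sketch. It is not how STAR or featureCounts are implemented; the reads, the gene names, and the `count_reads` helper are all made up for illustration. Each toy read simply lists the genes it overlaps.

```python
from collections import Counter

# Hypothetical toy data: each read is the set of genes whose exons it overlaps.
read_overlaps = [
    {"geneA"},
    {"geneA", "geneB"},
    {"geneA", "geneB"},
    {"geneB"},
    {"geneA", "geneB"},
]


def count_reads(overlaps, allow_multi_overlap=False):
    """Count reads per gene under one of two simple policies.

    allow_multi_overlap=False: a read overlapping more than one gene is dropped.
    allow_multi_overlap=True: such a read adds one count to every gene it overlaps.
    """
    counts = Counter()
    for genes in overlaps:
        if len(genes) == 1 or allow_multi_overlap:
            for gene in genes:
                counts[gene] += 1
    return counts


strict = count_reads(read_overlaps)
permissive = count_reads(read_overlaps, allow_multi_overlap=True)
print("ambiguous reads dropped:", dict(strict))
print("ambiguous reads counted for every gene:", dict(permissive))
```

In this toy case the strict policy keeps only the two unambiguous reads, while the permissive policy counts each of the three shared reads twice, so the per-gene totals no longer add up to the number of reads.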

kallisto uses a probabilistic mixture model (based on expectation-maximization) to say "hmm, I see 100 reads that map to both gene A and gene B; but I also see quite a few reads that map exclusively to gene A, so, given that gene A is more likely, I'll give gene A a count of 75 and gene B a count of 25". Another thing: kallisto can figure out quantifications by looking at splice junctions (if genes A and B overlap but a read crosses a splice junction in gene A, the read will go to gene A). To my knowledge, featureCounts (at least by default) uses a union-of-exon approach, so it doesn't take such things into account (caveat: I don't really use featureCounts, so I can't say definitively whether this is true).
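The 75/25 split above can be reproduced with a very small expectation-maximization loop. This is only a sketch of the idea, not kallisto's actual model: it ignores transcript lengths, works on two genes only, and the `em_assign` function and the read counts (30 unique to A, 10 unique to B, 100 shared) are assumptions chosen for illustration.

```python
def em_assign(unique_a, unique_b, shared, n_iter=200):
    """Split shared reads between two genes with a toy EM loop.

    Returns the estimated read counts (count_a, count_b).
    Effective lengths are ignored, which a real quantifier would not do.
    """
    total = unique_a + unique_b + shared
    if total == 0:
        return 0.0, 0.0
    p_a, p_b = 0.5, 0.5  # start from equal abundances
    count_a, count_b = float(unique_a), float(unique_b)
    for _ in range(n_iter):
        # E-step: give each gene a fraction of the shared reads
        # proportional to its current abundance estimate.
        share_a = shared * p_a / (p_a + p_b)
        count_a = unique_a + share_a
        count_b = unique_b + (shared - share_a)
        # M-step: re-estimate abundances from the fractional counts.
        p_a, p_b = count_a / total, count_b / total
    return count_a, count_b


count_a, count_b = em_assign(unique_a=30, unique_b=10, shared=100)
print("shared reads given to A:", round(count_a - 30))
print("shared reads given to B:", round(count_b - 10))
```

With these made-up numbers the unique reads favor gene A three to one, and the loop settles on giving A about 75 of the 100 shared reads and B about 25.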

Anyways, it's late here where I live, but I think what I've said above accounts for the discrepancy you're observing :)

Additional note 1: If you want to get STAR to behave like kallisto, you can pair STAR with a more advanced read assignment/quantification algorithm (e.g. the one implemented in RSEM; salmon can also take STAR BAM files as input to produce quantifications).
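As a rough sketch of the STAR-then-salmon route, the snippet below only builds and prints the two command lines; it does not run them. All file and directory names (`star_index`, `reads_1.fastq.gz`, `transcripts.fa`, the `sample1_` prefix) are placeholders, and it assumes the STAR index was built with a GTF annotation and that `transcripts.fa` contains the same transcripts as that annotation.

```python
import shlex

prefix = "sample1_"  # hypothetical output prefix

# Ask STAR to also write alignments in transcriptome coordinates.
star_cmd = [
    "STAR",
    "--genomeDir", "star_index",
    "--readFilesIn", "reads_1.fastq.gz", "reads_2.fastq.gz",
    "--readFilesCommand", "zcat",
    "--outSAMtype", "BAM", "Unsorted",
    "--quantMode", "TranscriptomeSAM",
    "--outFileNamePrefix", prefix,
]

# STAR names the transcriptome BAM <prefix>Aligned.toTranscriptome.out.bam.
transcriptome_bam = prefix + "Aligned.toTranscriptome.out.bam"

# salmon in alignment-based mode: -t transcript FASTA, -a alignments,
# -l A to let salmon infer the library type.
salmon_cmd = [
    "salmon", "quant",
    "-t", "transcripts.fa",
    "-l", "A",
    "-a", transcriptome_bam,
    "-o", prefix + "salmon",
]

print(shlex.join(star_cmd))
print(shlex.join(salmon_cmd))
```

Check the options against the STAR and salmon versions you have installed before running anything; RSEM can consume the same transcriptome BAM in a similar way.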

Additional note 2: If you have some reads that map to a gene but map slightly better to some other unannotated region of the genome, STAR in its default quantMode won't count them. kallisto may or may not, depending on how you created the index (kallisto has the option to index "distinguishing flanking k-mers" to tell on-transcriptome mappings from off-transcriptome mappings should they overlap).