STAR vs. Kallisto
11 days ago
gogeni5529 ▴ 70

I was wondering how to decide when to use STAR to map my reads to the genome and when it is better to use kallisto to align the reads. From reading about the two tools, I think I understand the difference between them.

I have a data set in which we knocked out a specific gene. After sequencing I expected to see a huge difference between the WT and the KO, but this was visible only in the kallisto-quantified data. STAR followed by featureCounts found almost no reads mapped to this gene. Running featureCounts with the multiOverlap parameter gives the counts in the multiOl row below:

rowname     Sample_1    Sample_2    Sample_3    Sample_13   Sample_14   Sample_15
STAR_1      2           2           2           6           0           1
Kallisto    1956        2164        2429        1           0           1
multiOl     6171        6603        4355        353         548         469

When looking at the reads in the BAM files, there is a clear difference in the expression of my gene of interest between the two samples (in the image below I show the difference between samples 1 and 13). The GTF file shows there are two genes in this region and, from what I can see, the left part of the reads can be mapped to the first gene (red square), but on the right-hand side these reads are clearly mapped to my gene of interest (green square below).

The point of my question is to understand why STAR doesn't show the same result as kallisto and, maybe even more importantly, whether it is possible to configure STAR so that it behaves the same way as kallisto.

I appreciate your help

gene_of_interest

STAR kallisto alignment RNA-Seq mapping • 695 views
11 days ago
dsull ★ 7.5k

I use both STAR and kallisto extensively (and I help develop the latter).

kallisto is nice when the question you care about is "does this read come from transcript A or transcript B (or, if ambiguous, how likely is one vs. the other)?" So when I care about gene/transcript numbers (as well as speed!), I tend to use kallisto.

STAR can produce alignments even outside your transcriptome of interest (so the BAM file it produces is more comprehensive) and can identify novel splice junctions.

Anyway, in answer to your question, it has to do with how ambiguity is handled. By default in STAR, ambiguous reads (in instances when one gene overlaps another gene and a read happens to fall in such a region, or when a read aligns equally well to two places in the genome) are not counted. featureCounts partly solves this problem with that multiOverlap mode by saying "ok, the read is aligned to both gene A and gene B, let's just give a count to both genes for that read".
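To make the two counting rules above concrete, here is a toy sketch in Python. It is not what STAR or featureCounts actually run internally; the read sets, the gene names and both function names are made up for illustration. Each read is represented only by the set of genes it overlaps.

```python
from collections import Counter

# Hypothetical toy data: each read is the set of genes it overlaps.
# 3 reads hit only geneA, 100 hit both genes, 1 hits only geneB.
reads = [{"geneA"}] * 3 + [{"geneA", "geneB"}] * 100 + [{"geneB"}] * 1


def count_unique_only(reads):
    """Discard every read that overlaps more than one gene."""
    counts = Counter()
    for genes in reads:
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
    return counts


def count_multi_overlap(reads):
    """Give one full count to every gene a read overlaps."""
    counts = Counter()
    for genes in reads:
        for gene in genes:
            counts[gene] += 1
    return counts


unique_counts = count_unique_only(reads)
overlap_counts = count_multi_overlap(reads)
print("unique only:  ", dict(unique_counts))
print("multi-overlap:", dict(overlap_counts))
```

With these made-up numbers the first rule throws away the 100 shared reads entirely, while the second rule counts each of them twice (once per gene), which is one way a multi-overlap count can end up larger than the number of reads that truly came from the gene.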

kallisto uses a probabilistic mixture model (based on expectation-maximization) to say "hmm, I see 100 reads that map to both gene A and gene B; but I also see quite a few reads that map exclusively to gene A, so, given that gene A is more likely, I'll give gene A a count of 75 and gene B a count of 25". Another thing: kallisto can figure out quantifications by looking at splice junctions (if genes A and B overlap but a read crosses a splice junction in gene A; the read will go to gene A). To my knowledge, featureCounts (at least by default) uses a union-of-exon approach so it doesn't take such things into account (caveat: I don't really use featureCounts so I can't say definitively if this is true).
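The 75/25 split above can be reproduced with a very small expectation-maximization loop. This is only a sketch under strong assumptions, not kallisto's actual implementation: it has just two genes, ignores effective transcript lengths, and the function name `em_split` and the read numbers (30 reads unique to gene A, 10 unique to gene B, 100 ambiguous) are hypothetical.

```python
def em_split(n_a, n_b, n_ab, iterations=200):
    """Split n_ab ambiguous reads between gene A and gene B with a toy EM.

    n_a, n_b: reads that map only to gene A / only to gene B.
    n_ab:     reads compatible with both genes.
    Assumption: no length correction, so abundance is just a read fraction.
    """
    total = n_a + n_b + n_ab
    theta_a, theta_b = 0.5, 0.5  # initial guess: equal abundance
    for _ in range(iterations):
        # E-step: each ambiguous read is assigned to A with this probability.
        resp_a = theta_a / (theta_a + theta_b)
        # M-step: re-estimate abundances from unique plus fractional counts.
        theta_a = (n_a + n_ab * resp_a) / total
        theta_b = (n_b + n_ab * (1.0 - resp_a)) / total
    resp_a = theta_a / (theta_a + theta_b)
    return n_ab * resp_a, n_ab * (1.0 - resp_a)


amb_a, amb_b = em_split(n_a=30, n_b=10, n_ab=100)
print(f"ambiguous reads given to gene A: {amb_a:.2f}")
print(f"ambiguous reads given to gene B: {amb_b:.2f}")
```

In this simplified setting the ambiguous reads end up divided in the same ratio as the unique reads (30:10, i.e. 75 to A and 25 to B), which is the intuition behind "gene A is more likely, so it gets more of the shared reads".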

Anyways, it's late here where I live, but I think what I've said above accounts for the discrepancy you're observing :)

Additional note 1: If you want to get STAR to behave like kallisto, you can use a more advanced read assignment/quantification algorithm with STAR (e.g. as implemented in RSEM, and salmon can also take in the STAR BAM files to produce quantifications).

Additional note 2: If you have some reads that map to a gene as well as to some other unannotated region of the genome (edit: but slightly better), STAR in its default quantMode won't count it. kallisto may or may not depending on how you created the index (kallisto has the option to index "distinguishing flanking k-mers" to identify on-transcriptome mappings and off-transcriptome mappings should they overlap).


thanks dsull for the comprehensive responses both in the answer as well as in the comments. I appreciate the time.

A comment on your note #2: I am not sure which parameter you are referring to, as I don't see it in either the kb ref or the kallisto index help, and I'm using the newest versions (0.28 and 0.50.1 respectively). In my case I downloaded the provided mouse_index_standard file from the GitHub repository anyway. There, too, I can't find any mention of this distinction between the two. Can you please tell me which one you mean?

After reading the answer from ATpoint, I tested STAR -> Salmon and got results similar to those of the kallisto run. I don't think the STAR alignment is wrong or that STAR is doing a bad job here. I have always liked using STAR and have an established pipeline for it, which is the main reason I used it here as well. I agree the issue is not the alignment but the quantification of the aligned reads, which the salmon results only reinforce. I am just surprised that the difference is so extreme this time.

Have a good night


Of course! Oh, the parameter is --d-list. By default, in kb ref, --d-list is set to whatever FASTA you supply to kb ref but you can disable it by setting --d-list=None. (The option also exists in kallisto index although --d-list defaults to None in kallisto index). If you disable it, you won't process those "distinguishing flanking k-mers".

And yup, honestly, not too surprised at how extreme the result is. It's really a simple explanation: Multimappers are given a count of 0 in one situation but a count >0 in another. kallisto, salmon, STAR->RSEM, and STAR->salmon will all reliably fix this.

