Question

Understanding Differing results between STAR and kallisto

0

Entering edit mode

9 months ago

Assa Yeroslaviz ★ 1.9k

I would like to understand why we get this huge differences when running the pipeline STAR -> featureCounts and compare the results to running the kallisto quant tool.

Attached are images of the quantification results done using these two workflows (exact commands are added below).

The gene highlighted in yellow is the one gene, we are interested in and looking at the STAR results, we were thinking that the experiment failed, but running kallisto to quantify the experiment, we can see a clear increase/decrease of the gene in question.

I would appreciate if someone can explain to me how these big difference occurs, or please help me to understand where I can look for the reason for that.

thanks

The Quantification results for the STAR/featureCounts are very low

STAR

while those found using kallisto are much higher.

kallisto

I understand that STAR and kllisto do the alignment/mapping in a different way and that one can expect the results to be different. But this kind of change makes me worried, that something in my code might be wrong . The command to map the samples to the index are as followed (mainly default parameters) :

# STAR
STAR --runThreadN 20 --genomeDir $starIndex --sjdbGTFfile file.gtf --sjdbOverhang 100 \
     --readFilesCommand zcat --readFilesIn R1_001.fastq.gz R2_001.fastq.gz \
     --outSAMtype BAM SortedByCoordinate
featureCounts -T 15 -a file.gtf -t exon -p -g gene_id -M -o featureCounts.geneLevel.txt *.bam

# Kallisto
   kallisto quant -i index.idx -o $base -t 12 \
        R1_001.fastq.gz R2_001.fastq.gz

quantification STAR kallisto RNA-Seq • 1.1k views

ADD COMMENT • link updated 9 months ago by rfran010 ★ 1.4k • written 9 months ago by Assa Yeroslaviz ★ 1.9k

1

Entering edit mode

Is your data unstranded (if it's stranded, you can make feature counts perform strand specific counts)?

Is the RNA-seq data polyA-enriched?

What does the gene locus look like (any overlapping features)?

What is the kallisto index you are using? What gtf file are you using for feature counts?

Are the counts for other genes also hugely inconsistent?

Is the kallisto data estimated counts or TPM?

To me, it's very hard to understand about what might be causing this huge difference without more details. Also, in terms of trends, they seem similar(?) so in general the tools are performing consistently, but something specific is causing more count assignment in Kallisto.

I don't think it would be a multimapping issue since you have the M flag in feature counts, so multi-mappers should be counted. To me, the most obvious possibility is that there is an overlapping feature that is found in the GTF file, but not in the transcriptome/kallisto index. This could be a non-coding gene inside or overlapping your gene of interest. This can be solved by looking at the locus, using the overlap flag in feature counts, or possible using strandedness to deconvolute the overlapping elements if they are on opposite strands.

First thing I would want to do is consider if it is a difference in the way it's report (e.g. if kallisto is showing you TPM, then that may be the reason), then get eyes on the locus, and then consider the differences in reference files (i.e. GTF vs Kallisto index).

ADD REPLY • link 9 months ago by rfran010 ★ 1.4k

score 3 · Accepted Answer · 2024-06-28

There are several reasons that you could be seeing expression in kallisto but not STAR. For example, this gene could have substantial (gene-level) multimapping being discarded by STAR -> featureCounts but included by kallisto. Alternatively, there could be k-mer matches leading to pseudoalignments for kallisto that cannot be extended into high quality alignments for STAR -> featureCounts.

To further investigate, one thing you might try to do is to see if properly handling the multimapping reads brings the results for this gene into greater agreement between these methods. For example, you could use the STAR -> RSEM or STAR -> salmon pipeline, which will make use of STAR's alignments, but resolve multimapping reads according to an underlying probabilistic model rather than discarding them. If these approaches agree with the high-expression, then you know that the kallisto estimate is likely correct and the depletion in STAR is due to the discarding (probably in featureCounts) of multimapping reads. On the other hand, if these other pipelines give very low expression for this gene, then perhaps this is not the issue. In that case, I would probably recommend trying to extract the reads assigned to this gene by kallisto, mapping those with STAR and seeing what happens to them (are they mapping somewhere else, or remaining unmapped entirely?).