I would like to understand why we get this huge differences when running the pipeline STAR
-> featureCounts
and compare the results to running the kallisto quant
tool.
Attached are images of the quantification results done using these two workflows (exact commands are added below).
The gene highlighted in yellow is the one gene, we are interested in and looking at the STAR results, we were thinking that the experiment failed, but running kallisto to quantify the experiment, we can see a clear increase/decrease of the gene in question.
I would appreciate if someone can explain to me how these big difference occurs, or please help me to understand where I can look for the reason for that.
thanks
The Quantification results for the STAR
/featureCounts
are very low
while those found using kallisto are much higher.
I understand that STAR and kllisto do the alignment/mapping in a different way and that one can expect the results to be different. But this kind of change makes me worried, that something in my code might be wrong . The command to map the samples to the index are as followed (mainly default parameters) :
# STAR
STAR --runThreadN 20 --genomeDir $starIndex --sjdbGTFfile file.gtf --sjdbOverhang 100 \
--readFilesCommand zcat --readFilesIn R1_001.fastq.gz R2_001.fastq.gz \
--outSAMtype BAM SortedByCoordinate
featureCounts -T 15 -a file.gtf -t exon -p -g gene_id -M -o featureCounts.geneLevel.txt *.bam
# Kallisto
kallisto quant -i index.idx -o $base -t 12 \
R1_001.fastq.gz R2_001.fastq.gz
Is your data unstranded (if it's stranded, you can make feature counts perform strand specific counts)?
Is the RNA-seq data polyA-enriched?
What does the gene locus look like (any overlapping features)?
What is the kallisto index you are using? What gtf file are you using for feature counts?
Are the counts for other genes also hugely inconsistent?
Is the kallisto data estimated counts or TPM?
To me, it's very hard to understand about what might be causing this huge difference without more details. Also, in terms of trends, they seem similar(?) so in general the tools are performing consistently, but something specific is causing more count assignment in Kallisto.
I don't think it would be a multimapping issue since you have the M flag in feature counts, so multi-mappers should be counted. To me, the most obvious possibility is that there is an overlapping feature that is found in the GTF file, but not in the transcriptome/kallisto index. This could be a non-coding gene inside or overlapping your gene of interest. This can be solved by looking at the locus, using the overlap flag in feature counts, or possible using strandedness to deconvolute the overlapping elements if they are on opposite strands.
First thing I would want to do is consider if it is a difference in the way it's report (e.g. if kallisto is showing you TPM, then that may be the reason), then get eyes on the locus, and then consider the differences in reference files (i.e. GTF vs Kallisto index).