Question

Large Count Discrepancy of Key Gene Between STAR and HISAT2

0

Entering edit mode

3.9 years ago

dk0319 ▴ 70

Has anyone with experience using both HISAT2 and STAR ever observed a large difference (0 vs 400+) in the counts for a gene depending on which aligner they used?

I processed some RNA-seq data first using STAR and used quantmode to get the counts. Afterwards I checked a few genes that are known to be regulated by the target protein of interest that was inhibited with a drug. Most of the genes looked fine (i.e. down in treated and up in vehicle). But, one gene called IFI27 had a 0 count across all samples, I confirmed this result using several other counting programs (htseq and feature counts). I then aligned using hisat2 and counted with both htseq and featurecounts, which produced the expected result for IFI27 with more than 400 counts for the vehicles and 17 counts for the treated samples. Other genes had small differences but nothing close to the difference for IFI27.

So to me it makes sense to use hisat2 for my mapping. However, I am bothered by the fact that STAR is highly regarded but somehow could not handle this specific gene. I was curious to see if anyone else has encountered this or has any thoughts on what might be happening.

Update: I recently mapped and quantified with Salmon and again this produced 400+ counts for IFI27

STARindex Command

module load star/2.7.3a
    STAR --runThreadN 23 --runMode genomeGenerate --genomeDir /home --genomeFastaFiles /home/RNA/GRCh38.p13_genomic.fna --sjdbGTFfile /home/RNA/GRCh38.p13_genomic.gtf --sjdbOverhang 149

STARalign Command

STAR \
--outSAMattributes All \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts \
--readFilesCommand gunzip -c \
--runThreadN $Threads \
--sjdbGTFfile $GTFFILE \
--outReadsUnmapped Fastx \
--outMultimapperOrder Random \
--outWigType wiggle \
--genomeDir $GENOMEDIR \
--readFilesIn ${INPUTDIR}/${OUTPREFIX}_1.fq.gz ${INPUTDIR}/${OUTPREFIX}_2.fq.gz \
--outFileNamePrefix $OUTPREFIX

HISAT2index Command

hisat2-build -p 23 -f /home/refFiles/GRCh38.p13_genomic.fna GRCh38

HISAT2align Command

hisat2 -p 8 -q -x $GENOMEDIR -1 ${INPUTDIR}/${OUTPREFIX}_1.fq.gz -2 ${INPUTDIR}/${OUTPREFIX}_2.fq.gz | \

RNA-Seq alignment • 2.6k views

ADD COMMENT • link 3.9 years ago by dk0319 ▴ 70

1

Entering edit mode

Did you check the alignment results in a genome browser like IGV? That might give you some ideas about what caused the differences.

ADD REPLY • link 3.9 years ago by igor 13k

0

Entering edit mode

I will try that, I think its something with STAR, because I had counts when I quantified with Salmon

ADD REPLY • link 3.9 years ago by dk0319 ▴ 70

1

Entering edit mode

You are confounding STAR mapping with STAR quantification (and mapping and quantification in general). STAR, in particular, only counts reads to genes if the reads can be unambiguously assigned to it. It may well be the case STAR mapped many reads to the gene in question, but counted 0 reads, because the reads could be multi-mappers (e.g., there is a paralogue of the gene), or reads couldn't be unambiguously assigned to the gene (e.g. the gene overlaps with another gene).

So, as igor suggested, to check if the mappings differ between STAR and HISAT2, load both bams in IGV and browse to the region in question.

Also, HISAT2 only performs mapping, so your question is incomplete, as you didn't say how you quantified read counts in the case of HISAT2 mapping.

ADD REPLY • link 3.9 years ago by h.mon 35k

0

Entering edit mode

I double checked the counting of the bam files generated by STAR using STARquant-mode, ht-seq and featurecounts all say 0 for IFI27. I did the same for HISAT2, excluding STARquant-mode, They all gave counts of 400+. I also mapped and quantified using SALMON and TXImeta these also showed positive counts of 400+.

This suggests STAR is either failing to align the relevant reads to IFI27 or failing to count them as you stated, however that fails to explain why every counting program gives a count of 0 for the bam files from STAR, but not HISAT2, and why SALMON also shows positive counts.

I hope to check the alignment sometime this week, though I am new to IVG and have run into issues with the index not being recognized. So I will need to troubleshoot that first.

ADD REPLY • link 3.9 years ago by dk0319 ▴ 70

1

Entering edit mode

This suggests STAR is either failing to align the relevant reads to IFI27 or failing to count them

Based on the information here, I don't think you can say if STAR is failing or if the other methods are counting problematic reads.

ADD REPLY • link 3.9 years ago by igor 13k

0

Entering edit mode

I was able to use IGV to view the alignments and both the STAR and HISAT2 bam files show a large number of mapped reads for IFI27. I reran feature counts on the bam files from both mappers to double check the counts for IFI27 and got the same results as previous. So, now I am lost as to how STAR maps the reads so that none of the counting programs recognize IFI27 features, but HISAT2 maps such that they can be recognized

Is STAR known for being more stringent then other aligners?

FeatureCount Summary

Status  Hisat2_M1_Treated_2.sorted.bam  Hisat2_M1_Vehicle_2.sorted.bam  STAR_M1_Vehicle_2.sorted.bam
Assigned    18426492    18323120    18373644
Unassigned_Unmapped 539735  551410  0
Unassigned_Read_Type    0   0   0
Unassigned_Singleton    0   0   0
Unassigned_MappingQuality   0   0   0
Unassigned_Chimera  0   0   0
Unassigned_FragmentLength   0   0   0
Unassigned_Duplicate    0   0   0
Unassigned_MultiMapping 6446091 6487714 7647208
Unassigned_Secondary    0   0   0
Unassigned_NonSplit 0   0   0
Unassigned_NoFeatures   2625354 2780778 2705081
Unassigned_Overlapping_Length   0   0   0
Unassigned_Ambiguity    906053  745438  685688

ADD REPLY • link 3.9 years ago by dk0319 ▴ 70

0

Entering edit mode

both the STAR and HISAT2 bam files show a large number of mapped reads for IFI27

You should check the individual reads. Are they high quality, multi-mappers, etc? Does it look like they are actually mapping properly?

ADD REPLY • link 3.9 years ago by igor 13k

0

Entering edit mode

They appear to be spanning exon junctions. Is there a method for assessing the individual reads, besides visually inspecting them?

ADD REPLY • link 3.9 years ago by dk0319 ▴ 70

1

Entering edit mode

3.9 years ago

swbarnes2 14k

STAR is using htseq count. Personally, I'd use RSEM to do gene counts instead of what you've used. The discrepancy is likely caused by how each software deals with reads which can't be assigned unambiguously.

ADD COMMENT • link 3.9 years ago by swbarnes2 14k

0

Entering edit mode

So are you suggesting the discrepancy is occurring during the alignment or counting?

ADD REPLY • link 3.9 years ago by dk0319 ▴ 70

score 1 · Accepted Answer · 2021-01-06

After viewing the HISAT2 mapped reads with CLC software (cleaner compared to IGV), it appears that while there are a large number of reads that map to the IFI27 locus the majority, but not all, appear to be multi-mapping. When I repeated this with the STAR aligned reads, all reads mapped to IFI27 were multi mapping which would explain why STAR does not quantify any of the reads. Mapping with HISAT2 does show that there are some unique mapped reads to this locus and supports the non-zero counts obtained after quantification with htseq and feature counts. However, viewing both alignments as a tracklist shows that while STAR considers a read to not be unique, the exact same read is labeled as unique using HISAT2. I hope to be able to accurately quantify this gene using Salmon using a transcriptome index and genome decoy.

See image STAR (top) HISAT2 (bottom) yellow=multimapping blue/purple=paired reads green/red=forward/reverse single end https://ibb.co/Bf3SNkn