Question

Diverging read alignment and counts in TopHat vs. HISAT2

1

Entering edit mode

6.8 years ago

Thomas ▴ 30

Hi Everyone,

We have commissioned RNA-seq and analysis by a company, which provided us with raw fastq files, BAM files, and a count matrix. They used hard clipping and Tophat for the alignment to GRCm38/Mm10. I have attempted to recreate their analysis with HISAT2 (same reference genome), using simply the default parameters and no separate trimming/clipping. I have used samtools to convert the SAM files to BAM files and compared the results from the company's analysis ("Tophat") with my own ("HISAT2") using IGV. The results are very confusing to me. The majority of genes I have (randomly) inspected look highly similar between both sets of BAM files. See this example gene (Tophat in blue, HISAT2 in red):

Tnf

So far, so good. However, there are also multiple instance where one analysis picked up good reads, while the other did not. This is true in both directions. See these two example genes:

Gapdh

Il12b

And, finally, there are some genes in which one alignment just looks weirdly skewed. For instance:

Ubc

Does anyone know what might account for these differences? Or which alignment I should use for downstream analysis? I'd be grateful for any feedback!

Thomas

RNA-Seq genome alignment • 2.2k views

ADD COMMENT • link 6.7 years ago by Thomas ▴ 30

0

Entering edit mode

Please consider this.

Please stop using Tophat https://t.co/Es4ohxOEyx Cole and I developed the method in *2008*. It was greatly improved in TopHat2 then HISAT & HISAT2. There is no reason to use it anymore. I have been saying this for years yet it has more citations this year than last #methodsmatter
— Lior Pachter (@lpachter) December 2, 2017

You should know that the old 'Tuxedo' pipeline of Tophat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated/ in low maintenance and should be replaced by HISAT2, StringTie and ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and bbmap, or pseudo-alignment using salmon.

ADD REPLY • link 6.8 years ago by GenoMax 148k

0

Entering edit mode

But even with that in mind, it's still odd that something like Gapdh (non-repetitive sequence that is well annotated) would be so different.

ADD REPLY • link 6.8 years ago by igor 13k

0

Entering edit mode

Have a look at GAPDH in the UCSC genome browser, don't forget to activate the track with segmental duplications ("Segmental Dups").

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Entering edit mode

It's clean for mm10 (I am assuming this is mouse based on the genes in the screenshot).

ADD REPLY • link 6.7 years ago by igor 13k

0

Entering edit mode

Yes, it's GRCm38/Mm10. I should have mentioned that (just edited the original entry to include the information). So you think this is an acceptable level of quality for Mm10 alignments?

ADD REPLY • link 6.7 years ago by Thomas ▴ 30

0

Entering edit mode

There is not enough information in the original post to draw a solid conclusion. Yes there appear to be differences but we have no idea if one or both analyses have some problems associated with them.

ADD REPLY • link 6.7 years ago by GenoMax 148k

0

Entering edit mode

Right, exactly my problem! I can't tell if either of these is better than the other, if one is seriously flawed or if both of them are acceptable. What additional information would be helpful in determining this?

ADD REPLY • link 6.7 years ago by Thomas ▴ 30

score 2 · Accepted Answer · 2018-05-05

2

Entering edit mode

6.7 years ago

Thomas ▴ 30

Apparently, this problem is inherent to IGV. When we ran the same files that showed no reads on a different operating system, everything looked just fine. There are, of course, still differences between the Tophat and the HISAT2 alignments, but nothing is missing altogether. I will try to switch to the UCSC Genome Browser, as suggested by WouterDeCoster. Thank you all for your input!

ADD COMMENT • link 6.7 years ago by Thomas ▴ 30

0

Entering edit mode

IGV has certain defaults about what data it shows and how much of the data it shows. You could look into those under preferences. There are many things you can change and perhaps there are differences in sam tags that are causing this odd display behavior. Something to consider.

ADD REPLY • link 6.7 years ago by GenoMax 148k