Hey all,
Our current pipeline for alignment and analysis uses TopHat + Cufflinks and we wish to replace TopHat with STAR. We are in particular interested in splice junctions so we ran STAR against a sample and compared both the junctions file from TopHat and STAR.
For some reason, TopHat seems to be reporting more junction reads than STAR, which goes against everything I've read online.
Here are the commands I used for both aligners:
STAR --genomeDir '/home/data/STAR_indexes' --runThreadN 32 --readFilesIn 1B_Stem_Cells_S42_R1_001.fastq 1B_Stem_Cells_S42_R2_001.fastq --outFileNamePrefix ./1B/1B_ --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --sjdbGTFfile /home/data/hg38_ref/hg38.refGene_gene_longest.gtf --twopass1readsN -1 --twopassMode Basic
tophat2 --b2-sensitive --no-coverage-search -G /home/data/hg38_ref/hg38_refGene_gene_longest.gtf -p 32 -o ./1A /home/data/bowtie2idx/hg38 1A_Stem_Cells_S41_R1_001.fastq 1A_Stem_Cells_S41_R2_001.fastq
Here are the first few lines of the sorted junctions file:
TopHat junctions.bed
chr1 14679 15100 JUNC00000001 1074 - 14679 15100 255,0,0 2 150,131 0,290
chr1 14902 185595 JUNC00000002 453 - 14902 185595 255,0,0 2 75,97 0,170596
chr1 14903 15907 JUNC00000003 215 - 14903 15907 255,0,0 2 135,112 0,892
chr1 15927 16728 JUNC00000004 60 - 15927 16728 255,0,0 2 20,122 0,679
chr1 15962 16714 JUNC00000005 109 - 15962 16714 255,0,0 2 65,108 0,644
chr1 15962 187236 JUNC00000006 42 - 15962 187236 255,0,0 2 65,108 0,171166
chr1 16606 187273 JUNC00000007 68 - 16606 187273 255,0,0 2 42,103 0,170564
chr1 16606 187245 JUNC00000008 54 - 16606 187245 255,0,0 2 48,69 0,170570
chr1 16716 17001 JUNC00000009 177 - 16716 17001 255,0,0 2 49,144 0,141
chr1 16857 187577 JUNC00000010 95 - 16857 187577 255,0,0 2 93,105 0,170615
chr1 16912 187577 JUNC00000011 23 - 16912 187577 255,0,0 2 104,39 0,170626
chr1 16908 17368 JUNC00000012 1277 - 16908 17368 255,0,0 2 147,136 0,324
chr1 16912 17719 JUNC00000013 24 - 16912 17719 255,0,0 2 143,114 0,693
chr1 16919 17974 JUNC00000014 13 - 16919 17974 255,0,0 2 136,60
STAR sj.out.tab
chr1 14830 14929 2 2 1 0 4 69
chr1 14830 14969 2 2 1 117 411 73
chr1 14830 15020 2 2 1 0 3 18
chr1 15039 15795 2 2 1 20 117 69
chr1 15039 186316 2 2 1 0 362 70
chr1 15948 16606 2 2 1 0 4 19
chr1 16028 16606 2 2 1 0 7 49
chr1 16766 16853 2 2 1 1 27 48
chr1 16766 16857 2 2 1 0 21 48
chr1 17056 17232 2 2 1 45 416 75
As you can see, STAR seems to be barely reporting any reads over the junctions, and I'm not sure why.
Has anyone in the community tried moving away from TopHat to STAR and encountered problems with junction reads?
If not, what are some recommended aligners for working with splice junctions?
Note that you can also run STAR in two-pass mode, which might be more accurate.
What makes you think having more junction reads is better?
I did try two-pass mode with STAR but it barely made any difference. It's just that the discrepancy between the two is so large I'm not actually sure which to trust.
Tophat has been shown to be inferior in pretty much every way. That it produces more junctions suggests that more of them are wrong.
Yes, no one should be using tophat anymore. This paper (https://www.nature.com/articles/nmeth.3317.pdf) explains the differences between the programs well and why HISAT/STAR are more superior. The makers of tophat recommend not using it anymore. Per the tophat2 website:
Is it safe to say that TopHat2's results above are wrong and should not be used for any downstream analysis?
Does anyone know of a different program that I can use to extract splice junction information? This way I have a third dataset for compare.