Question

Why does tophat find overlong introns?

10.6 years ago

Ann ★ 2.4k

I ran tophat2 with default parameters and got a number of junctions that are way too long for my reference genome, which is Arabidopsis. The image below illustrates the problem; notice the features with tophat-style junction names.

See: https://www.dropbox.com/s/tbvwdyhstivuq54/OverlyLongIntronsForBioStar.png

Note there do not seem to be a lot of these extra-big introns, but they are very noticeable in a genome browser!

Questions:

Have other people seen this and if yes, how do you handle it?
Is there a good maximum intron size that allows tophat2 to find the real introns but keeps it from finding these clearly wrong ones?
And, could the "wrong" introns be biologically interesting?

alignment RNA-Seq splicing tophat • 2.8k views

updated 3.2 years ago by Ram 44k • written 10.6 years ago by Ann ★ 2.4k

You can use the option -I / --max-intron-length to set a threshold for the intron length (the default is 500000).

Do you have a comprehensive annotation of the genes (a GTF file)? If so, you can provide it to tophat with the option -G to improve the alignment, and if you like you can even align the reads exclusively to the annotated transcripts (option -T / --transcriptome-only).
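
To make the two options concrete, here is a minimal Python sketch that only assembles a tophat2 command line with -I and, optionally, -G. It does not run tophat2. The index name, read file, GTF path, output directory, and the 5000 bp cutoff are all placeholders chosen for illustration, not values from this thread; pick a cutoff that suits your own annotation.

```python
import shlex


def tophat2_command(index, reads, out_dir="tophat_out", max_intron=5000, gtf=None):
    """Build a tophat2 argument list.

    max_intron=5000 is an illustrative placeholder, not a recommended value.
    """
    cmd = ["tophat2", "-o", out_dir, "-I", str(max_intron)]
    if gtf is not None:
        # -G supplies known gene models to guide the alignment
        cmd += ["-G", gtf]
    # tophat2 takes the Bowtie index base name, then a comma-separated read list
    cmd += [index, ",".join(reads)]
    return cmd


# Hypothetical file names, for illustration only
cmd = tophat2_command("tair10_index", ["sample_R1.fastq"], gtf="genes.gtf")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the placeholder names are replaced with real files.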

updated 4.9 years ago by Ram 44k • written 10.6 years ago by Martombo ★ 3.1k

Ram · Answer 1 · 2014-04-23

Thanks to Martombo for the comment, which is right on target. Setting -I to some reasonable value fixes this.

I think it's interesting that some of these ridiculously long introns have many thousands of supporting reads. I wonder if there is something biologically meaningful going on here? Maybe tophat is picking up on duplications? Or it could just be an artifact of alignment and not at all relevant to anything.
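
For anyone who would rather list these junctions than hunt for them in a browser, here is a rough Python sketch. It assumes the usual tophat junctions.bed layout: a 12-column BED line per junction, with two blocks for the flanking overhangs and the score column holding the number of supporting reads. The 5000 bp threshold and the example lines are made up for illustration.

```python
def parse_junctions(lines):
    """Yield (name, chrom, intron_start, intron_end, score) from BED12 junction lines."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue  # skip header and blank lines
        f = line.rstrip("\n").split("\t")
        chrom, start, end, name, score = f[0], int(f[1]), int(f[2]), f[3], int(f[4])
        block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        # The intron is the gap between the two overhang blocks
        yield name, chrom, start + block_sizes[0], end - block_sizes[-1], score


def long_junctions(lines, max_intron=5000):
    """Return junctions whose intron exceeds max_intron, highest score first."""
    hits = [j for j in parse_junctions(lines) if j[3] - j[2] > max_intron]
    return sorted(hits, key=lambda j: j[4], reverse=True)


# Made-up example lines; with real data, pass an open junctions.bed file instead
example = [
    'track name=junctions description="TopHat junctions"\n',
    "Chr1\t1000\t1300\tJUNC00000001\t12\t+\t1000\t1300\t255,0,0\t2\t50,60\t0,240\n",
    "Chr1\t2000\t52000\tJUNC00000002\t3400\t-\t2000\t52000\t255,0,0\t2\t40,45\t0,49955\n",
]
for name, chrom, s, e, score in long_junctions(example):
    print(name, chrom, e - s, score)
```

Sorting by score puts the long junctions with the most read support at the top, which are the ones worth a closer look for duplications or alignment artifacts.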

If you have some time on your hands and want to explore, I've published the data on Galaxy as https://usegalaxy.org/u/aloraine/h/cold-stress-in-arabidopsis. You can import it into your history and then click the "View in IGB" links next to the Junction files to see the junctions. Once there, set the label field to "score" and IGB will display the number of supporting reads on top of each junction feature. You can also use the "color by score" feature to make the high-scoring junctions more obvious.