Why does TopHat find overlong introns?
10.6 years ago
Ann ★ 2.4k

I ran tophat2 with default parameters and got a number of junctions that are far too long for my reference genome, which is Arabidopsis. The image linked below illustrates the problem: notice the features with tophat-style junction names.

See: https://www.dropbox.com/s/tbvwdyhstivuq54/OverlyLongIntronsForBioStar.png

Note there do not seem to be a lot of these extra-big introns, but they are very noticeable in a genome browser!

Questions:

  • Have other people seen this and if yes, how do you handle it?
  • Is there a good maximum intron size that allows tophat2 to find the real introns but keeps it from finding these clearly wrong ones?
  • And, could the "wrong" introns be biologically interesting?

You can use the option -I / --max-intron-length to set an upper limit on the intron length (the default is 500000).

Do you have a comprehensive gene annotation (GTF file)? If so, you can pass it to TopHat with the -G option to improve the alignment, and optionally align the reads only to the annotated transcriptome (-T / --transcriptome-only).
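As a rough sketch of how these options fit together, the Python helper below assembles a tophat2 command line with -I and, optionally, -G. The function name, the file names, and the value 5000 are placeholders chosen for illustration, not recommendations from this thread; the sketch also assumes single-end reads, which tophat2 accepts as one comma-separated argument.

```python
import shlex


def build_tophat2_command(index_prefix, reads, max_intron_length,
                          gtf_path=None, output_dir="tophat_out"):
    """Return a tophat2 argument list (hypothetical helper, single-end reads)."""
    cmd = ["tophat2", "-o", output_dir, "-I", str(max_intron_length)]
    if gtf_path is not None:
        # -G supplies known gene models to guide the alignment
        cmd += ["-G", gtf_path]
    cmd.append(index_prefix)
    # several single-end read files are passed as one comma-separated argument
    cmd.append(",".join(reads))
    return cmd


# Placeholder index, annotation, and read file names; 5000 is only an example cap.
example = build_tophat2_command("TAIR10", ["sample.fastq"], 5000,
                                gtf_path="TAIR10.gtf")
print(shlex.join(example))
```

The list can be handed to subprocess.run as is; printing it first makes it easy to check the flags before starting a long alignment.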

10.6 years ago
Ann ★ 2.4k

Thanks to Martombo for the comment! It is right on target: setting -I to some reasonable value fixes this.

I think it's interesting that some of these ridiculously long introns have many thousands of supporting reads. I wonder if there is something biologically meaningful going on here. Maybe tophat is picking up on duplications? Or it could just be an alignment artifact and not relevant to anything.

If you have some time on your hands and want to explore, I've published the data on Galaxy as https://usegalaxy.org/u/aloraine/h/cold-stress-in-arabidopsis. You can import it into your history and then click the "View in IGB" links next to the Junction files to see the junctions. Once there, set the label field to "score" and IGB will display the number of supporting reads on top of each junction feature. You can also use the "color by score" feature to make the high-scoring junctions more obvious.
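For anyone who would rather list the suspicious junctions than browse for them, here is a minimal Python sketch. It assumes the junction file is TopHat's junctions.bed in 12-column BED format, with two blocks per junction and the number of supporting reads in the score column. The function names and the threshold are made up for illustration.

```python
def intron_span(bed_line):
    """Return (chrom, intron_start, intron_end, name, score) for one junction line."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    name, score = fields[3], int(fields[4])
    block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
    # the intron is the gap between the end of block 1 and the start of block 2
    intron_start = chrom_start + block_sizes[0]
    intron_end = chrom_start + block_starts[1]
    return chrom, intron_start, intron_end, name, score


def long_junctions(lines, max_intron_length):
    """List junctions whose intron exceeds the threshold, best supported first."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("track"):
            continue  # skip blank lines and the BED track header
        chrom, start, end, name, score = intron_span(line)
        length = end - start
        if length > max_intron_length:
            hits.append((score, name, chrom, start, end, length))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits
```

Calling long_junctions(open("junctions.bed"), 5000) would return the over-threshold junctions sorted by read support, which is one way to pick out the heavily supported long "introns" worth a closer look. The file path and the 5000 cutoff are placeholders.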


I have seen that TopHat likes to link pseudogenes; it will place reads wherever Bowtie says they fit well. If a gene is duplicated, its reads fit in both places, and each copy can end up being assigned about half of them probabilistically. This underestimates the true gene's expression, but hopefully in a way that is consistent between cases and controls.
