I ran tophat2 with default parameters and got a number of junctions that are far too long for my reference genome, which is Arabidopsis. This image illustrates the problem: notice the features with tophat-style junction names.
written 10.6 years ago by Ann • updated 3.2 years ago by Ram
You can use the option -I / --max-intron-length to set a threshold for the intron length (the default is 500000).
Do you have a comprehensive annotation of the genes (GTF file)? If so, you can provide it to tophat with the option -G to improve the alignment, and optionally you can align the reads exclusively to it.
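For anyone who wants to script this, here is a minimal sketch that assembles a tophat2 command line using the -I and -G options mentioned above. The index prefix, file names, output directory, and the 5000 bp cutoff are all made-up placeholders, not recommendations; pick a cutoff that suits your genome. The sketch only builds and prints the command, it does not run tophat2.

```python
import shlex


def build_tophat2_command(index_prefix, reads, max_intron_length=None,
                          gtf=None, output_dir=None):
    """Assemble a tophat2 argument list. Nothing is executed here."""
    cmd = ["tophat2"]
    if max_intron_length is not None:
        # -I / --max-intron-length: ignore candidate introns longer than this
        cmd += ["-I", str(max_intron_length)]
    if gtf is not None:
        # -G: supply a gene annotation (GTF) to guide the alignment
        cmd += ["-G", gtf]
    if output_dir is not None:
        # -o: output directory
        cmd += ["-o", output_dir]
    # Positional arguments: Bowtie index prefix, then comma-separated read files
    cmd.append(index_prefix)
    cmd.append(",".join(reads))
    return cmd


# Hypothetical inputs, for illustration only.
cmd = build_tophat2_command(
    "TAIR10",
    ["sample.fastq"],
    max_intron_length=5000,  # assumed cutoff, not a recommendation
    gtf="TAIR10.gtf",
    output_dir="tophat_out",
)
print(shlex.join(cmd))
```

To actually launch the run, the list could be passed to `subprocess.run(cmd, check=True)`.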
written 10.6 years ago by Martombo • updated 4.9 years ago by Ram
Thanks to Martombo for the comment! The comment is right on target. Setting -I to some reasonable value fixes this.
I think it's interesting that some of these ridiculously long introns have many thousands of supporting reads. I wonder if there is something biologically meaningful going on here? Maybe tophat is picking up on duplications? Or it could just be an artifact of alignment and not at all relevant to anything.
If you have some time on your hands and want to explore, I've published the data on Galaxy as https://usegalaxy.org/u/aloraine/h/cold-stress-in-arabidopsis. You can import it into your history and then click the "View in IGB" links next to the Junction files to see the junctions. Once there, set the label field to "score" and IGB will display the number of supporting reads on top of each junction feature. You can also use the "color by score" feature to make the high-scoring junctions more obvious.
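If you would rather list the long junctions outside IGB, here is a rough sketch that scans a tophat junctions.bed file and reports junctions above a length cutoff together with their read support. It assumes the usual tophat layout: a 12-column BED record with two blocks per junction, where the intron is the gap between the blocks and the score column holds the number of supporting alignments. The two example records and the 5000 bp cutoff are made up for illustration.

```python
def long_junctions(bed_lines, max_intron_length):
    """Yield (chrom, intron_start, intron_end, length, reads) for each
    junction whose intron is longer than max_intron_length."""
    for line in bed_lines:
        # Skip blank lines, the UCSC track header, and comments
        if not line.strip() or line.startswith(("track", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, chrom_start, score = fields[0], int(fields[1]), int(fields[4])
        block_sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
        block_starts = [int(x) for x in fields[11].rstrip(",").split(",")]
        # The intron is the gap between the first and second block
        intron_start = chrom_start + block_sizes[0]
        intron_end = chrom_start + block_starts[1]
        length = intron_end - intron_start
        if length > max_intron_length:
            yield chrom, intron_start, intron_end, length, score


# Made-up records: one 60 kb junction with many reads, one 200 bp junction.
example = [
    'track name=junctions description="TopHat junctions"',
    "Chr1\t1000\t61100\tJUNC00000001\t4200\t+\t1000\t61100\t255,0,0\t2\t50,50\t0,60050",
    "Chr1\t2000\t2300\tJUNC00000002\t12\t-\t2000\t2300\t255,0,0\t2\t40,60\t0,240",
]

hits = list(long_junctions(example, 5000))
for chrom, start, end, length, reads in hits:
    print(f"{chrom}:{start}-{end}\tlength={length}\treads={reads}")
```

For a real file, pass an open file handle instead of the example list, e.g. `long_junctions(open("junctions.bed"), 5000)`.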
written 10.6 years ago by Ann • updated 4.9 years ago by Ram
I have seen that Tophat likes to link pseudogenes; it will place reads wherever bowtie says they fit well. If a gene is duplicated, the reads will fit in both places, and each copy can probabilistically get assigned half of the reads. This will underestimate the true gene's expression, but hopefully in a consistent way between cases and controls.
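One way to gauge how much this matters in a given data set is to count how many alignment records belong to reads that map to more than one place. The sketch below does that on SAM text (for example, the output of `samtools view -h accepted_hits.bam`), assuming the records carry the standard optional NH:i tag (number of reported alignments for the read). Records without the tag are treated as unique, which is an assumption of this sketch. The example records are made up.

```python
def count_multimapped(sam_lines):
    """Return (unique, multi): counts of mapped alignment records with
    NH == 1 and NH > 1. Records lacking NH:i are counted as unique."""
    unique = multi = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4: read is unmapped
        nh = 1
        for tag in fields[11:]:
            if tag.startswith("NH:i:"):
                nh = int(tag[5:])
                break
        if nh > 1:
            multi += 1
        else:
            unique += 1
    return unique, multi


# Made-up records: r1 maps once, r2 maps to two loci, r3 is unmapped.
example_sam = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tChr1\t100\t50\t50M\t*\t0\t0\t*\t*\tNH:i:1",
    "r2\t0\tChr1\t500\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r2\t256\tChr3\t900\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]

counts = count_multimapped(example_sam)
print(f"unique alignments: {counts[0]}, multi-mapped alignments: {counts[1]}")
```

Note that this counts alignment records, not reads: a read with NH:i:2 contributes two records to the multi-mapped total.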