EDIT: I managed to obtain the IR events. I used R and IRanges. I classified an event as intron retention if each base of the intron had a read mapped on to it. Basically, it should have 100% coverage. Another (more) conservative way of doing it would be, I think, to take those introns only where there are reads that start at every position of the intron; but in our case this was too conservative. Thanks again!
Hi, I have RNA-Seq data (for Tomato) from which I used tophat to obtain alignments to the Tomato genome. From the junctions file and the gff3 gene file I was able to identify normally spliced sites, alternative 5', alternative 3' splicing and exon skipping events.
Procedure: I created two two-way hash of junctions (introns) from the gff3 gene file.
1a) hash{start_junction} = end_junction
1b) hash{end_junction} = start_junction
2a) hash{this_start_junction} = next_start_junction
2b) hash{next_start_junction} = this_start_junction
From here, given at least 1 junction that exists in this hash, I can find out if its any of the normal, alternative 5', alternative 3' or exon skipping events. I also look for a junctions with threshold (+/- 5), so that some errors due to mapping are taken care of.
However, in case of alternative 5' and 3' splice junctions, the start and end points should occur in between exon/intron, I suppose and so its not that straightforward to identify.
In case of intron retention, they wouldn't be in the junction file output of tophat and so this would be an entirely different issue and I am not quite sure how to proceed other than looking at the expression levels.
Has anyone done this before? Are there any existing tools that can do this? Or if there are any other ways of going about identifying intron retention and alternative 5' and 3' junctions. I would really appreciate it.
Thank you,
Arun.
Since you have already run tophat on your data, why don't you run cufflinks on it ? It will construct the alternative transcripts from it which you can query for alternative 5' / 3' splice sites.
Keep in mind that you can get false positives for intron retention if your samples were contaminated with genomic DNA, which is never spliced. If you want to study intron retention, you need to have some idea of what fraction of your sample is DNA. In the worst case I've seen, an "RNA"-Seq sample was actually 85% genomic DNA.
I don't follow your question fully, but it seems to me you can have one-to-many for so, e.g.:
hash{start_junction} = [end_junc1, end_junc2, ...]
What is your desired output?
I have a sam (converted from bam) file output which is aligned using tophat. It contains reads that match perfectly and reads that are mapped across junctions. I would be able to identify them with the CIGAR string in the SAM file. I would like to identify all forms of splicing from it.
Thanks Ryan. Yes, it is very much possible. I categorize them as intron retention event only when the whole intron is covered by reads; i.e, each base has at least one read mapped on to them. I guess this is pretty conservative measure to minimize the false positives...