Hi,
This may be a simple question but as far as I can tell the current tools don't quite do this. We have been using target RNA-seq to test the Junction abundance of certain genes, sadly due to probe limitations we were only able to do the junctions, very little of the exons are read and this causes difficult for many annotating tools (or I'm not using them correctly).
Basically all I want to do is take my genome coordinates from my alignment (STAR-2pass) and then annotate them as junctions (not transcripts). i.e. 1-2 or 1-3 exon junctions etc. Are there any tools out there that people know of that will annotate all the unique reads to exons they are contained in?
If not I will have to write some script for this, which will take me some time, and my be useful (limited) if no tools exists currently.
One of the output-files of STAR is the *SJ.out.tab which contains the position and inter alia the number of uniquely mapping reads crossing the junction. You can easily re-format the table to a bed format.
I must not have been clear in my original description. I want to use the coordinates from the SJ.out.tab file and use the start coordinate to return (from a GTF file) that exon it is anchored (i.e exon2) and then do the same for the stop coordinate. Basically I want to change my SJ.out.file to have these extra columns:
I have included that gene as a way to to a quick a very dirty check. There may be tweaks on this.
This is quite complex due to the fact that an exon cannot only be identified by one junction. Please take a look at e.g. ApoE detecting the last junction does not tell you which last exon is used or even the ratios.
You can try to run Cufflinks or StringTie on the reads and see if they can separate the transcripts by the junctions.
Alternatively, you need for identifying two exons three junctions:
Thus, you need to check for each exon if it is flanked by junction reads. Afterwards, you can assemble the selected exons by the
SJ.out.tab
file.Yes I am beginning to find this more and more complicated the further I get into this. I will create and graphic to help explain this in a bit more detail- We have used the TruSeq Targeted RNA sequencing (illumina) where we have selected the probes for each junctions of our genes of interest. This gives us only reads that span junctions, and very little of each exon. So low coverage, which makes it an issue for cufflinks. What I am going to have to do, is get from (likely the SAM file) the first base (chromosomal location) of the read and the last base before the split. Use this information against a GTF file to see which exon it maps to (become difficult with multiple transcripts) and then do the same for the second half of the read.
I was hoping, smarter people then I have already developed a tool for this, but because of the very specific nature of this it does not appear so. If you have any better ideas please do tell me, this method (target RNA seq) is all very new too me.