Question

Annotating RNA-Seq exon junctions

1

Entering edit mode

9.2 years ago

george.wiggins ▴ 10

Hi,

This may be a simple question but as far as I can tell the current tools don't quite do this. We have been using target RNA-seq to test the Junction abundance of certain genes, sadly due to probe limitations we were only able to do the junctions, very little of the exons are read and this causes difficult for many annotating tools (or I'm not using them correctly).

Basically all I want to do is take my genome coordinates from my alignment (STAR-2pass) and then annotate them as junctions (not transcripts). i.e. 1-2 or 1-3 exon junctions etc. Are there any tools out there that people know of that will annotate all the unique reads to exons they are contained in?

If not I will have to write some script for this, which will take me some time, and my be useful (limited) if no tools exists currently.

RNA-Seq Exon-Junctions STAR • 4.5k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by george.wiggins ▴ 10

1

Entering edit mode

One of the output-files of STAR is the *SJ.out.tab which contains the position and inter alia the number of uniquely mapping reads crossing the junction. You can easily re-format the table to a bed format.

ADD REPLY • link 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

I must not have been clear in my original description. I want to use the coordinates from the SJ.out.tab file and use the start coordinate to return (from a GTF file) that exon it is anchored (i.e exon2) and then do the same for the stop coordinate. Basically I want to change my SJ.out.file to have these extra columns:

StartExon StartGene StopExon StopGene

I have included that gene as a way to to a quick a very dirty check. There may be tweaks on this.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by george.wiggins ▴ 10

0

Entering edit mode

This is quite complex due to the fact that an exon cannot only be identified by one junction. Please take a look at e.g. ApoE detecting the last junction does not tell you which last exon is used or even the ratios.

You can try to run Cufflinks or StringTie on the reads and see if they can separate the transcripts by the junctions.

Alternatively, you need for identifying two exons three junctions:

 Junction1 -- Exon1_Start - Exon1_End -- Junction2 - Exon2_Start - Exon2_End -- Junction3

Thus, you need to check for each exon if it is flanked by junction reads. Afterwards, you can assemble the selected exons by the SJ.out.tab file.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by michael.ante ★ 3.9k

0

Entering edit mode

Yes I am beginning to find this more and more complicated the further I get into this. I will create and graphic to help explain this in a bit more detail- We have used the TruSeq Targeted RNA sequencing (illumina) where we have selected the probes for each junctions of our genes of interest. This gives us only reads that span junctions, and very little of each exon. So low coverage, which makes it an issue for cufflinks. What I am going to have to do, is get from (likely the SAM file) the first base (chromosomal location) of the read and the last base before the split. Use this information against a GTF file to see which exon it maps to (become difficult with multiple transcripts) and then do the same for the second half of the read.

I was hoping, smarter people then I have already developed a tool for this, but because of the very specific nature of this it does not appear so. If you have any better ideas please do tell me, this method (target RNA seq) is all very new too me.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by george.wiggins ▴ 10

Ram · Answer 1 · 2015-10-21

0

Entering edit mode

9.2 years ago

Martombo ★ 3.2k

I've got something in python that could help you on my github. You can use the method get_splice_sites, of a Bam object. It returns a dictionary with a string defining the junction and the number of counts in the bam file. At the moment this works on any splice site present in the data, it is not based on a gtf file. You should then parse the site to assign it to a gene, eventually. To create a Bam object you just have to supply the path to it and indicate the reads_orientation.

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Martombo ★ 3.2k

0

Entering edit mode

ps: if you run tophat this information is already in the output (junctions.bed)

ADD REPLY • link 9.2 years ago by Martombo ★ 3.2k

0

Entering edit mode

I believe the junctions.bed is similar to the SJ.out.tab in which you get the coordinates for each junction site, but no information on the actual exonic (i.e. START:genex -exon2 STOP:genex-exon4) location of the start and stop of the read?

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by george.wiggins ▴ 10

0

Entering edit mode

ok I might still have something useful. one thing is not completely clear to me: do you want to have the exon coordinates or the read ones (as you write in your last comment)? The former can be achieved from a gtf file with this function, but you would still need some parsing to join the information with the counts across the junction.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.2 years ago by Martombo ★ 3.2k