Question

How to extract reads supporting protein-coding gene splice junctions from STAR's SJ.out.tab output

1

Entering edit mode

9.7 years ago

Nicolas Rosewick 11k

Hi,

I aligned my RNA-Seq reads using STAR on human hg19 genome index build with ENSEMBL annotation (--sjdbGTFfile option at the genome index generation). So in my SJ.out.tab file I will have information about the number of reads supporting splice junction as described in STAR manual :

column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang

In ENSEMBL annotation, there are all types of RNA (mRNA, tRNA, lncRNA,...). But for my research I've to compute the number of uniquely mapped reads crossing only protein-coding genes splice junction. Is there a way to download splice acceptor/donor position in hg19 for protein-coding genes in order to subtract the read count from the SJ.out.tab file?

Thanks

splice STAR • 6.2k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by Nicolas Rosewick 11k

Ram · Accepted Answer · 2015-06-16

3

Entering edit mode

9.7 years ago

Devon Ryan 105k

HISAT actually comes with a python script that can extract splice sites from a GTF file. So just take your GTF file, extract only the types of entries that you want to include, run extract_splice_sites.py on the results and then intersect the results appropriately.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by Devon Ryan 105k

1

Entering edit mode

I re-wrote that script to also include input via pipe, so there's no need for the intermediate files. The only thing to add is that it outputs 0-based offsets for use by HISAT internally - both GTF and SAM are 1-based formats, so perhaps take that into account.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.7 years ago by george.ry ★ 1.2k

0

Entering edit mode

Thanks Devon. That's perfect. I extract the splice sites and overlapped them with my data using R.

ADD REPLY • link 9.7 years ago by Nicolas Rosewick 11k