Hi,
I aligned my RNA-Seq reads using STAR on human hg19 genome index build with ENSEMBL annotation (--sjdbGTFfile
option at the genome index generation). So in my SJ.out.tab file I will have information about the number of reads supporting splice junction as described in STAR manual :
column 1: chromosome
column 2: first base of the intron (1-based)
column 3: last base of the intron (1-based)
column 4: strand (0: undefined, 1: +, 2: -)
column 5: intron motif: 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5:AT/AC, 6: GT/AT
column 6: 0: unannotated, 1: annotated (only if splice junctions database is used)
column 7: number of uniquely mapping reads crossing the junction
column 8: number of multi-mapping reads crossing the junction
column 9: maximum spliced alignment overhang
In ENSEMBL annotation, there are all types of RNA (mRNA, tRNA, lncRNA,...). But for my research I've to compute the number of uniquely mapped reads crossing only protein-coding genes splice junction. Is there a way to download splice acceptor/donor position in hg19 for protein-coding genes in order to subtract the read count from the SJ.out.tab
file?
Thanks
I re-wrote that script to also include input via pipe, so there's no need for the intermediate files. The only thing to add is that it outputs 0-based offsets for use by HISAT internally - both GTF and SAM are 1-based formats, so perhaps take that into account.
Thanks Devon. That's perfect. I extract the splice sites and overlapped them with my data using R.