Hi,
I am using RSeQC to assess the quality of my ONT long-read RNA-seq data, specifically the junction_saturation.py
module.
I converted the Ensembl gene annotation GTF file to BED format:
- Homo_sapiens.GRCh38.112.gtf
- Homo_sapiens.GRCh38.112.bed
I then ran the junction_saturation.py module. The first line of output was:
reading reference bed file: /Users/mattmorgan/Documents/RNAseq/Cam_Oct/Homo_sapiens.GRCh38.112.bed ... Done! Total 404168 known splicing junctions
The last line of output for this specific BAM file was:
sampling 100% (5320705) splicing reads. 145545 splicing junctions. 57335 known splicing junctions. 88210 novel splicing junctions.
The number of known junctions appears to plateau at high percentages of total reads, with the maximum number of known junctions in this file trending towards ~60,000.
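For reference, here is a quick sketch of the arithmetic behind that comparison. It is plain Python, the variable names are my own, and the regexes simply assume the log format quoted above; it is not part of RSeQC.

```python
import re

# The two log lines quoted above, copied verbatim.
reference_line = (
    "reading reference bed file: /Users/mattmorgan/Documents/RNAseq/Cam_Oct/"
    "Homo_sapiens.GRCh38.112.bed ... Done! Total 404168 known splicing junctions"
)
final_line = (
    "sampling 100% (5320705) splicing reads. 145545 splicing junctions. "
    "57335 known splicing junctions. 88210 novel splicing junctions."
)

# Total junctions present in the annotation (the BED file).
annotated_total = int(
    re.search(r"Total (\d+) known splicing junctions", reference_line).group(1)
)

# Junction counts observed at 100% of the splicing reads.
match = re.search(
    r"(\d+) splicing junctions\. (\d+) known splicing junctions\. "
    r"(\d+) novel splicing junctions",
    final_line,
)
detected, known, novel = map(int, match.groups())

# Share of annotated junctions recovered in this sample.
fraction_annotated_recovered = known / annotated_total
# Share of detected junctions that are not in the annotation.
fraction_novel = novel / detected

print(f"annotated junctions recovered: {fraction_annotated_recovered:.1%}")
print(f"detected junctions that are novel: {fraction_novel:.1%}")
```

So only about 14% of the annotated junctions are recovered, and about 61% of the junctions detected in this file are classed as novel.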
Given that the RSeQC instructions say:
All (annotated) splice junctions should be rediscovered from saturated RNA-seq data
and the number of known junctions detected in this file is far lower (57335 of 404168), does this indicate a problem with this sequencing file?
Or is it simply that not all transcripts are expressed (or expressed at high enough levels to be detected), so the 404168 known splice junctions are a theoretical maximum far higher than what you would actually see? Or am I missing something else?
Thanks!