I have direct RNA data mapped to the GENCODE transcriptome with minimap2. Finding the 'true' transcript of origin for a read is nontrivial, as there are many secondary alignments with alignment scores very close to the primary. After visualising the alignments, I can see that some are to transcripts which start further 3' than my alignment. However, due to the mechanism of direct RNA sequencing, the 3' end of a read is the true end site.
I want to discard alignments to transcripts that have a 3' start site more than 100 nt prior to my read start site.
I've thought about simply extracting the TES from the GENCODE GTF, but these are genomic coordinates and I need to work with the transcriptome mapping. Another approach I've been considering is to discard an alignment if its query end site is more than 100 nt from my read end site. But I am not sure how to do this. Any ideas? Thanks.
Did you end up solving this issue? I am facing it now... direct RNA sequencing is tough!
The problem looks simple, but I would need an example BAM with a few reads to test.
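In the meantime, here is a rough sketch of how the filter could look in Python. It has not been tried on real data and rests on several assumptions. The input is SAM text with `@SQ` header lines (for example from `samtools view -h`), so transcript lengths can be read from the `LN` tags. Reverse-strand alignments are dropped, on the assumption that direct RNA reads should match the transcript in the forward orientation. "Too far from the 3' end" is interpreted as either more than 100 clipped read bases after the alignment, or more than 100 transcript bases left after it. The names `three_prime_gaps` and `filter_sam` are made up for illustration.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR ops that advance along the transcript


def three_prime_gaps(pos, cigar, transcript_len):
    """Return (clipped read bases at the 3' end,
    transcript bases remaining after the alignment end).
    pos is the 1-based SAM POS of a forward-strand alignment."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    ref_span = sum(n for n, op in ops if op in REF_CONSUMING)
    ref_end = pos + ref_span - 1  # 1-based, last aligned transcript base
    clipped = 0
    for n, op in reversed(ops):  # trailing soft/hard clips only
        if op not in "SH":
            break
        clipped += n
    return clipped, transcript_len - ref_end


def filter_sam(lines, max_dist=100):
    """Keep header lines and forward-strand alignments whose 3' end
    lies within max_dist of both the read end and the transcript end."""
    lengths, kept = {}, []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("@"):
            if fields[0] == "@SQ":
                tags = dict(f.split(":", 1) for f in fields[1:])
                lengths[tags["SN"]] = int(tags["LN"])
            kept.append(line)
            continue
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        # Assumption: skip unmapped (0x4) and reverse-strand (0x10) records.
        if flag & 0x4 or flag & 0x10 or cigar == "*":
            continue
        clipped, overhang = three_prime_gaps(pos, cigar, lengths[rname])
        if clipped <= max_dist and overhang <= max_dist:
            kept.append(line)
    return kept
```

To use it on a file, something like `samtools view -h in.bam | python filter.py` with `sys.stdout.writelines(filter_sam(sys.stdin))` at the bottom of the script should work. The 100 nt threshold is the one from the question, and either of the two checks can be removed if only one criterion is wanted.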
are u the dude from jvarkit? cuz if so, nice work bro